Netdev List


* [PATCH] net: airoha: Clean up RX queues in airoha_dev_stop
From: Wayen Yan @ 2026-06-16 10:50 UTC (permalink / raw)
  To: netdev
  Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek

When the last port is stopped, airoha_dev_stop() clears TX queues
but neglects to clean up RX queues. This can lead to:
- RX ring buffer descriptors remaining valid after device close
- Potential DMA synchronization issues on device reopen
- Risk of use-after-free if pages are freed while DMA is still active

Add a cleanup loop for the RX queues that mirrors the TX queue cleanup,
ensuring symmetric resource management.

Fixes: 20bf7d07c956 ("net: airoha: add QDMA support for Airoha EN7581 Ethernet")
Signed-off-by: Wayen Yan <win847@gmail.com>
---
 drivers/net/ethernet/airoha/airoha_eth.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 31cdb11cd7..9ca5bbf64d 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -1771,6 +1771,13 @@ static int airoha_dev_stop(struct net_device *dev)
 
 			airoha_qdma_cleanup_tx_queue(&qdma->q_tx[i]);
 		}
+
+		for (i = 0; i < ARRAY_SIZE(qdma->q_rx); i++) {
+			if (!qdma->q_rx[i].ndesc)
+				continue;
+
+			airoha_qdma_cleanup_rx_queue(&qdma->q_rx[i]);
+		}
 	}
 
 	return 0;
-- 
2.51.0
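
The symmetric cleanup described in the commit message can be sketched in user space as follows. This is only an illustration under stated assumptions: `fake_queue`, `fake_qdma`, `NUM_QUEUES` and the `pending` field are hypothetical stand-ins, not the driver's real structures.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_QUEUES 4	/* hypothetical queue count */

/* Hypothetical stand-in for a QDMA queue: ndesc == 0 means "not set up". */
struct fake_queue {
	int ndesc;
	int pending;	/* descriptors still owned by the (simulated) hardware */
};

struct fake_qdma {
	struct fake_queue q_tx[NUM_QUEUES];
	struct fake_queue q_rx[NUM_QUEUES];
};

/* Drain every configured queue, skipping the ones that were never set up. */
static void cleanup_queues(struct fake_queue *q, size_t n)
{
	size_t i;

	assert(q != NULL);
	for (i = 0; i < n; i++) {
		if (!q[i].ndesc)
			continue;
		q[i].pending = 0;
	}
}

/* Stop path: TX and RX are cleaned up the same way. */
static void fake_dev_stop(struct fake_qdma *qdma)
{
	cleanup_queues(qdma->q_tx, NUM_QUEUES);
	cleanup_queues(qdma->q_rx, NUM_QUEUES);
}
```

After `fake_dev_stop()` no configured queue, TX or RX, has descriptors left pending.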



^ permalink raw reply related

* [PATCH] net: airoha: Stop TX queues on error path in airoha_dev_open
From: Wayen Yan @ 2026-06-16 10:50 UTC (permalink / raw)
  To: netdev
  Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek

In airoha_dev_open(), if airoha_set_vip_for_gdm_port() fails after
netif_tx_start_all_queues() has been called, the TX queues remain
started while the device configuration is incomplete. This leaves
the device in an inconsistent state where packets could be
transmitted before the VIP/IFC port configuration is complete.

Add netif_tx_stop_all_queues() call on the error path to properly
roll back the TX queue state.

Fixes: 20bf7d07c956 ("net: airoha: add QDMA support for Airoha EN7581 Ethernet")
Signed-off-by: Wayen Yan <win847@gmail.com>
---
 drivers/net/ethernet/airoha/airoha_eth.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 31cdb11cd7..cf9c366907 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -1715,8 +1715,10 @@ static int airoha_dev_open(struct net_device *dev)
 
 	netif_tx_start_all_queues(dev);
 	err = airoha_set_vip_for_gdm_port(port, true);
-	if (err)
+	if (err) {
+		netif_tx_stop_all_queues(dev);
 		return err;
+	}
 
 	if (netdev_uses_dsa(dev))
 		airoha_fe_set(qdma->eth, REG_GDM_INGRESS_CFG(port->id),
-- 
2.51.0
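
The rollback on the error path can be modelled with a small user-space sketch. The names `fake_dev`, `fake_set_vip()` and `fake_open()` are hypothetical and only mirror the shape of the change; they are not the driver's API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model: just enough state to show the rollback. */
struct fake_dev {
	bool tx_started;
	bool vip_configured;
	bool vip_should_fail;	/* test knob: make the configuration step fail */
};

/* Stand-in for the port configuration step that may fail. */
static int fake_set_vip(struct fake_dev *dev)
{
	if (dev->vip_should_fail)
		return -EIO;
	dev->vip_configured = true;
	return 0;
}

/* Open path: start TX, configure, and undo the start if configuring fails. */
static int fake_open(struct fake_dev *dev)
{
	int err;

	assert(dev != NULL);
	dev->tx_started = true;
	err = fake_set_vip(dev);
	if (err) {
		dev->tx_started = false;	/* roll back, as the patch does */
		return err;
	}
	return 0;
}
```

On failure the caller gets the error back and the TX state is the same as before the call.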



^ permalink raw reply related

* [PATCH net] dpaa2-switch: fix VLAN upper check not rejecting bridge join
From: Ioana Ciornei @ 2026-06-16 10:54 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, netdev
  Cc: f.fainelli, vladimir.oltean, linux-kernel

The blamed commit refactored the prechangeupper event handling but
failed to actually return an error in case
dpaa2_switch_prevent_bridging_with_8021q_upper() detected an 802.1q upper
on a port which tries to join a bridge. Fix this by returning err
instead of 0.

Fixes: 45035febc495 ("net: dpaa2-switch: refactor prechangeupper sanity checks")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
---
 drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
index 52c1cb9cb7e0..46ae81c2fa01 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
@@ -2177,7 +2177,7 @@ dpaa2_switch_prechangeupper_sanity_checks(struct net_device *netdev,
 	if (err) {
 		NL_SET_ERR_MSG_MOD(extack,
 				   "Cannot join a bridge while VLAN uppers are present");
-		return 0;
+		return err;
 	}
 
 	netdev_for_each_lower_dev(upper_dev, other_dev, iter) {
-- 
2.25.1


^ permalink raw reply related

* [PATCH] net: airoha: Fix QoS counter configuration for Tx-fwd channels
From: Wayen Yan @ 2026-06-16 10:50 UTC (permalink / raw)
  To: netdev
  Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek

In airoha_qdma_init_qos_stats(), the Tx-fwd counter was incorrectly
using register index (i << 1) instead of ((i << 1) + 1). This caused
the Tx-fwd configuration to overwrite the Tx-cpu configuration for
each QoS channel, resulting in incorrect QoS statistics.

Fix by using the correct register index ((i << 1) + 1) for Tx-fwd
counter configuration.

Fixes: 20bf7d07c956 ("net: airoha: add QDMA support for Airoha EN7581 Ethernet")
Signed-off-by: Wayen Yan <win847@gmail.com>
---
 drivers/net/ethernet/airoha/airoha_eth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 31cdb11cd7..329988a840 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -1256,7 +1256,7 @@ static void airoha_qdma_init_qos_stats(struct airoha_qdma *qdma)
 			       FIELD_PREP(CNTR_CHAN_MASK, i));
 		/* Tx-fwd transferred count */
 		airoha_qdma_wr(qdma, REG_CNTR_VAL((i << 1) + 1), 0);
-		airoha_qdma_wr(qdma, REG_CNTR_CFG(i << 1),
+		airoha_qdma_wr(qdma, REG_CNTR_CFG((i << 1) + 1),
 			       CNTR_EN_MASK | CNTR_ALL_QUEUE_EN_MASK |
 			       CNTR_ALL_DSCP_RING_EN_MASK |
 			       FIELD_PREP(CNTR_SRC_MASK, 1) |
-- 
2.51.0
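
The index mix-up is easy to see in a small model. In this sketch each QoS channel owns two counter slots, an even one for Tx-cpu and an odd one for Tx-fwd; `NUM_QOS_CHANNELS`, the `cntr_src` enum and the `buggy` switch are assumptions made for illustration and do not come from the driver.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_QOS_CHANNELS 4			/* hypothetical channel count */
#define NUM_COUNTERS (NUM_QOS_CHANNELS * 2)	/* two counters per channel */

enum cntr_src { CNTR_UNSET = 0, CNTR_TX_CPU, CNTR_TX_FWD };

/* Even slot: Tx-cpu counter of channel `chan`. */
static size_t tx_cpu_idx(size_t chan) { return chan << 1; }

/* Odd slot: Tx-fwd counter of channel `chan`. */
static size_t tx_fwd_idx(size_t chan) { return (chan << 1) + 1; }

/*
 * Fill the configuration table. With buggy != 0 the Tx-fwd configuration
 * is written to the Tx-cpu slot, which is the pre-patch behaviour.
 */
static void init_qos_cfg(enum cntr_src *cfg, int buggy)
{
	size_t i, chan;

	assert(cfg != NULL);
	for (i = 0; i < NUM_COUNTERS; i++)
		cfg[i] = CNTR_UNSET;

	for (chan = 0; chan < NUM_QOS_CHANNELS; chan++) {
		cfg[tx_cpu_idx(chan)] = CNTR_TX_CPU;
		cfg[buggy ? tx_cpu_idx(chan) : tx_fwd_idx(chan)] = CNTR_TX_FWD;
	}
}
```

With the buggy indexing every even slot ends up holding the Tx-fwd configuration and every odd slot stays unconfigured; with the fixed indexing both slots of each channel are configured.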



^ permalink raw reply related

* [PATCH 5.10/5.15/6.1/6.6/6.12/6.18] tap: free page on error paths in tap_get_user_xdp()
From: Nazar Kalashnikov @ 2026-06-16  9:02 UTC (permalink / raw)
  To: stable, Greg Kroah-Hartman
  Cc: Nazar Kalashnikov, Willem de Bruijn, Jason Wang, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Dongli Zhang, netdev,
	linux-kernel, bpf, Si-Wei Liu, Willem de Bruijn, lvc-project,
	Xiang Mei, Weiming Shi

From: Weiming Shi <bestswngs@gmail.com>

commit 3bcf7aec6a9d16438f2cec29f5d7c8d5b8edf9b2 upstream.

tap_get_user_xdp() rejects a frame shorter than ETH_HLEN with -EINVAL,
and returns -ENOMEM when build_skb() fails. Both paths jump to the err
label without freeing the page that vhost_net_build_xdp() allocated for
the frame. tap_sendmsg() discards the per-buffer return value and always
returns 0, so vhost_tx_batch() takes the success path and never frees
the page; each rejected frame in a batch leaks one page-frag chunk.

Free the page on both error paths, before the skb is built. This is the
tap counterpart of the same leak in tun_xdp_one().

Fixes: 0efac27791ee ("tap: accept an array of XDP buffs through sendmsg()")
Fixes: ed7f2afdd0e0 ("tap: add missing verification for short frame")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260521163230.1478627-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Nazar Kalashnikov <nazarkalashnikov0@gmail.com>
---
Backport fix for CVE-2026-46320
 drivers/net/tap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 6fd3b14273b3..b51ce7af1b20 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1052,6 +1052,7 @@ static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 	int err, depth;
 
 	if (unlikely(xdp->data_end - xdp->data < ETH_HLEN)) {
+		put_page(virt_to_head_page(xdp->data));
 		err = -EINVAL;
 		goto err;
 	}
@@ -1061,6 +1062,7 @@ static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 
 	skb = build_skb(xdp->data_hard_start, buflen);
 	if (!skb) {
+		put_page(virt_to_head_page(xdp->data));
 		err = -ENOMEM;
 		goto err;
 	}
-- 
2.47.3
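
The ownership rule behind this fix can be sketched outside the kernel. In the model below a page carries one reference that the callee must either hand over to the skb or drop; `fake_page`, `fake_put_page()` and `fake_get_user_xdp()` are hypothetical names, and the `build_ok` flag simply simulates whether build_skb() succeeds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_ETH_HLEN 14	/* Ethernet header length in bytes */

/* Hypothetical page model: only the reference count matters here. */
struct fake_page {
	int refcount;
};

static void fake_put_page(struct fake_page *page)
{
	assert(page != NULL && page->refcount > 0);
	page->refcount--;
}

/*
 * Returns 0 when the frame is accepted; the page reference then belongs to
 * the (simulated) skb. On both error paths the reference is dropped here,
 * because the caller will not free the page itself.
 */
static int fake_get_user_xdp(struct fake_page *page, size_t frame_len,
			     bool build_ok)
{
	if (frame_len < SKETCH_ETH_HLEN) {
		fake_put_page(page);
		return -EINVAL;
	}

	if (!build_ok) {
		fake_put_page(page);
		return -ENOMEM;
	}

	return 0;
}
```

Without the two `fake_put_page()` calls the reference count would stay at one on each rejected frame, which corresponds to the per-frame leak described in the commit message.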

^ permalink raw reply related

* [PATCH] ice: retry reading NVM if admin queue returns EBUSY
From: Robert Malz @ 2026-06-16 10:45 UTC (permalink / raw)
  To: anthony.l.nguyen, przemyslaw.kitszel; +Cc: intel-wired-lan, netdev

When the admin queue command to read NVM returns EBUSY, the driver
currently treats it as a fatal error and aborts the entire read
operation. This can cause spurious NVM read failures during periods of
high firmware activity.

Add retry logic to ice_read_flat_nvm() that handles EBUSY responses
from the admin queue. When an EBUSY error is encountered, release the
NVM resource lock, wait for ICE_SQ_SEND_DELAY_TIME_MS, re-acquire it,
and retry the failed read. The retry is attempted up to
ICE_SQ_SEND_MAX_EXECUTE times before giving up.

The code was extracted from the OOT ice driver 1.15.4 release. An
additional change was made to reset last_cmd on retry, to make sure
that all commands are retried properly.

Fixes: e94509906d6b ("ice: create function to read a section of the NVM and Shadow RAM")
Signed-off-by: Robert Malz <robert.malz@canonical.com>
---
 drivers/net/ethernet/intel/ice/ice_nvm.c | 25 +++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_nvm.c b/drivers/net/ethernet/intel/ice/ice_nvm.c
index 7e187a804dfa..cbe21ef9d18e 100644
--- a/drivers/net/ethernet/intel/ice/ice_nvm.c
+++ b/drivers/net/ethernet/intel/ice/ice_nvm.c
@@ -67,6 +67,7 @@ ice_read_flat_nvm(struct ice_hw *hw, u32 offset, u32 *length, u8 *data,
 {
 	u32 inlen = *length;
 	u32 bytes_read = 0;
+	int retry_cnt = 0;
 	bool last_cmd;
 	int status;
 
@@ -96,11 +97,25 @@ ice_read_flat_nvm(struct ice_hw *hw, u32 offset, u32 *length, u8 *data,
 					 offset, read_size,
 					 data + bytes_read, last_cmd,
 					 read_shadow_ram, NULL);
-		if (status)
-			break;
-
-		bytes_read += read_size;
-		offset += read_size;
+		if (status) {
+			if (hw->adminq.sq_last_status != ICE_AQ_RC_EBUSY ||
+			    retry_cnt > ICE_SQ_SEND_MAX_EXECUTE)
+				break;
+			ice_debug(hw, ICE_DBG_NVM,
+				  "NVM read EBUSY error, retry %d\n",
+				  retry_cnt + 1);
+			last_cmd = false;
+			ice_release_nvm(hw);
+			msleep(ICE_SQ_SEND_DELAY_TIME_MS);
+			status = ice_acquire_nvm(hw, ICE_RES_READ);
+			if (status)
+				break;
+			retry_cnt++;
+		} else {
+			bytes_read += read_size;
+			offset += read_size;
+			retry_cnt = 0;
+		}
 	} while (!last_cmd);
 
 	*length = bytes_read;
-- 
2.34.1
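
The retry scheme can be sketched as a bounded loop in plain C. Everything below is an illustration under assumptions: `MAX_RETRIES` stands in for ICE_SQ_SEND_MAX_EXECUTE, `read_chunk_fn` is a hypothetical callback replacing the admin queue read, and the lock release, sleep and re-acquire steps are reduced to a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_RETRIES 3	/* hypothetical stand-in for ICE_SQ_SEND_MAX_EXECUTE */

/* Hypothetical chunk reader: returns 0 on success or a negative errno. */
typedef int (*read_chunk_fn)(void *ctx, size_t offset, size_t len);

/* Read `total` bytes in `chunk`-sized pieces, retrying a piece on -EBUSY. */
static int read_all(read_chunk_fn read_chunk, void *ctx, size_t total,
		    size_t chunk, size_t *bytes_read)
{
	size_t offset = 0;
	int retries = 0;
	int status = 0;

	assert(read_chunk != NULL && bytes_read != NULL && chunk > 0);

	while (offset < total) {
		size_t len = total - offset < chunk ? total - offset : chunk;

		status = read_chunk(ctx, offset, len);
		if (status == -EBUSY && retries < MAX_RETRIES) {
			/* The patch releases the NVM lock, sleeps and
			 * re-acquires the lock at this point. */
			retries++;
			continue;
		}
		if (status)
			break;

		offset += len;
		retries = 0;	/* a successful chunk resets the budget */
	}

	*bytes_read = offset;
	return status;
}

/* Test double: answers -EBUSY `busy_left` times, then succeeds. */
struct flaky {
	int busy_left;
	int calls;
};

static int flaky_read(void *ctx, size_t offset, size_t len)
{
	struct flaky *f = ctx;

	(void)offset;
	(void)len;
	f->calls++;
	if (f->busy_left > 0) {
		f->busy_left--;
		return -EBUSY;
	}
	return 0;
}
```

The sketch caps retries at exactly `MAX_RETRIES` per chunk. The patch expresses its bound as `retry_cnt > ICE_SQ_SEND_MAX_EXECUTE`, so the two limits are not necessarily identical.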


^ permalink raw reply related

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Sebastian Andrzej Siewior @ 2026-06-16 10:35 UTC (permalink / raw)
  To: Jakub Kicinski, Petr Mladek, John Ogness, Sergey Senozhatsky,
	Peter Zijlstra
  Cc: Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260611191114.5bc43a59@kernel.org>

On 2026-06-11 19:11:14 [-0700], Jakub Kicinski wrote:
> On Wed, 10 Jun 2026 11:36:21 -0700 Vlad Poenaru wrote:
> > @@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
> > +	local_bh_disable();
> > + 	poll_napi(dev);
> > +	_local_bh_enable();
> 
> tglx, Sebastian, are you okay with using _local_bh_enable() to trick
> softirq into not waking ksoftirqd? The problematic path is:
> 
>   scheduler -> printk -> netconsole -> raise softirq -> scheduler (deadlock)
> 
> so the softirq may never get serviced.
> 
> In netcons we try to avoid touching the network driver if the Tx path
> locks are already held. Ideally we'd do something similar with the
> scheduler. Try to do bare minimum if we may be in the scheduler.
> Failing that - don't poll the driver if we were called with irqs
> already disabled.
> 
> Or maybe we only poll from console->write_thread ?

So this is not an issue since commit 7eab73b18630e ("netconsole: convert
to NBCON console infrastructure"), because from that commit on writes are
deferred to the nbcon thread. So this is purely about -stable in this case.

Looking at the patch, the amount of comments vs code changes looks
somewhat hackish. That ifdef for PREEMPT_RT is not needed because on
PREEMPT_RT we have either nbcon or the legacy console (including
netconsole before the mentioned commit) wrapped in a dedicated thread
(via force_legacy_kthread()).
That means in both cases the flow never ends there and the problem is
limited to !PREEMPT_RT.

Now. The scheduler usually does printk_deferred() because of the rq lock
so it does not deadlock for various reasons. It is kind of a pity that
the various WARN macros don't do that.
I don't think that patch is enough. It works around the problem in this
scenario but should the NIC driver invoke schedule_work() then we are
back here again.
Should the network driver acquire a lock then lockdep might observe
rq -> driver-lock and then driver-lock -> rq and yell deadlock (CPU1
doing AB and CPU2 doing BA). This also includes other console drivers, so
it is not limited to netconsole.

Point being made is that we should avoid the callchain:

|  console_unlock
|  vprintk_emit
|  __warn
|  __enqueue_entity                // WARN_ON_ONCE() here -- rq->lock held
|  put_prev_entity
|  put_prev_task_fair
|  __schedule

basically a printk under the rq lock.

We could add printk_deferred_enter/exit() to all the rq_lock() variants.
I think PeterZ loves this the most. And Greg will appreciate it too
while backporting because of all the context changes.

We could also introduce WARN_ON_DEFERRED + variants which do the
printk_deferred_enter/exit() thingy around the printk, and replace
all the WARNs in kernel/sched/ with them.
I *think* the tty/console layer has also a deadlock problem where it
holds locks and then the WARN(), that never triggers, asks for the same
locks again so we might have a second user…

Adding sched and printk folks for opinions while eyeballing
WARN_ON_DEFERRED().

Sebastian

^ permalink raw reply

* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Pedro Falcato @ 2026-06-16 10:28 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: rizzo.unipi, m.szyprowski, robin.murphy, willemb, kuniyu, davem,
	edumazet, kuba, pabeni, gregkh, rafael, akpm, david, netdev,
	linux-mm, iommu, driver-core, linux-kernel,
	Jesper Dangaard Brouer, Ilias Apalodimas
In-Reply-To: <CAMOZA0+L2+=FEQ5ORvv07JaJix0R+6Q6u01CyMKCbd842To9nA@mail.gmail.com>

On Tue, Jun 16, 2026 at 11:48:36AM +0200, Luigi Rizzo wrote:
> On Tue, Jun 16, 2026 at 11:20 AM Pedro Falcato <pfalcato@suse.de> wrote:
> >
> > (+cc page pool maintainers)
> > On Mon, Jun 15, 2026 at 11:42:20PM +0000, Luigi Rizzo wrote:
> > > The use of swiotlb causes an extra data copy on I/O.  For tx sockets,
> > > especially with greedy senders, this has a high chance of happening in
> > > the softirq handler for tx network interrupts, creating a significant
> > > performance bottleneck.
> > >
> > > Allow tx sockets to allocate socket buffers directly from the bounce
> > > buffers. This avoids the second copy and removes the above bottleneck.
> > > The fraction of swiotlb buffers allowed for this feature is set with
> > >    /sys/module/swiotlb/parameters/zerocopy_tx_percent
> > > (0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
> > >
> > > Implementation:
> > > - define a new page type to unambiguously identify bounce buffers used
> > >   as backing storage for socket buffers
> > > - modify skb_page_frag_refill to perform the modified allocation
> > > - modify the destructors __free_frozen_pages(), free_unref_folio() to
> > >   handle those pages and return them to the pool.
> > >
> > > The savings are especially visible with fewer queues. In synthetic
> > > benchmarks, senders with 1-2 queues would cap around 50Gbps with
> > > conventional swiotlb, and reach over 170Gbps with the feature enabled.
> >
> > I could be wrong, but I genuinely think that the way to go about this is
> > using page_pool for regular TX as well. page_pool pages are all dma-mapped
> > (so whatever swiotlb optimization you want can be done there), and the net
> > stack already has awareness of these special pages and special skbs, so it
> > won't Just Return Them back to the page allocator.
> 
> I am not sure I follow your comment above, can you expand/clarify?
> 
> The problem I am dealing with is that the copy from the socket buffer
> to the bounce buffer is done in the device xmit function. Under high
> load it is almost always done by the tx softirq.
> This means that even if we move the copy outside the HARD_TX_LOCK(),
> it would still be almost completely serialized.
> Hence the proposed method to make skb_page_frag_refill() allocate
> a bounce buffer directly (under specific conditions) so there is a single copy
> done directly to the dma-able buffer, and it is done in the user threads/CPUs
> and is not serialized in the softirq thread.
> 
> I am not sure how page_pool on tx could help here.

Page pool would provide both the means of passing around an iommu-mapped page,
and a concrete "this is where we allocate these pages" spot. Then introducing
a "zero-copy" swiotlb allocation would be a simple matter of introducing this
on page pool's side. In pseudo-code, something like:

static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
						 gfp_t gfp)
{
	struct page *page;

	gfp |= __GFP_COMP;
	
	if (pool->dma_map && /* is_swiotlb */) {
		page = swiotlb_alloc_pages(pool->p.nid, gfp, pool->p.order, ...);
		if (!page)
			return NULL;
		/* page is implicitly swiotlb mapped (well, _actually_ it's
		 * not that simple, because of the dma_mapped tracking that
		 * was introduced, but PoC anyway..). */
	} else {
		page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
		if (unlikely(!page))
			return NULL;

		if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) {
			put_page(page);
			return NULL;
		}
	}

	return page;
}

(plus other spots, obviously). No copying should be required, and the
netmem desc will keep the dma_addr around. The network stack will notice
pp_recycle on all of these skbs and simply refuse to throw the pages away to
the page allocator.

In any case, it might be that this is not feasible for XYZ reasons, but I've
thought about this (making net use and reuse page pool pre-iommu-mapped pages
exclusively) for a while and I definitely see a lot of similarities with your
problem (that more or less reduces down to "I want to get an iommu-mapped page
from the get-go").

-- 
Pedro

^ permalink raw reply

* RE: [PATCH net-next v2] net: dsa: Fix skb ownership in taggers
From: Wei Fang @ 2026-06-16 10:19 UTC (permalink / raw)
  To: Linus Walleij
  Cc: netdev@vger.kernel.org, Sashiko AI Review, Andrew Lunn,
	Vladimir Oltean, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Florian Fainelli, Jonas Gorski,
	Hauke Mehrtens, Kurt Kanzenbach, Woojung Huh,
	UNGLinuxDriver@microchip.com, Chester A. Unal, Daniel Golle,
	Matthias Brugger, AngeloGioacchino Del Regno, Clark Wang,
	Clément Léger, George McCollister, David Yang
In-Reply-To: <20260616-dsa-fix-free-skb-v2-1-9dbda6a19e97@kernel.org>

> The tag_8021q.c tagger calls vlan_insert_tag() in dsa_8021q_xmit().
> vlan_insert_tag() will consume the skb with kfree_skb() on failure
> and return NULL.
> 
> When NULL is returned as error code to ->xmit() in dsa_user_xmit()
> it will free the same skb again leading to a double-free.
> 
> The idea of dsa_user_xmit() and dsa_switch_rcv() dropping the skb
> they held before the call to ->xmit() and ->rcv() is conceptually
> wrong: the pattern elsewhere in the networking code is that consumers
> drop their skb:s on failure.
> 
> Modify the ->xmit() and ->rcv() call sites to not drop the SKB if
> the taggers return NULL from any of these calls. Move those drops into
> the taggers so every callback error path that retains ownership consumes
> the skb before returning NULL.
> 
> Keep the existing helper ownership rules: VLAN insertion helpers already
> free on failure (this is the case in tag_8021q.c), while deferred
> transmit paths either transfer the skb reference to worker context or
> hold a worker reference with skb_get() and drop the caller's reference.
> 
> For SJA1105 meta RX, transfer the buffered stampable skb under the meta
> lock and return NULL while the skb is waiting for its meta frame: the
> skb is not dropped in this case.

Reviewed-by: Wei Fang <wei.fang@nxp.com> # netc
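
The ownership convention described in the quoted commit message (the callee consumes the skb on failure and the caller must not free it again) can be sketched with a fake skb. `fake_skb`, `fake_kfree_skb()`, `tagger_xmit()` and `user_xmit()` are hypothetical names; the `freed` counter only exists so that a double free would be visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical skb model: counts how many times it was "freed". */
struct fake_skb {
	int freed;
};

static void fake_kfree_skb(struct fake_skb *skb)
{
	assert(skb != NULL);
	skb->freed++;
}

/*
 * Tagger-style callback: on failure it consumes the skb and returns NULL,
 * so NULL means "already dropped" to the caller.
 */
static struct fake_skb *tagger_xmit(struct fake_skb *skb, bool fail)
{
	if (fail) {
		fake_kfree_skb(skb);
		return NULL;
	}
	return skb;
}

/* Caller: must not free the skb again when the tagger returned NULL. */
static int user_xmit(struct fake_skb *skb, bool fail)
{
	struct fake_skb *nskb = tagger_xmit(skb, fail);

	if (!nskb)
		return -1;	/* dropped; the tagger already consumed skb */

	return 0;		/* nskb would be handed to the next layer */
}
```

If `user_xmit()` also freed the skb on a NULL return, `freed` would reach two, which is the double free the patch removes.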


^ permalink raw reply

* Re: [PATCH bpf] bpf, sockmap: fix lock inversion between stab->lock and sk_callback_lock
From: Jiayuan Chen @ 2026-06-16 10:17 UTC (permalink / raw)
  To: Sechang Lim, John Fastabend, Jakub Sitnicki
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Simon Horman, netdev, bpf,
	linux-kernel
In-Reply-To: <20260616091153.2966617-1-rhkrqnwk98@gmail.com>


On 6/16/26 5:11 PM, Sechang Lim wrote:
> sock_map_update_common() and __sock_map_delete() hold stab->lock and call
> sock_map_unref() -> sock_map_del_link() under it. sock_map_del_link() takes
> sk_callback_lock for write to stop the strparser and verdict, giving the
> lock order stab->lock -> sk_callback_lock.
>
> The opposite order comes from an SK_SKB stream parser. On RX,
> sk_psock_strp_data_ready() holds sk_callback_lock for read while running
> the parser. The verdict redirects the skb to egress, where a sched_cls


The commit message is wrong. A verdict does not redirect to egress
synchronously — sk_psock_skb_redirect() only queues the skb and
schedule_delayed_work()s sk_psock_backlog, so egress runs in workqueue
context, not under sk_callback_lock.


> program calls bpf_map_delete_elem() on a sockmap, which takes stab->lock:
>
>    WARNING: possible circular locking dependency detected
>    7.1.0-rc6 Not tainted
>    ------------------------------------------------------
>    syz.9.8824 is trying to acquire lock:
>    (&stab->lock){+.-.}-{3:3}, at: __sock_map_delete net/core/sock_map.c:421
>    but task is already holding lock:
>    (clock-AF_INET){++.-}-{3:3}, at: sk_psock_strp_data_ready net/core/skmsg.c:1173
>
>    -> #1 (clock-AF_INET){++.-}-{3:3}:
>           _raw_write_lock_bh
>           sock_map_del_link net/core/sock_map.c:167
>           sock_map_unref net/core/sock_map.c:184
>           sock_map_update_common net/core/sock_map.c:509
>           sock_map_update_elem_sys net/core/sock_map.c:588
>           map_update_elem kernel/bpf/syscall.c:1805
>
>    -> #0 (&stab->lock){+.-.}-{3:3}:
>           _raw_spin_lock_bh
>           __sock_map_delete net/core/sock_map.c:421
>           sock_map_delete_elem net/core/sock_map.c:452
>           bpf_prog_06044d24140080b6
>           tcx_run net/core/dev.c:4451
>           sch_handle_egress net/core/dev.c:4541
>           __dev_queue_xmit net/core/dev.c:4808
>           ...
>           tcp_bpf_strp_read_sock net/ipv4/tcp_bpf.c:701


I guess it is an ACK. What is the actual purpose of a sched_cls program
calling sockmap delete on the TX path of an ACK? If there is no real use
case for it, this is just broken BPF usage, not a kernel bug worth this
change.



^ permalink raw reply

* [PATCH] net: airoha: fix foe_check_time allocation size
From: Wayen Yan @ 2026-06-16  9:49 UTC (permalink / raw)
  To: netdev
  Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek

foe_check_time is declared as a u16 pointer but was allocated with
only ppe_num_entries bytes instead of ppe_num_entries * sizeof(u16).

When airoha_ppe_foe_verify_entry() is called with hash >= ppe_num_entries/2,
it writes beyond the allocated buffer, causing a heap buffer overflow and
a potential kernel crash.

Signed-off-by: Wayen Yan <win847@gmail.com>
---
 drivers/net/ethernet/airoha/airoha_ppe.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/airoha/airoha_ppe.c b/drivers/net/ethernet/airoha/airoha_ppe.c
index 5c9dff6bcc..8fb8ecf909 100644
--- a/drivers/net/ethernet/airoha/airoha_ppe.c
+++ b/drivers/net/ethernet/airoha/airoha_ppe.c
@@ -1578,7 +1578,8 @@ int airoha_ppe_init(struct airoha_eth *eth)
 			return -ENOMEM;
 	}
 
-	ppe->foe_check_time = devm_kzalloc(eth->dev, ppe_num_entries,
+	ppe->foe_check_time = devm_kzalloc(eth->dev,
+					   ppe_num_entries * sizeof(*ppe->foe_check_time),
 					   GFP_KERNEL);
 	if (!ppe->foe_check_time)
 		return -ENOMEM;
-- 
2.51.0
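
The size arithmetic behind this overflow can be checked with a short user-space sketch. `u16_slots()` and `alloc_check_time()` are hypothetical helpers, and calloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* How many u16 entries fit into a buffer of `bytes` bytes. */
static size_t u16_slots(size_t bytes)
{
	return bytes / sizeof(uint16_t);
}

/* Allocate a zeroed array with room for `n` u16 entries (n * 2 bytes). */
static uint16_t *alloc_check_time(size_t n)
{
	assert(n > 0);
	return calloc(n, sizeof(uint16_t));
}
```

A buffer of N bytes holds only N / 2 u16 entries, so with the old allocation any index at or above half the entry count lands outside the buffer. In the kernel, an array helper such as devm_kcalloc() is another common way to express "count times element size"; whether it fits here is left to the patch author and reviewers.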



^ permalink raw reply related

* Re: [PATCH net-next 3/5] selftests/bpf: remove sockmap + ktls tests
From: Jakub Sitnicki @ 2026-06-16 10:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, bpf,
	john.fastabend, sd
In-Reply-To: <20260614014102.461064-4-kuba@kernel.org>

On Sat, Jun 13, 2026 at 06:40 PM -07, Jakub Kicinski wrote:
> The combination of sockmap and TLS is no longer supported - installing
> the TLS ULP on a sockmap socket (and vice versa) is now rejected. Remove
> the tests that exercise the combination along with their BPF program;
> the file covered nothing but sockmap sockets holding kTLS contexts.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  .../selftests/bpf/prog_tests/sockmap_ktls.c   | 355 ------------------
>  .../selftests/bpf/progs/test_sockmap_ktls.c   |  61 ---
>  tools/testing/selftests/bpf/test_sockmap.c    | 227 +----------
>  3 files changed, 1 insertion(+), 642 deletions(-)
>  delete mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_ktls.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
> index 6ed8e149e3d5..cda6b22cf759 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c

[...]

>  static void run_ktls_test(int family, int sotype)
>  {
>  	if (test__start_subtest("tls simple offload"))
>  		test_sockmap_ktls_offload(family, sotype);

Nit: We probably don't need to keep this one test around.
It tests pure kTLS and overlaps with selftests/net/tls.c.

> -	if (test__start_subtest("tls tx cork"))
> -		test_sockmap_ktls_tx_cork(family, sotype, false);
> -	if (test__start_subtest("tls tx cork with push"))
> -		test_sockmap_ktls_tx_cork(family, sotype, true);
> -	if (test__start_subtest("tls tx egress with no buf"))
> -		test_sockmap_ktls_tx_no_buf(family, sotype, true);
> -	if (test__start_subtest("tls tx with pop"))
> -		test_sockmap_ktls_tx_pop(family, sotype);
> -	if (test__start_subtest("tls verdict with tls rx"))
> -		test_sockmap_ktls_verdict_with_tls_rx(family, sotype);
>  }

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>

^ permalink raw reply

* [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Ilya Maximets @ 2026-06-16 10:03 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kees Cook, Gustavo A. R. Silva, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt, linux-kernel,
	linux-hardening, llvm, Ilya Maximets, Johan Thomsen

kmalloc_flex() in metadata_dst_alloc() sets __counted_by for the
structure to the options_len, which is then initialized to zero.
Later, we're initializing the structure by copying the tunnel info
together with the options, and this triggers a warning for a potential
memcpy overflow, since the compiler estimates that the options can't
fit into the structure, even though the memory for them is actually
allocated.

 memcpy: detected buffer overflow: 104 byte write of buffer size 96
 WARNING: CPU: X PID: Y at lib/string_helpers.c:1036 __fortify_report
  skb_tunnel_info_unclone+0x179/0x190
  geneve_xmit+0x7fe/0xe00

The issue is triggered when built with clang and source fortification.

Fix that by doing the copy in two stages: first - the main data with
the options_len, then the options.  This way the correct length should
be known at the time of the copy.

It would be better if the options_len never changed after allocation,
but the allocation code is a little separate from the initialization
and it would be awkward and potentially dangerous to return a struct
with options_len set to a non-zero value from the metadata_dst_alloc().

Another option would be to use ip_tunnel_info_opts_set(), but it is
doing too many unnecessary operations for the use case here.

Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Reported-by: Johan Thomsen <write@ownrisk.dk>
Closes: https://lore.kernel.org/netdev/CAKv6aAM8_EWgXScnKmKYm_4SwGDVBK++dzfP+Y6msUXbp99QUw@mail.gmail.com/
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---

Johan, if you can test this one in your setup as well, that would
be great.  Thanks.

 include/net/dst_metadata.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 1fc2fb03ce3f..f45d1e3163f0 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
 	if (!new_md)
 		return ERR_PTR(-ENOMEM);

-	memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
-	       sizeof(struct ip_tunnel_info) + md_size);
+	/* Copy in two stages to keep the __counted_by happy. */
+	new_md->u.tun_info = md_dst->u.tun_info;
+	memcpy(ip_tunnel_info_opts(&new_md->u.tun_info),
+	       ip_tunnel_info_opts(&md_dst->u.tun_info), md_size);
+
 #ifdef CONFIG_DST_CACHE
 	/* Unclone the dst cache if there is one */
 	if (new_md->u.tun_info.dst_cache.cache) {
-- 
2.54.0
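
The two-stage copy can be sketched in user space with a flexible array member. `tun_info_sketch` and `tun_info_clone()` are hypothetical and much smaller than the kernel structures; there is no `__counted_by` checking here, only the copy order that the fix relies on.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified tunnel info with trailing variable-size options. */
struct tun_info_sketch {
	uint32_t key;
	uint8_t options_len;
	uint8_t options[];	/* length is given by options_len */
};

/* Clone `src`, copying the fixed part first and the options second. */
static struct tun_info_sketch *tun_info_clone(const struct tun_info_sketch *src)
{
	struct tun_info_sketch *dst;

	assert(src != NULL);
	dst = calloc(1, sizeof(*dst) + src->options_len);
	if (!dst)
		return NULL;

	/* Stage 1: fixed-size part, which also sets dst->options_len. */
	*dst = *src;

	/* Stage 2: the options, copied once their length is known in dst. */
	memcpy(dst->options, src->options, src->options_len);

	return dst;
}
```

The order matters: the options are copied last, after the destination already carries the correct `options_len`, which is the property the patch relies on to keep the fortified memcpy check satisfied.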

^ permalink raw reply related

* Re: [PATCH 07/23] driver core: platform: provide platform_device_set_fwnode()
From: Bartosz Golaszewski @ 2026-06-16  9:51 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Lee Jones, Mark Brown, Thierry Reding, Sebastian Hesselbarth,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Srinivas Kandagatla, Greg Kroah-Hartman, Vinod Koul,
	Rafael J. Wysocki, Danilo Krummrich, Rob Herring, Saravana Kannan,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Andi Shyti, Joerg Roedel,
	Will Deacon, Robin Murphy, Doug Berger, Florian Fainelli,
	Broadcom internal kernel review list, Ulf Hansson, Frank Li,
	Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Matthew Brost, Thomas Hellström, Rodrigo Vivi, David Airlie,
	Simona Vetter, Peter Chen, Paul Cercueil, Bin Liu, Philipp Zabel,
	Maximilian Luz, Hans de Goede, Ilpo Järvinen,
	Krzysztof Kozlowski, Benjamin Herrenschmidt, linux-kernel, netdev,
	linux-arm-msm, linux-sound, driver-core, devicetree, linuxppc-dev,
	linux-i2c, iommu, linux-pm, imx, linux-arm-kernel, intel-xe,
	dri-devel, linux-usb, linux-mips, platform-driver-x86,
	Bartosz Golaszewski, Bartosz Golaszewski
In-Reply-To: <ajEcDq0S067wMFaK@black.igk.intel.com>

On Tue, 16 Jun 2026 11:49:02 +0200, Andy Shevchenko
<andriy.shevchenko@linux.intel.com> said:
> On Thu, Jun 04, 2026 at 05:32:27AM -0700, Bartosz Golaszewski wrote:
>> On Tue, 2 Jun 2026 23:41:53 +0200, Andy Shevchenko
>> <andriy.shevchenko@linux.intel.com> said:
>> > On Thu, May 21, 2026 at 10:36:30AM +0200, Bartosz Golaszewski wrote:
>> >> Provide a helper function encapsulating the logic of assigning firmware
>> >> nodes to platform devices created with platform_device_alloc(). Make the
>> >> kerneldoc state that this is the proper interface for assigning firmware
>> >> nodes to dynamically allocated platform devices. This will allow us to
>> >> switch to counting the references of the device's firmware nodes in the
>> >> future, not only the OF nodes.
>> >
>> > But why different for of_node and fwnode to begin with?!
>>
>> I'm not following. What are you suggesting?
>
> After re-reading of this thread, I think I'm suggesting the same what you have
> in plans to do in the future as you put it as "This will allow us to switch to
> counting the references of the device's firmware nodes in the future, not only
> the OF nodes."
>
> // Offtopic
> I haven't heard from you for more than a month on this:
> https://lore.kernel.org/r/af18zdP5HF3_P9Vo@black.igk.intel.com
> Anything should I do? Please, answer to that thread.
>

Eek, sorry, must have flown under the radar.

I'll pull it now, I will do a second PR for this merge window anyway.

Bart

^ permalink raw reply

* Re: [PATCH RFC 8/9] arm64: dts: qcom: shikra-cqs-evk: Enable ethernet0
From: Konrad Dybcio @ 2026-06-16  9:50 UTC (permalink / raw)
  To: Mohd Ayaan Anwar, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Richard Cochran, Bjorn Andersson, Konrad Dybcio,
	Maxime Coquelin, Alexandre Torgue, Russell King
  Cc: linux-arm-msm, netdev, devicetree, linux-kernel, linux-stm32,
	linux-arm-kernel
In-Reply-To: <20260612-shikra_ethernet-v1-8-f0f4a1d19929@oss.qualcomm.com>

On 6/11/26 8:37 PM, Mohd Ayaan Anwar wrote:
> Enable the first Gigabit Ethernet controller.  The board layout is
> identical to the CQM EVK.
> 
> Signed-off-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
> ---
>  arch/arm64/boot/dts/qcom/shikra-cqs-evk.dts | 119 ++++++++++++++++++++++++++++
>  1 file changed, 119 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/qcom/shikra-cqs-evk.dts b/arch/arm64/boot/dts/qcom/shikra-cqs-evk.dts
> index 26ff8007a819e46bbc9ffa3dddc6fee6530a4a7a..1f2e4f6dd7cca436f62ba9f09cd328e5a2079095 100644
> --- a/arch/arm64/boot/dts/qcom/shikra-cqs-evk.dts
> +++ b/arch/arm64/boot/dts/qcom/shikra-cqs-evk.dts
> @@ -7,6 +7,7 @@
>  
>  #include "shikra-cqm-som.dtsi"
>  #include "shikra-evk.dtsi"
> +#include <dt-bindings/net/ti-dp83867.h>
>  
>  / {
>  	model = "Qualcomm Technologies, Inc. Shikra CQS EVK";
> @@ -60,6 +61,92 @@ vreg_pmu_ch1: ldo4 {
>  	};
>  };
>  
> +&ethernet0 {
> +	status = "okay";

'status' should go last, with a \n before it

> +	phy-handle = <&ethphy0>;
> +	phy-mode = "rgmii-id";
> +
> +	pinctrl-names = "default";
> +	pinctrl-0 = <&ethernet0_defaults>;

property-n
property-names

in this order, please

[...]

> +&tlmm {
> +	ethernet0_defaults: ethernet0-defaults-state {

s/defaults/default

Please move this definition to shikra.dtsi

> +		rgmii-rx-pins {
> +			pins = "gpio121", "gpio122", "gpio123",
> +			       "gpio124", "gpio125", "gpio126";
> +			function = "rgmii";
> +			bias-disable;
> +			drive-strength = <16>;

Let's move drive-strength before bias (that's the order used in other
places)

> +		};
> +		rgmii-tx-pins {

Please separate subsequent subnodes with \n

> +			pins = "gpio127", "gpio128", "gpio129",
> +			       "gpio130", "gpio131", "gpio132";
> +			function = "rgmii";
> +			bias-pull-up;
> +			drive-strength = <16>;
> +		};
> +		rgmii-mdio-pins {
> +			pins = "gpio133", "gpio134";
> +			function = "rgmii";
> +			bias-pull-up;
> +			drive-strength = <16>;
> +		};

> +	};
> +
> +	emac0_phy_en_hog: emac0-phy-en-hog {
> +		gpio-hog;
> +		gpios = <149 GPIO_ACTIVE_HIGH>;
> +		output-high;
> +		line-name = "emac0-phy-en";
> +	};

This looks like a hack - what does this pin actually do?

Konrad

^ permalink raw reply

* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Luigi Rizzo @ 2026-06-16  9:48 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: rizzo.unipi, m.szyprowski, robin.murphy, willemb, kuniyu, davem,
	edumazet, kuba, pabeni, gregkh, rafael, akpm, david, netdev,
	linux-mm, iommu, driver-core, linux-kernel,
	Jesper Dangaard Brouer, Ilias Apalodimas
In-Reply-To: <ajESl4osXP7roz5q@pedro-suse.lan>

On Tue, Jun 16, 2026 at 11:20 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> (+cc page pool maintainers)
> On Mon, Jun 15, 2026 at 11:42:20PM +0000, Luigi Rizzo wrote:
> > The use of swiotlb causes an extra data copy on I/O.  For tx sockets,
> > especially with greedy senders, this has a high chance of happening in
> > the softirq handler for tx network interrupts, creating a significant
> > performance bottleneck.
> >
> > Allow tx sockets to allocate socket buffers directly from the bounce
> > buffers. This avoids the second copy and removes the above bottleneck.
> > The fraction of swiotlb buffers allowed for this feature is set with
> >    /sys/module/swiotlb/parameters/zerocopy_tx_percent
> > (0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
> >
> > Implementation:
> > - define a new page type to unambiguously identify bounce buffers used
> >   as backing storage for socket buffers
> > - modify skb_page_frag_refill to perform the modified allocation
> > - modify the destructors __free_frozen_pages(), free_unref_folio() to
> >   handle those pages and return them to the pool.
> >
> > The savings are especially visible with fewer queues. In synthetic
> > benchmarks, senders with 1-2 queues would cap around 50Gbps with
> > conventional swiotlb, and reach over 170Gbps with the feature enabled.
>
> I could be wrong, but I genuinely think that the way to go about this is
> using page_pool for regular TX as well. page_pool pages are all dma-mapped
> (so whatever swiotlb optimization you want can be done there), and the net
> stack already has awareness of these special pages and special skbs, so it
> won't Just Return Them back to the page allocator.

I am not sure I follow your comment above, can you expand/clarify?

The problem I am dealing with is that the copy from the socket buffer
to the bounce buffer is done in the device xmit function. Under high
load it is almost always done by the tx softirq.
This means that even if we move the copy outside the HARD_TX_LOCK(),
it would still be almost completely serialized.
Hence the proposed method to make skb_page_frag_refill() allocate
a bounce buffer directly (under specific conditions), so there is a single
copy done directly to the dma-able buffer, and it is done in the user
threads/CPUs and is not serialized in the softirq thread.

I am not sure how page_pool on tx could help here.

cheers
luigi

^ permalink raw reply

* Re: [PATCH 07/23] driver core: platform: provide platform_device_set_fwnode()
From: Andy Shevchenko @ 2026-06-16  9:49 UTC (permalink / raw)
  To: Bartosz Golaszewski
  Cc: Lee Jones, Mark Brown, Thierry Reding, Sebastian Hesselbarth,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Srinivas Kandagatla, Greg Kroah-Hartman, Vinod Koul,
	Rafael J. Wysocki, Danilo Krummrich, Rob Herring, Saravana Kannan,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Andi Shyti, Joerg Roedel,
	Will Deacon, Robin Murphy, Doug Berger, Florian Fainelli,
	Broadcom internal kernel review list, Ulf Hansson, Frank Li,
	Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Matthew Brost, Thomas Hellström, Rodrigo Vivi, David Airlie,
	Simona Vetter, Peter Chen, Paul Cercueil, Bin Liu, Philipp Zabel,
	Maximilian Luz, Hans de Goede, Ilpo Järvinen,
	Krzysztof Kozlowski, Benjamin Herrenschmidt, linux-kernel, netdev,
	linux-arm-msm, linux-sound, driver-core, devicetree, linuxppc-dev,
	linux-i2c, iommu, linux-pm, imx, linux-arm-kernel, intel-xe,
	dri-devel, linux-usb, linux-mips, platform-driver-x86,
	Bartosz Golaszewski
In-Reply-To: <CAMRc=McLN9Ovoqo3om-3uC=q+=rcKCoiWMctC=yvwiaHacU0PQ@mail.gmail.com>

On Thu, Jun 04, 2026 at 05:32:27AM -0700, Bartosz Golaszewski wrote:
> On Tue, 2 Jun 2026 23:41:53 +0200, Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> said:
> > On Thu, May 21, 2026 at 10:36:30AM +0200, Bartosz Golaszewski wrote:
> >> Provide a helper function encapsulating the logic of assigning firmware
> >> nodes to platform devices created with platform_device_alloc(). Make the
> >> kerneldoc state that this is the proper interface for assigning firmware
> >> nodes to dynamically allocated platform devices. This will allow us to
> >> switch to counting the references of the device's firmware nodes in the
> >> future, not only the OF nodes.
> >
> > But why different for of_node and fwnode to begin with?!
> 
> I'm not following. What are you suggesting?

After re-reading this thread, I think I'm suggesting the same thing you
plan to do in the future, as you put it: "This will allow us to switch to
counting the references of the device's firmware nodes in the future, not only
the OF nodes."

// Offtopic
I haven't heard from you for more than a month on this:
https://lore.kernel.org/r/af18zdP5HF3_P9Vo@black.igk.intel.com
Is there anything I should do? Please reply in that thread.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH 08/23] driver core: platform: provide platform_device_set_of_node_from_dev()
From: Andy Shevchenko @ 2026-06-16  9:41 UTC (permalink / raw)
  To: Johan Hovold
  Cc: Bartosz Golaszewski, Lee Jones, Mark Brown, Thierry Reding,
	Sebastian Hesselbarth, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Srinivas Kandagatla,
	Greg Kroah-Hartman, Vinod Koul, Rafael J. Wysocki,
	Danilo Krummrich, Rob Herring, Saravana Kannan,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Andi Shyti, Joerg Roedel,
	Will Deacon, Robin Murphy, Doug Berger, Florian Fainelli,
	Broadcom internal kernel review list, Ulf Hansson, Frank Li,
	Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Matthew Brost, Thomas Hellström, Rodrigo Vivi, David Airlie,
	Simona Vetter, Peter Chen, Paul Cercueil, Bin Liu, Philipp Zabel,
	Maximilian Luz, Hans de Goede, Ilpo Järvinen,
	Krzysztof Kozlowski, Benjamin Herrenschmidt, brgl, linux-kernel,
	netdev, linux-arm-msm, linux-sound, driver-core, devicetree,
	linuxppc-dev, linux-i2c, iommu, linux-pm, imx, linux-arm-kernel,
	intel-xe, dri-devel, linux-usb, linux-mips, platform-driver-x86
In-Reply-To: <aiZpJkQBXg2pcczy@hovoldconsulting.com>

On Mon, Jun 08, 2026 at 09:03:02AM +0200, Johan Hovold wrote:
> On Fri, Jun 05, 2026 at 05:53:04PM +0300, Andy Shevchenko wrote:
> > On Fri, Jun 05, 2026 at 02:16:17PM +0200, Johan Hovold wrote:
> > > On Wed, Jun 03, 2026 at 12:44:55AM +0300, Andy Shevchenko wrote:
> > > > On Thu, May 21, 2026 at 10:36:31AM +0200, Bartosz Golaszewski wrote:
> > > > > Provide a platform-specific variant of device_set_of_node_from_dev(). In
> > > > > addition to bumping the reference count of the OF node being assigned,
> > > > > it also assigns the fwnode of the platform device.
> > > > 
> > > > Can we rather investigate the way how to make that of node reuse thingy
> > > > (which is used solely by pin control) differently and then drop this confusing
> > > > device_set_of_node_from_dev() call altogether?
> > > 
> > > No, that call is needed. See commit 4e75e1d7dac9 ("driver core: add
> > > helper to reuse a device-tree node") for details.
> > 
> > Bart fixes the problem with the platform driver. At the result this will be
> > the only device_set_node() + 'reused = true'.  As for 'reused' flag, the need
> > is only for pinmux/pin control stuff.
> 
> And any other resource which may (eventually) be claimed by driver core
> or bus code.
> 
> > The question here is if there is a better
> > way to make that 'reused' be done automatically without need of setting some
> > flag explicitly.
> 
> That's not really relevant to the series at hand.

It's not, but it's relevant in the long term for understanding how we can get
this done in a better way.

> If this is something we want to merge then you need to continue setting
> the flag in order not to cause regressions.

Yes, that's how it is now.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* [PATCH net-next v2] net: dsa: Fix skb ownership in taggers
From: Linus Walleij @ 2026-06-16  9:36 UTC (permalink / raw)
  To: Andrew Lunn, Vladimir Oltean, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Florian Fainelli,
	Jonas Gorski, Hauke Mehrtens, Kurt Kanzenbach, Woojung Huh,
	UNGLinuxDriver, Chester A. Unal, Daniel Golle, Matthias Brugger,
	AngeloGioacchino Del Regno, Wei Fang, Clark Wang,
	Clément Léger, George McCollister, David Yang
  Cc: netdev, Sashiko AI Review, Linus Walleij

The tag_8021q.c tagger calls vlan_insert_tag() in dsa_8021q_xmit().
vlan_insert_tag() will consume the skb with kfree_skb() on failure
and return NULL.

When ->xmit() then returns NULL as an error indication to dsa_user_xmit(),
the latter frees the same skb again, leading to a double free.

The idea of dsa_user_xmit() and dsa_switch_rcv() dropping the skb
they held before the call to ->xmit() and ->rcv() is conceptually
wrong: the pattern elsewhere in the networking code is that consumers
drop their skbs on failure.

Modify the ->xmit() and ->rcv() call sites to not drop the SKB if
the taggers return NULL from any of these calls. Move those drops into
the taggers so every callback error path that retains ownership consumes
the skb before returning NULL.
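
The ownership rule described above can be illustrated with a small
userspace sketch. This is not kernel code: struct buf, buf_alloc(),
buf_free(), tagger_xmit() and user_xmit() are hypothetical stand-ins for
struct sk_buff, the skb allocator, kfree_skb(), a tagger ->xmit()
callback and dsa_user_xmit(), and the length check is only an arbitrary
failure condition chosen for the example. The frees counter exists only
to show that each buffer is released exactly once on both paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for struct sk_buff (assumption, not the kernel type). */
struct buf {
	int len;
};

/* Counts how many times a buffer has been released. */
static int frees;

static struct buf *buf_alloc(int len)
{
	struct buf *b = malloc(sizeof(*b));

	if (b)
		b->len = len;
	return b;
}

/* Stand-in for kfree_skb(): releases the buffer and records it. */
static void buf_free(struct buf *b)
{
	frees++;
	free(b);
}

/*
 * Hypothetical tagger callback. On failure it consumes the buffer
 * itself and returns NULL; on success it returns the buffer and the
 * caller keeps ownership.
 */
static struct buf *tagger_xmit(struct buf *b, int min_len)
{
	if (b->len < min_len) {
		buf_free(b);
		return NULL;
	}
	return b;
}

/*
 * Hypothetical call site. A NULL return means the buffer is already
 * gone, so it must not be freed here a second time.
 * Returns 0 if the buffer was passed on, -1 if it was dropped.
 */
static int user_xmit(struct buf *b, int min_len)
{
	struct buf *nb = tagger_xmit(b, min_len);

	if (!nb)
		return -1;

	/* Stands in for handing the buffer to the conduit, which consumes it. */
	buf_free(nb);
	return 0;
}
```

Calling user_xmit(buf_alloc(64), 60) takes the success path and
user_xmit(buf_alloc(10), 60) takes the failure path; in both cases the
counter goes up by exactly one, which is the property the call-site
change is meant to preserve.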

Keep the existing helper ownership rules: VLAN insertion helpers already
free on failure (this is the case in tag_8021q.c), while deferred
transmit paths either transfer the skb reference to worker context or
hold a worker reference with skb_get() and drop the caller's reference.
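
For the deferred transmit case, a reference-counted userspace model may
make the rule easier to follow. Again this is only a sketch under stated
assumptions: struct rbuf, rbuf_get(), rbuf_put(), struct work,
defer_xmit() and worker_run() are hypothetical names standing in for a
refcounted skb, skb_get(), kfree_skb(), the xmit work item, a deferring
->xmit() path and the worker function. It is not the ksz or sja1105
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy refcounted buffer (assumption, not struct sk_buff). */
struct rbuf {
	int users;
};

/* Counts buffers whose last reference has been dropped. */
static int released;

static struct rbuf *rbuf_alloc(void)
{
	struct rbuf *b = malloc(sizeof(*b));

	if (b)
		b->users = 1;	/* the allocator hands out one reference */
	return b;
}

/* Stand-in for skb_get(): take an additional reference. */
static struct rbuf *rbuf_get(struct rbuf *b)
{
	b->users++;
	return b;
}

/* Stand-in for kfree_skb(): drop one reference, free on the last one. */
static void rbuf_put(struct rbuf *b)
{
	if (--b->users == 0) {
		released++;
		free(b);
	}
}

/* Hypothetical deferred work item holding the worker's reference. */
struct work {
	struct rbuf *b;
};

/*
 * Hypothetical deferring xmit path: the worker gets its own reference,
 * the caller's reference is dropped here, and NULL tells the call site
 * that there is nothing left for it to send or free.
 */
static struct rbuf *defer_xmit(struct work *w, struct rbuf *b)
{
	w->b = rbuf_get(b);
	rbuf_put(b);
	return NULL;
}

/* Hypothetical worker: consumes the reference it was given. */
static void worker_run(struct work *w)
{
	rbuf_put(w->b);
	w->b = NULL;
}
```

After defer_xmit() the buffer is still alive with a single user (the
worker), and it is released only once worker_run() has dropped that
last reference.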

For SJA1105 meta RX, transfer the buffered stampable skb under the meta
lock and return NULL while the skb is waiting for its meta frame: the
skb is not dropped in this case.

Reported-by: Sashiko AI Review <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/r/20260610153952.1685895-1-kuba@kernel.org/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Assisted-by: Codex:gpt-5-5
Acked-by: David Yang <mmyangfl@gmail.com> # yt921x
Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Signed-off-by: Linus Walleij <linusw@kernel.org>
---
Changes in v2:
- In some instances __skb_pad() and __skb_put_padto() followed by a
  kfree_skb() could be simplified to just call skb_pad() and
  skb_put_padto() which will free the skb on failure.
- Use a label and goto for the kfree_skb(); return NULL; in
  the netc_rcv() callback in tag_netc.c as requested.
- Collect ACKs.
- Retag for net-next.
- Link to v1: https://patch.msgid.link/20260616-dsa-fix-free-skb-v1-1-fd30b35dcf66@kernel.org
---
 net/dsa/tag.c               |  4 +---
 net/dsa/tag_ar9331.c        | 10 ++++++++--
 net/dsa/tag_brcm.c          | 39 ++++++++++++++++++++++++---------------
 net/dsa/tag_dsa.c           | 15 ++++++++++++---
 net/dsa/tag_gswip.c         |  8 ++++++--
 net/dsa/tag_hellcreek.c     |  9 +++++++--
 net/dsa/tag_ksz.c           | 44 +++++++++++++++++++++++++++++++-------------
 net/dsa/tag_lan9303.c       |  2 ++
 net/dsa/tag_mtk.c           |  8 ++++++--
 net/dsa/tag_mxl-gsw1xx.c    |  3 +++
 net/dsa/tag_mxl862xx.c      |  3 +++
 net/dsa/tag_netc.c          | 18 ++++++++++--------
 net/dsa/tag_ocelot.c        |  4 +++-
 net/dsa/tag_ocelot_8021q.c  | 20 +++++++++++++-------
 net/dsa/tag_qca.c           | 14 +++++++++++---
 net/dsa/tag_rtl4_a.c        | 10 ++++++++--
 net/dsa/tag_rtl8_4.c        | 24 ++++++++++++++++++------
 net/dsa/tag_rzn1_a5psw.c    |  8 ++++++--
 net/dsa/tag_sja1105.c       | 42 +++++++++++++++++++++++++++---------------
 net/dsa/tag_trailer.c       | 16 ++++++++++++----
 net/dsa/tag_vsc73xx_8021q.c |  1 +
 net/dsa/tag_xrs700x.c       | 12 +++++++++---
 net/dsa/tag_yt921x.c        |  7 ++++++-
 net/dsa/user.c              |  7 +++----
 24 files changed, 230 insertions(+), 98 deletions(-)

diff --git a/net/dsa/tag.c b/net/dsa/tag.c
index 79ad105902d9..cfc8f5a0cbd9 100644
--- a/net/dsa/tag.c
+++ b/net/dsa/tag.c
@@ -84,10 +84,8 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
 		nskb = cpu_dp->rcv(skb, dev);
 	}
 
-	if (!nskb) {
-		kfree_skb(skb);
+	if (!nskb)
 		return 0;
-	}
 
 	skb = nskb;
 	skb_push(skb, ETH_HLEN);
diff --git a/net/dsa/tag_ar9331.c b/net/dsa/tag_ar9331.c
index cbb588ca73aa..2e2388143b02 100644
--- a/net/dsa/tag_ar9331.c
+++ b/net/dsa/tag_ar9331.c
@@ -51,8 +51,10 @@ static struct sk_buff *ar9331_tag_rcv(struct sk_buff *skb,
 	u8 ver, port;
 	u16 hdr;
 
-	if (unlikely(!pskb_may_pull(skb, AR9331_HDR_LEN)))
+	if (unlikely(!pskb_may_pull(skb, AR9331_HDR_LEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	hdr = le16_to_cpu(*(__le16 *)skb_mac_header(skb));
 
@@ -60,12 +62,14 @@ static struct sk_buff *ar9331_tag_rcv(struct sk_buff *skb,
 	if (unlikely(ver != AR9331_HDR_VERSION)) {
 		netdev_warn_once(ndev, "%s:%i wrong header version 0x%2x\n",
 				 __func__, __LINE__, hdr);
+		kfree_skb(skb);
 		return NULL;
 	}
 
 	if (unlikely(hdr & AR9331_HDR_FROM_CPU)) {
 		netdev_warn_once(ndev, "%s:%i packet should not be from cpu 0x%2x\n",
 				 __func__, __LINE__, hdr);
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -75,8 +79,10 @@ static struct sk_buff *ar9331_tag_rcv(struct sk_buff *skb,
 	port = FIELD_GET(AR9331_HDR_PORT_NUM_MASK, hdr);
 
 	skb->dev = dsa_conduit_find_user(ndev, 0, port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	return skb;
 }
diff --git a/net/dsa/tag_brcm.c b/net/dsa/tag_brcm.c
index cf9420439054..411e3b57d16a 100644
--- a/net/dsa/tag_brcm.c
+++ b/net/dsa/tag_brcm.c
@@ -102,9 +102,9 @@ static struct sk_buff *brcm_tag_xmit_ll(struct sk_buff *skb,
 	 * (including FCS and tag) because the length verification is done after
 	 * the Broadcom tag is stripped off the ingress packet.
 	 *
-	 * Let dsa_user_xmit() free the SKB
+	 * Free the SKB on error.
 	 */
-	if (__skb_put_padto(skb, ETH_ZLEN + BRCM_TAG_LEN, false))
+	if (skb_put_padto(skb, ETH_ZLEN + BRCM_TAG_LEN))
 		return NULL;
 
 	skb_push(skb, BRCM_TAG_LEN);
@@ -151,27 +151,35 @@ static struct sk_buff *brcm_tag_rcv_ll(struct sk_buff *skb,
 	int source_port;
 	u8 *brcm_tag;
 
-	if (unlikely(!pskb_may_pull(skb, BRCM_TAG_LEN)))
+	if (unlikely(!pskb_may_pull(skb, BRCM_TAG_LEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	brcm_tag = skb->data - offset;
 
 	/* The opcode should never be different than 0b000 */
-	if (unlikely((brcm_tag[0] >> BRCM_OPCODE_SHIFT) & BRCM_OPCODE_MASK))
+	if (unlikely((brcm_tag[0] >> BRCM_OPCODE_SHIFT) & BRCM_OPCODE_MASK)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* We should never see a reserved reason code without knowing how to
 	 * handle it
 	 */
-	if (unlikely(brcm_tag[2] & BRCM_EG_RC_RSVD))
+	if (unlikely(brcm_tag[2] & BRCM_EG_RC_RSVD)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Locate which port this is coming from */
 	source_port = brcm_tag[3] & BRCM_EG_PID_MASK;
 
 	skb->dev = dsa_conduit_find_user(dev, 0, source_port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Remove Broadcom tag and update checksum */
 	skb_pull_rcsum(skb, BRCM_TAG_LEN);
@@ -228,8 +236,10 @@ static struct sk_buff *brcm_leg_tag_rcv(struct sk_buff *skb,
 	__be16 *proto;
 	u8 *brcm_tag;
 
-	if (unlikely(!pskb_may_pull(skb, BRCM_LEG_TAG_LEN + VLAN_HLEN)))
+	if (unlikely(!pskb_may_pull(skb, BRCM_LEG_TAG_LEN + VLAN_HLEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	brcm_tag = dsa_etype_header_pos_rx(skb);
 	proto = (__be16 *)(brcm_tag + BRCM_LEG_TAG_LEN);
@@ -237,8 +247,10 @@ static struct sk_buff *brcm_leg_tag_rcv(struct sk_buff *skb,
 	source_port = brcm_tag[5] & BRCM_LEG_PORT_ID;
 
 	skb->dev = dsa_conduit_find_user(dev, 0, source_port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* The internal switch in BCM63XX SoCs always tags on egress on the CPU
 	 * port. We use VID 0 internally for untagged traffic, so strip the tag
@@ -273,10 +285,8 @@ static struct sk_buff *brcm_leg_tag_xmit(struct sk_buff *skb,
 	 * need to make sure that packets are at least 70 bytes
 	 * (including FCS and tag) because the length verification is done after
 	 * the Broadcom tag is stripped off the ingress packet.
-	 *
-	 * Let dsa_user_xmit() free the SKB
 	 */
-	if (__skb_put_padto(skb, ETH_ZLEN + BRCM_LEG_TAG_LEN, false))
+	if (skb_put_padto(skb, ETH_ZLEN + BRCM_LEG_TAG_LEN))
 		return NULL;
 
 	skb_push(skb, BRCM_LEG_TAG_LEN);
@@ -325,10 +335,8 @@ static struct sk_buff *brcm_leg_fcs_tag_xmit(struct sk_buff *skb,
 	 * need to make sure that packets are at least 70 bytes (including FCS
 	 * and tag) because the length verification is done after the Broadcom
 	 * tag is stripped off the ingress packet.
-	 *
-	 * Let dsa_user_xmit() free the SKB.
 	 */
-	if (__skb_put_padto(skb, ETH_ZLEN + BRCM_LEG_TAG_LEN, false))
+	if (skb_put_padto(skb, ETH_ZLEN + BRCM_LEG_TAG_LEN))
 		return NULL;
 
 	fcs_len = skb->len;
@@ -351,8 +359,9 @@ static struct sk_buff *brcm_leg_fcs_tag_xmit(struct sk_buff *skb,
 	brcm_tag[5] = dp->index & BRCM_LEG_PORT_ID;
 
 	/* Original FCS value */
-	if (__skb_pad(skb, ETH_FCS_LEN, false))
+	if (skb_pad(skb, ETH_FCS_LEN))
 		return NULL;
+
 	skb_put_data(skb, &fcs_val, ETH_FCS_LEN);
 
 	return skb;
diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
index 2a2c4fb61a65..d5ffee35fbb5 100644
--- a/net/dsa/tag_dsa.c
+++ b/net/dsa/tag_dsa.c
@@ -224,6 +224,7 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
 			/* Remote management is not implemented yet,
 			 * drop.
 			 */
+			kfree_skb(skb);
 			return NULL;
 		case DSA_CODE_ARP_MIRROR:
 		case DSA_CODE_POLICY_MIRROR:
@@ -244,12 +245,14 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
 			/* Reserved code, this could be anything. Drop
 			 * seems like the safest option.
 			 */
+			kfree_skb(skb);
 			return NULL;
 		}
 
 		break;
 
 	default:
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -271,8 +274,10 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
 						 source_port);
 	}
 
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* When using LAG offload, skb->dev is not a DSA user interface,
 	 * so we cannot call dsa_default_offload_fwd_mark and we need to
@@ -335,8 +340,10 @@ static struct sk_buff *dsa_xmit(struct sk_buff *skb, struct net_device *dev)
 
 static struct sk_buff *dsa_rcv(struct sk_buff *skb, struct net_device *dev)
 {
-	if (unlikely(!pskb_may_pull(skb, DSA_HLEN)))
+	if (unlikely(!pskb_may_pull(skb, DSA_HLEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	return dsa_rcv_ll(skb, dev, 0);
 }
@@ -375,8 +382,10 @@ static struct sk_buff *edsa_xmit(struct sk_buff *skb, struct net_device *dev)
 
 static struct sk_buff *edsa_rcv(struct sk_buff *skb, struct net_device *dev)
 {
-	if (unlikely(!pskb_may_pull(skb, EDSA_HLEN)))
+	if (unlikely(!pskb_may_pull(skb, EDSA_HLEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	skb_pull_rcsum(skb, EDSA_HLEN - DSA_HLEN);
 
diff --git a/net/dsa/tag_gswip.c b/net/dsa/tag_gswip.c
index 5fa436121087..5c407d448c9f 100644
--- a/net/dsa/tag_gswip.c
+++ b/net/dsa/tag_gswip.c
@@ -80,16 +80,20 @@ static struct sk_buff *gswip_tag_rcv(struct sk_buff *skb,
 	int port;
 	u8 *gswip_tag;
 
-	if (unlikely(!pskb_may_pull(skb, GSWIP_RX_HEADER_LEN)))
+	if (unlikely(!pskb_may_pull(skb, GSWIP_RX_HEADER_LEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	gswip_tag = skb->data - ETH_HLEN;
 
 	/* Get source port information */
 	port = (gswip_tag[7] & GSWIP_RX_SPPID_MASK) >> GSWIP_RX_SPPID_SHIFT;
 	skb->dev = dsa_conduit_find_user(dev, 0, port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* remove GSWIP tag */
 	skb_pull_rcsum(skb, GSWIP_RX_HEADER_LEN);
diff --git a/net/dsa/tag_hellcreek.c b/net/dsa/tag_hellcreek.c
index 544ab15685a2..dd9f328f3182 100644
--- a/net/dsa/tag_hellcreek.c
+++ b/net/dsa/tag_hellcreek.c
@@ -27,8 +27,10 @@ static struct sk_buff *hellcreek_xmit(struct sk_buff *skb,
 	 * checksums after the switch strips the tag.
 	 */
 	if (skb->ip_summed == CHECKSUM_PARTIAL &&
-	    skb_checksum_help(skb))
+	    skb_checksum_help(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Tag encoding */
 	tag  = skb_put(skb, HELLCREEK_TAG_LEN);
@@ -47,11 +49,14 @@ static struct sk_buff *hellcreek_rcv(struct sk_buff *skb,
 	skb->dev = dsa_conduit_find_user(dev, 0, port);
 	if (!skb->dev) {
 		netdev_warn_once(dev, "Failed to get source port: %d\n", port);
+		kfree_skb(skb);
 		return NULL;
 	}
 
-	if (pskb_trim_rcsum(skb, skb->len - HELLCREEK_TAG_LEN))
+	if (pskb_trim_rcsum(skb, skb->len - HELLCREEK_TAG_LEN)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	dsa_default_offload_fwd_mark(skb);
 
diff --git a/net/dsa/tag_ksz.c b/net/dsa/tag_ksz.c
index d2475c3bbb7d..67fa89f102e0 100644
--- a/net/dsa/tag_ksz.c
+++ b/net/dsa/tag_ksz.c
@@ -88,11 +88,15 @@ static struct sk_buff *ksz_common_rcv(struct sk_buff *skb,
 				      unsigned int port, unsigned int len)
 {
 	skb->dev = dsa_conduit_find_user(dev, 0, port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
-	if (pskb_trim_rcsum(skb, skb->len - len))
+	if (pskb_trim_rcsum(skb, skb->len - len)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	dsa_default_offload_fwd_mark(skb);
 
@@ -123,8 +127,10 @@ static struct sk_buff *ksz8795_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct ethhdr *hdr;
 	u8 *tag;
 
-	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Tag encoding */
 	tag = skb_put(skb, KSZ_INGRESS_TAG_LEN);
@@ -141,8 +147,10 @@ static struct sk_buff *ksz8795_rcv(struct sk_buff *skb, struct net_device *dev)
 {
 	u8 *tag;
 
-	if (skb_linearize(skb))
+	if (skb_linearize(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	tag = skb_tail_pointer(skb) - KSZ_EGRESS_TAG_LEN;
 
@@ -255,22 +263,24 @@ static struct sk_buff *ksz_defer_xmit(struct dsa_port *dp, struct sk_buff *skb)
 	xmit_work_fn = tagger_data->xmit_work_fn;
 	xmit_worker = priv->xmit_worker;
 
-	if (!xmit_work_fn || !xmit_worker)
+	if (!xmit_work_fn || !xmit_worker) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	xmit_work = kzalloc_obj(*xmit_work, GFP_ATOMIC);
-	if (!xmit_work)
+	if (!xmit_work) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	kthread_init_work(&xmit_work->work, xmit_work_fn);
-	/* Increase refcount so the kfree_skb in dsa_user_xmit
-	 * won't really free the packet.
-	 */
 	xmit_work->dp = dp;
 	xmit_work->skb = skb_get(skb);
 
 	kthread_queue_work(xmit_worker, &xmit_work->work);
 
+	kfree_skb(skb);
 	return NULL;
 }
 
@@ -284,8 +294,10 @@ static struct sk_buff *ksz9477_xmit(struct sk_buff *skb,
 	__be16 *tag;
 	u16 val;
 
-	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Tag encoding */
 	ksz_xmit_timestamp(dp, skb);
@@ -310,8 +322,10 @@ static struct sk_buff *ksz9477_rcv(struct sk_buff *skb, struct net_device *dev)
 	unsigned int port;
 	u8 *tag;
 
-	if (skb_linearize(skb))
+	if (skb_linearize(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Tag decoding */
 	tag = skb_tail_pointer(skb) - KSZ_EGRESS_TAG_LEN;
@@ -352,8 +366,10 @@ static struct sk_buff *ksz9893_xmit(struct sk_buff *skb,
 	struct ethhdr *hdr;
 	u8 *tag;
 
-	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Tag encoding */
 	ksz_xmit_timestamp(dp, skb);
@@ -418,8 +434,10 @@ static struct sk_buff *lan937x_xmit(struct sk_buff *skb,
 	__be16 *tag;
 	u16 val;
 
-	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	ksz_xmit_timestamp(dp, skb);
 
diff --git a/net/dsa/tag_lan9303.c b/net/dsa/tag_lan9303.c
index 258e5d7dc5ef..d1194696499a 100644
--- a/net/dsa/tag_lan9303.c
+++ b/net/dsa/tag_lan9303.c
@@ -85,6 +85,7 @@ static struct sk_buff *lan9303_rcv(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(!pskb_may_pull(skb, LAN9303_TAG_LEN))) {
 		dev_warn_ratelimited(&dev->dev,
 				     "Dropping packet, cannot pull\n");
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -102,6 +103,7 @@ static struct sk_buff *lan9303_rcv(struct sk_buff *skb, struct net_device *dev)
 	skb->dev = dsa_conduit_find_user(dev, 0, source_port);
 	if (!skb->dev) {
 		dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid source port\n");
+		kfree_skb(skb);
 		return NULL;
 	}
 
diff --git a/net/dsa/tag_mtk.c b/net/dsa/tag_mtk.c
index dea3eecaf093..c7dc7731675e 100644
--- a/net/dsa/tag_mtk.c
+++ b/net/dsa/tag_mtk.c
@@ -72,8 +72,10 @@ static struct sk_buff *mtk_tag_rcv(struct sk_buff *skb, struct net_device *dev)
 	int port;
 	__be16 *phdr;
 
-	if (unlikely(!pskb_may_pull(skb, MTK_HDR_LEN)))
+	if (unlikely(!pskb_may_pull(skb, MTK_HDR_LEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	phdr = dsa_etype_header_pos_rx(skb);
 	hdr = ntohs(*phdr);
@@ -87,8 +89,10 @@ static struct sk_buff *mtk_tag_rcv(struct sk_buff *skb, struct net_device *dev)
 	port = (hdr & MTK_HDR_RECV_SOURCE_PORT_MASK);
 
 	skb->dev = dsa_conduit_find_user(dev, 0, port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	dsa_default_offload_fwd_mark(skb);
 
diff --git a/net/dsa/tag_mxl-gsw1xx.c b/net/dsa/tag_mxl-gsw1xx.c
index 60f7c445e656..4b1b6ef94196 100644
--- a/net/dsa/tag_mxl-gsw1xx.c
+++ b/net/dsa/tag_mxl-gsw1xx.c
@@ -73,6 +73,7 @@ static struct sk_buff *gsw1xx_tag_rcv(struct sk_buff *skb,
 
 	if (unlikely(!pskb_may_pull(skb, GSW1XX_HEADER_LEN))) {
 		dev_warn_ratelimited(&dev->dev, "Dropping packet, cannot pull SKB\n");
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -81,6 +82,7 @@ static struct sk_buff *gsw1xx_tag_rcv(struct sk_buff *skb,
 	if (unlikely(ntohs(gsw1xx_tag[0]) != ETH_P_MXLGSW)) {
 		dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid special tag\n");
 		dev_warn_ratelimited(&dev->dev, "Tag: %8ph\n", gsw1xx_tag);
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -90,6 +92,7 @@ static struct sk_buff *gsw1xx_tag_rcv(struct sk_buff *skb,
 	if (!skb->dev) {
 		dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid source port\n");
 		dev_warn_ratelimited(&dev->dev, "Tag: %8ph\n", gsw1xx_tag);
+		kfree_skb(skb);
 		return NULL;
 	}
 
diff --git a/net/dsa/tag_mxl862xx.c b/net/dsa/tag_mxl862xx.c
index 8daefeb8d49d..87b80ddf0946 100644
--- a/net/dsa/tag_mxl862xx.c
+++ b/net/dsa/tag_mxl862xx.c
@@ -64,6 +64,7 @@ static struct sk_buff *mxl862_tag_rcv(struct sk_buff *skb,
 
 	if (unlikely(!pskb_may_pull(skb, MXL862_HEADER_LEN))) {
 		dev_warn_ratelimited(&dev->dev, "Cannot pull SKB, packet dropped\n");
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -73,6 +74,7 @@ static struct sk_buff *mxl862_tag_rcv(struct sk_buff *skb,
 		dev_warn_ratelimited(&dev->dev,
 				     "Invalid special tag marker, packet dropped, tag: %8ph\n",
 				     mxl862_tag);
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -83,6 +85,7 @@ static struct sk_buff *mxl862_tag_rcv(struct sk_buff *skb,
 		dev_warn_ratelimited(&dev->dev,
 				     "Invalid source port, packet dropped, tag: %8ph\n",
 				     mxl862_tag);
+		kfree_skb(skb);
 		return NULL;
 	}
 
diff --git a/net/dsa/tag_netc.c b/net/dsa/tag_netc.c
index ccedfe3a80b6..df72a61796ad 100644
--- a/net/dsa/tag_netc.c
+++ b/net/dsa/tag_netc.c
@@ -131,14 +131,13 @@ static struct sk_buff *netc_rcv(struct sk_buff *skb,
 	int type, subtype;
 
 	if (unlikely(!pskb_may_pull(skb, NETC_TAG_MAX_LEN)))
-		return NULL;
+		goto err_free_skb;
 
 	tag_cmn = dsa_etype_header_pos_rx(skb);
 	if (ntohs(tag_cmn->tpid) != ETH_P_NXP_NETC) {
 		dev_warn_ratelimited(&ndev->dev, "Unknown TPID 0x%04x\n",
 				     ntohs(tag_cmn->tpid));
-
-		return NULL;
+		goto err_free_skb;
 	}
 
 	if (tag_cmn->qos & NETC_TAG_QV)
@@ -149,14 +148,13 @@ static struct sk_buff *netc_rcv(struct sk_buff *skb,
 	if (!sw_id) {
 		dev_warn_ratelimited(&ndev->dev,
 				     "VEPA switch ID is not supported yet\n");
-
-		return NULL;
+		goto err_free_skb;
 	}
 
 	port = FIELD_GET(NETC_TAG_PORT, tag_cmn->switch_port);
 	skb->dev = dsa_conduit_find_user(ndev, sw_id, port);
 	if (!skb->dev)
-		return NULL;
+		goto err_free_skb;
 
 	type = FIELD_GET(NETC_TAG_TYPE, tag_cmn->type);
 	subtype = FIELD_GET(NETC_TAG_SUBTYPE, tag_cmn->type);
@@ -165,11 +163,11 @@ static struct sk_buff *netc_rcv(struct sk_buff *skb,
 	} else if (type == NETC_TAG_TO_HOST) {
 		/* Currently only subtype0 supported */
 		if (subtype != NETC_TAG_TH_SUBTYPE0)
-			return NULL;
+			goto err_free_skb;
 	} else {
 		dev_warn_ratelimited(&ndev->dev,
 				     "Unexpected  tag type %d\n", type);
-		return NULL;
+		goto err_free_skb;
 	}
 
 	/* Remove Switch tag from the frame */
@@ -178,6 +176,10 @@ static struct sk_buff *netc_rcv(struct sk_buff *skb,
 	dsa_strip_etype_header(skb, tag_len);
 
 	return skb;
+
+err_free_skb:
+	kfree_skb(skb);
+	return NULL;
 }
 
 static void netc_flow_dissect(const struct sk_buff *skb, __be16 *proto,
diff --git a/net/dsa/tag_ocelot.c b/net/dsa/tag_ocelot.c
index 3405def79c2d..d208c7322cd6 100644
--- a/net/dsa/tag_ocelot.c
+++ b/net/dsa/tag_ocelot.c
@@ -107,14 +107,16 @@ static struct sk_buff *ocelot_rcv(struct sk_buff *skb,
 	ocelot_xfh_get_rew_val(extraction, &rew_val);
 
 	skb->dev = dsa_conduit_find_user(netdev, 0, src_port);
-	if (!skb->dev)
+	if (!skb->dev) {
 		/* The switch will reflect back some frames sent through
 		 * sockets opened on the bare DSA conduit. These will come back
 		 * with src_port equal to the index of the CPU port, for which
 		 * there is no user registered. So don't print any error
 		 * message here (ignore and drop those frames).
 		 */
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	dsa_default_offload_fwd_mark(skb);
 	skb->priority = qos_class;
diff --git a/net/dsa/tag_ocelot_8021q.c b/net/dsa/tag_ocelot_8021q.c
index e89d9254e90a..f50f1cd83f16 100644
--- a/net/dsa/tag_ocelot_8021q.c
+++ b/net/dsa/tag_ocelot_8021q.c
@@ -33,30 +33,34 @@ static struct sk_buff *ocelot_defer_xmit(struct dsa_port *dp,
 	xmit_work_fn = data->xmit_work_fn;
 	xmit_worker = priv->xmit_worker;
 
-	if (!xmit_work_fn || !xmit_worker)
+	if (!xmit_work_fn || !xmit_worker) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* PTP over IP packets need UDP checksumming. We may have inherited
 	 * NETIF_F_HW_CSUM from the DSA conduit, but these packets are not sent
 	 * through the DSA conduit, so calculate the checksum here.
 	 */
-	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	xmit_work = kzalloc_obj(*xmit_work, GFP_ATOMIC);
-	if (!xmit_work)
+	if (!xmit_work) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Calls felix_port_deferred_xmit in felix.c */
 	kthread_init_work(&xmit_work->work, xmit_work_fn);
-	/* Increase refcount so the kfree_skb in dsa_user_xmit
-	 * won't really free the packet.
-	 */
 	xmit_work->dp = dp;
 	xmit_work->skb = skb_get(skb);
 
 	kthread_queue_work(xmit_worker, &xmit_work->work);
 
+	kfree_skb(skb);
 	return NULL;
 }
 
@@ -84,8 +88,10 @@ static struct sk_buff *ocelot_rcv(struct sk_buff *skb,
 	dsa_8021q_rcv(skb, &src_port, &switch_id, NULL, NULL);
 
 	skb->dev = dsa_conduit_find_user(netdev, switch_id, src_port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	dsa_default_offload_fwd_mark(skb);
 
diff --git a/net/dsa/tag_qca.c b/net/dsa/tag_qca.c
index 9e3b429e8b36..510792fbfa92 100644
--- a/net/dsa/tag_qca.c
+++ b/net/dsa/tag_qca.c
@@ -46,16 +46,20 @@ static struct sk_buff *qca_tag_rcv(struct sk_buff *skb, struct net_device *dev)
 
 	tagger_data = ds->tagger_data;
 
-	if (unlikely(!pskb_may_pull(skb, QCA_HDR_LEN)))
+	if (unlikely(!pskb_may_pull(skb, QCA_HDR_LEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	phdr = dsa_etype_header_pos_rx(skb);
 	hdr = ntohs(*phdr);
 
 	/* Make sure the version is correct */
 	ver = FIELD_GET(QCA_HDR_RECV_VERSION, hdr);
-	if (unlikely(ver != QCA_HDR_VERSION))
+	if (unlikely(ver != QCA_HDR_VERSION)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Get pk type */
 	pk_type = FIELD_GET(QCA_HDR_RECV_TYPE, hdr);
@@ -64,6 +68,7 @@ static struct sk_buff *qca_tag_rcv(struct sk_buff *skb, struct net_device *dev)
 	if (pk_type == QCA_HDR_RECV_TYPE_RW_REG_ACK) {
 		if (likely(tagger_data->rw_reg_ack_handler))
 			tagger_data->rw_reg_ack_handler(ds, skb);
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -71,6 +76,7 @@ static struct sk_buff *qca_tag_rcv(struct sk_buff *skb, struct net_device *dev)
 	if (pk_type == QCA_HDR_RECV_TYPE_MIB) {
 		if (likely(tagger_data->mib_autocast_handler))
 			tagger_data->mib_autocast_handler(ds, skb);
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -78,8 +84,10 @@ static struct sk_buff *qca_tag_rcv(struct sk_buff *skb, struct net_device *dev)
 	port = FIELD_GET(QCA_HDR_RECV_SOURCE_PORT, hdr);
 
 	skb->dev = dsa_conduit_find_user(dev, 0, port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Remove QCA tag and recalculate checksum */
 	skb_pull_rcsum(skb, QCA_HDR_LEN);
diff --git a/net/dsa/tag_rtl4_a.c b/net/dsa/tag_rtl4_a.c
index 3cc63eacfa03..9805c56025de 100644
--- a/net/dsa/tag_rtl4_a.c
+++ b/net/dsa/tag_rtl4_a.c
@@ -41,8 +41,10 @@ static struct sk_buff *rtl4a_tag_xmit(struct sk_buff *skb,
 	u16 out;
 
 	/* Pad out to at least 60 bytes */
-	if (unlikely(__skb_put_padto(skb, ETH_ZLEN, false)))
+	if (unlikely(__skb_put_padto(skb, ETH_ZLEN, false))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	netdev_dbg(dev, "add realtek tag to package to port %d\n",
 		   dp->index);
@@ -75,8 +77,10 @@ static struct sk_buff *rtl4a_tag_rcv(struct sk_buff *skb,
 	u8 prot;
 	u8 port;
 
-	if (unlikely(!pskb_may_pull(skb, RTL4_A_HDR_LEN)))
+	if (unlikely(!pskb_may_pull(skb, RTL4_A_HDR_LEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	tag = dsa_etype_header_pos_rx(skb);
 	p = (__be16 *)tag;
@@ -92,6 +96,7 @@ static struct sk_buff *rtl4a_tag_rcv(struct sk_buff *skb,
 	prot = (protport >> RTL4_A_PROTOCOL_SHIFT) & 0x0f;
 	if (prot != RTL4_A_PROTOCOL_RTL8366RB) {
 		netdev_err(dev, "unknown realtek protocol 0x%01x\n", prot);
+		kfree_skb(skb);
 		return NULL;
 	}
 	port = protport & 0xff;
@@ -99,6 +104,7 @@ static struct sk_buff *rtl4a_tag_rcv(struct sk_buff *skb,
 	skb->dev = dsa_conduit_find_user(dev, 0, port);
 	if (!skb->dev) {
 		netdev_dbg(dev, "could not find user for port %d\n", port);
+		kfree_skb(skb);
 		return NULL;
 	}
 
diff --git a/net/dsa/tag_rtl8_4.c b/net/dsa/tag_rtl8_4.c
index 852c6b88079a..4da3beebef75 100644
--- a/net/dsa/tag_rtl8_4.c
+++ b/net/dsa/tag_rtl8_4.c
@@ -143,8 +143,10 @@ static struct sk_buff *rtl8_4t_tag_xmit(struct sk_buff *skb,
 	/* Calculate the checksum here if not done yet as trailing tags will
 	 * break either software or hardware based checksum
 	 */
-	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	rtl8_4_write_tag(skb, dev, skb_put(skb, RTL8_4_TAG_LEN));
 
@@ -201,11 +203,15 @@ static int rtl8_4_read_tag(struct sk_buff *skb, struct net_device *dev,
 static struct sk_buff *rtl8_4_tag_rcv(struct sk_buff *skb,
 				      struct net_device *dev)
 {
-	if (unlikely(!pskb_may_pull(skb, RTL8_4_TAG_LEN)))
+	if (unlikely(!pskb_may_pull(skb, RTL8_4_TAG_LEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
-	if (unlikely(rtl8_4_read_tag(skb, dev, dsa_etype_header_pos_rx(skb))))
+	if (unlikely(rtl8_4_read_tag(skb, dev, dsa_etype_header_pos_rx(skb)))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Remove tag and recalculate checksum */
 	skb_pull_rcsum(skb, RTL8_4_TAG_LEN);
@@ -218,14 +224,20 @@ static struct sk_buff *rtl8_4_tag_rcv(struct sk_buff *skb,
 static struct sk_buff *rtl8_4t_tag_rcv(struct sk_buff *skb,
 				       struct net_device *dev)
 {
-	if (skb_linearize(skb))
+	if (skb_linearize(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
-	if (unlikely(rtl8_4_read_tag(skb, dev, skb_tail_pointer(skb) - RTL8_4_TAG_LEN)))
+	if (unlikely(rtl8_4_read_tag(skb, dev, skb_tail_pointer(skb) - RTL8_4_TAG_LEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
-	if (pskb_trim_rcsum(skb, skb->len - RTL8_4_TAG_LEN))
+	if (pskb_trim_rcsum(skb, skb->len - RTL8_4_TAG_LEN)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	return skb;
 }
diff --git a/net/dsa/tag_rzn1_a5psw.c b/net/dsa/tag_rzn1_a5psw.c
index 10994b3470f6..df0098513f3e 100644
--- a/net/dsa/tag_rzn1_a5psw.c
+++ b/net/dsa/tag_rzn1_a5psw.c
@@ -48,7 +48,7 @@ static struct sk_buff *a5psw_tag_xmit(struct sk_buff *skb, struct net_device *de
 	 * least 60 bytes otherwise they will be discarded when they enter the
 	 * switch port logic.
 	 */
-	if (__skb_put_padto(skb, ETH_ZLEN, false))
+	if (skb_put_padto(skb, ETH_ZLEN))
 		return NULL;
 
 	/* provide 'A5PSW_TAG_LEN' bytes additional space */
@@ -77,6 +77,7 @@ static struct sk_buff *a5psw_tag_rcv(struct sk_buff *skb,
 	if (unlikely(!pskb_may_pull(skb, A5PSW_TAG_LEN))) {
 		dev_warn_ratelimited(&dev->dev,
 				     "Dropping packet, cannot pull\n");
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -84,14 +85,17 @@ static struct sk_buff *a5psw_tag_rcv(struct sk_buff *skb,
 
 	if (tag->ctrl_tag != htons(ETH_P_DSA_A5PSW)) {
 		dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid TAG marker\n");
+		kfree_skb(skb);
 		return NULL;
 	}
 
 	port = FIELD_GET(A5PSW_CTRL_DATA_PORT, ntohs(tag->ctrl_data));
 
 	skb->dev = dsa_conduit_find_user(dev, 0, port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	skb_pull_rcsum(skb, A5PSW_TAG_LEN);
 	dsa_strip_etype_header(skb, A5PSW_TAG_LEN);
diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c
index de6d4ce8668b..bfe1f746f55b 100644
--- a/net/dsa/tag_sja1105.c
+++ b/net/dsa/tag_sja1105.c
@@ -149,19 +149,20 @@ static struct sk_buff *sja1105_defer_xmit(struct dsa_port *dp,
 	xmit_work_fn = tagger_data->xmit_work_fn;
 	xmit_worker = priv->xmit_worker;
 
-	if (!xmit_work_fn || !xmit_worker)
+	if (!xmit_work_fn || !xmit_worker) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	xmit_work = kzalloc_obj(*xmit_work, GFP_ATOMIC);
-	if (!xmit_work)
+	if (!xmit_work) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	kthread_init_work(&xmit_work->work, xmit_work_fn);
-	/* Increase refcount so the kfree_skb in dsa_user_xmit
-	 * won't really free the packet.
-	 */
 	xmit_work->dp = dp;
-	xmit_work->skb = skb_get(skb);
+	xmit_work->skb = skb;
 
 	kthread_queue_work(xmit_worker, &xmit_work->work);
 
@@ -401,10 +402,7 @@ static struct sk_buff
 			kfree_skb(priv->stampable_skb);
 		}
 
-		/* Hold a reference to avoid dsa_switch_rcv
-		 * from freeing the skb.
-		 */
-		priv->stampable_skb = skb_get(skb);
+		priv->stampable_skb = skb;
 		spin_unlock(&priv->meta_lock);
 
 		/* Tell DSA we got nothing */
@@ -436,6 +434,7 @@ static struct sk_buff
 			dev_err_ratelimited(ds->dev,
 					    "Unexpected meta frame\n");
 			spin_unlock(&priv->meta_lock);
+			kfree_skb(skb);
 			return NULL;
 		}
 
@@ -443,6 +442,7 @@ static struct sk_buff
 			dev_err_ratelimited(ds->dev,
 					    "Meta frame on wrong port\n");
 			spin_unlock(&priv->meta_lock);
+			kfree_skb(skb);
 			return NULL;
 		}
 
@@ -501,18 +501,21 @@ static struct sk_buff *sja1105_rcv(struct sk_buff *skb,
 	/* Normal data plane traffic and link-local frames are tagged with
 	 * a tag_8021q VLAN which we have to strip
 	 */
-	if (sja1105_skb_has_tag_8021q(skb))
+	if (sja1105_skb_has_tag_8021q(skb)) {
 		dsa_8021q_rcv(skb, &source_port, &switch_id, &vbid, &vid);
-	else if (source_port == -1 && switch_id == -1)
+	} else if (source_port == -1 && switch_id == -1) {
 		/* Packets with no source information have no chance of
 		 * getting accepted, drop them straight away.
 		 */
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	skb->dev = dsa_tag_8021q_find_user(netdev, source_port, switch_id,
 					   vid, vbid);
 	if (!skb->dev) {
 		netdev_warn(netdev, "Couldn't decode source port\n");
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -539,12 +542,15 @@ static struct sk_buff *sja1110_rcv_meta(struct sk_buff *skb, u16 rx_header)
 	if (!ds) {
 		net_err_ratelimited("%s: cannot find switch id %d\n",
 				    conduit->name, switch_id);
+		kfree_skb(skb);
 		return NULL;
 	}
 
 	tagger_data = sja1105_tagger_data(ds);
-	if (!tagger_data->meta_tstamp_handler)
+	if (!tagger_data->meta_tstamp_handler) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	for (i = 0; i <= n_ts; i++) {
 		u8 ts_id, source_port, dir;
@@ -562,6 +568,7 @@ static struct sk_buff *sja1110_rcv_meta(struct sk_buff *skb, u16 rx_header)
 	}
 
 	/* Discard the meta frame, we've consumed the timestamps it contained */
+	kfree_skb(skb);
 	return NULL;
 }
 
@@ -572,8 +579,10 @@ static struct sk_buff *sja1110_rcv_inband_control_extension(struct sk_buff *skb,
 {
 	u16 rx_header;
 
-	if (unlikely(!pskb_may_pull(skb, SJA1110_HEADER_LEN)))
+	if (unlikely(!pskb_may_pull(skb, SJA1110_HEADER_LEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* skb->data points to skb_mac_header(skb) + ETH_HLEN, which is exactly
 	 * what we need because the caller has checked the EtherType (which is
@@ -609,8 +618,10 @@ static struct sk_buff *sja1110_rcv_inband_control_extension(struct sk_buff *skb,
 		 * padding and trailer we need to account for the fact that
 		 * skb->data points to skb_mac_header(skb) + ETH_HLEN.
 		 */
-		if (pskb_trim_rcsum(skb, start_of_padding - ETH_HLEN))
+		if (pskb_trim_rcsum(skb, start_of_padding - ETH_HLEN)) {
+			kfree_skb(skb);
 			return NULL;
+		}
 	/* Trap-to-host frame, no timestamp trailer */
 	} else {
 		*source_port = SJA1110_RX_HEADER_SRC_PORT(rx_header);
@@ -653,6 +664,7 @@ static struct sk_buff *sja1110_rcv(struct sk_buff *skb,
 
 	if (!skb->dev) {
 		netdev_warn(netdev, "Couldn't decode source port\n");
+		kfree_skb(skb);
 		return NULL;
 	}
 
diff --git a/net/dsa/tag_trailer.c b/net/dsa/tag_trailer.c
index 4dce24cfe6a7..49c802c10ca6 100644
--- a/net/dsa/tag_trailer.c
+++ b/net/dsa/tag_trailer.c
@@ -30,22 +30,30 @@ static struct sk_buff *trailer_rcv(struct sk_buff *skb, struct net_device *dev)
 	u8 *trailer;
 	int source_port;
 
-	if (skb_linearize(skb))
+	if (skb_linearize(skb)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	trailer = skb_tail_pointer(skb) - 4;
 	if (trailer[0] != 0x80 || (trailer[1] & 0xf8) != 0x00 ||
-	    (trailer[2] & 0xef) != 0x00 || trailer[3] != 0x00)
+	    (trailer[2] & 0xef) != 0x00 || trailer[3] != 0x00) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	source_port = trailer[1] & 7;
 
 	skb->dev = dsa_conduit_find_user(dev, 0, source_port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
-	if (pskb_trim_rcsum(skb, skb->len - 4))
+	if (pskb_trim_rcsum(skb, skb->len - 4)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	return skb;
 }
diff --git a/net/dsa/tag_vsc73xx_8021q.c b/net/dsa/tag_vsc73xx_8021q.c
index af121a9aff7f..f4736a1a7a0f 100644
--- a/net/dsa/tag_vsc73xx_8021q.c
+++ b/net/dsa/tag_vsc73xx_8021q.c
@@ -44,6 +44,7 @@ vsc73xx_rcv(struct sk_buff *skb, struct net_device *netdev)
 	if (!skb->dev) {
 		dev_warn_ratelimited(&netdev->dev,
 				     "Couldn't decode source port\n");
+		kfree_skb(skb);
 		return NULL;
 	}
 
diff --git a/net/dsa/tag_xrs700x.c b/net/dsa/tag_xrs700x.c
index a05219f702c6..bb268020ee86 100644
--- a/net/dsa/tag_xrs700x.c
+++ b/net/dsa/tag_xrs700x.c
@@ -30,15 +30,21 @@ static struct sk_buff *xrs700x_rcv(struct sk_buff *skb, struct net_device *dev)
 
 	source_port = ffs((int)trailer[0]) - 1;
 
-	if (source_port < 0)
+	if (source_port < 0) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	skb->dev = dsa_conduit_find_user(dev, 0, source_port);
-	if (!skb->dev)
+	if (!skb->dev) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
-	if (pskb_trim_rcsum(skb, skb->len - 1))
+	if (pskb_trim_rcsum(skb, skb->len - 1)) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	/* Frame is forwarded by hardware, don't forward in software. */
 	dsa_default_offload_fwd_mark(skb);
diff --git a/net/dsa/tag_yt921x.c b/net/dsa/tag_yt921x.c
index f3ced99b1c85..294784ab6694 100644
--- a/net/dsa/tag_yt921x.c
+++ b/net/dsa/tag_yt921x.c
@@ -87,8 +87,10 @@ yt921x_tag_rcv(struct sk_buff *skb, struct net_device *netdev)
 	__be16 *tag;
 	u16 rx;
 
-	if (unlikely(!pskb_may_pull(skb, YT921X_TAG_LEN)))
+	if (unlikely(!pskb_may_pull(skb, YT921X_TAG_LEN))) {
+		kfree_skb(skb);
 		return NULL;
+	}
 
 	tag = dsa_etype_header_pos_rx(skb);
 
@@ -96,6 +98,7 @@ yt921x_tag_rcv(struct sk_buff *skb, struct net_device *netdev)
 		dev_warn_ratelimited(&netdev->dev,
 				     "Unexpected EtherType 0x%04x\n",
 				     ntohs(tag[0]));
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -104,6 +107,7 @@ yt921x_tag_rcv(struct sk_buff *skb, struct net_device *netdev)
 	if (unlikely((rx & YT921X_TAG_PORT_EN) == 0)) {
 		dev_warn_ratelimited(&netdev->dev,
 				     "Unexpected rx tag 0x%04x\n", rx);
+		kfree_skb(skb);
 		return NULL;
 	}
 
@@ -112,6 +116,7 @@ yt921x_tag_rcv(struct sk_buff *skb, struct net_device *netdev)
 	if (unlikely(!skb->dev)) {
 		dev_warn_ratelimited(&netdev->dev,
 				     "Couldn't decode source port %u\n", port);
+		kfree_skb(skb);
 		return NULL;
 	}
 
diff --git a/net/dsa/user.c b/net/dsa/user.c
index 8704c1a3a5b7..072fa76972cc 100644
--- a/net/dsa/user.c
+++ b/net/dsa/user.c
@@ -935,13 +935,12 @@ static netdev_tx_t dsa_user_xmit(struct sk_buff *skb, struct net_device *dev)
 		eth_skb_pad(skb);
 
 	/* Transmit function may have to reallocate the original SKB,
-	 * in which case it must have freed it. Only free it here on error.
+	 * in which case it must have freed it. Taggers will drop the
+	 * passed skb on error.
 	 */
 	nskb = p->xmit(skb, dev);
-	if (!nskb) {
-		kfree_skb(skb);
+	if (!nskb)
 		return NETDEV_TX_OK;
-	}
 
 	return dsa_enqueue_skb(nskb, dev);
 }

---
base-commit: f34c6b3a3c3d98f34918e1d2ea846a5acccac6d1
change-id: 20260616-dsa-fix-free-skb-bb028ce90802

Best regards,
--  
Linus Walleij <linusw@kernel.org>


^ permalink raw reply related

* Re: [BUG] netdevsim: KASAN slab-use-after-free in ref_tracker_free
From: saeed bishara @ 2026-06-16  9:34 UTC (permalink / raw)
  To: Shuangpeng Bai
  Cc: netdev, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-kernel
In-Reply-To: <178144969601.60470.14764529841344817811@gmail.com>

I tried Gemini; here is its analysis and a suggested fix:

This is a brilliant, subtle bug. Let's break this down with a rigorous
audit of the Linux network device refcounting architecture to see
exactly why this Use-After-Free is occurring.
The root cause is a classic "Reverse-Teardown Fallacy" colliding with
the kernel's dual-refcount lifecycle for network devices (pcpu_refcnt
vs. kobject refcount).

The Dual-Lifecycle Trap

In modern kernel networking, a struct net_device is kept alive by two
distinct mechanisms that must be carefully orchestrated:

- dev->pcpu_refcnt (Operational Lifecycle): Managed by dev_hold() and
  dev_put(). This tracks active operational references. The
  unregister_netdevice sequence blocks in netdev_run_todo() waiting for
  this to hit zero before it invokes free_netdev().
- dev->dev.kobj (Memory Lifecycle): The underlying memory structure
  manager. free_netdev() drops a reference to this. When the kobject
  refcount hits 0, device_release() is synchronously triggered, and the
  kmalloc-cg-8k slab is actually freed.

The Exact Race Sequence
Look at the teardown order in the error path of sysfs_rtnl_lock():

unbreak:
    sysfs_unbreak_active_protection(kn);  // LINE 121
    dev_put(ndev);                        // LINE 122

Normally, developers write error paths in the exact reverse order of
initialization. However, in this specific case, adhering to that
convention is fatal.
1. The Setup: sysfs_rtnl_lock starts by calling dev_hold(ndev),
   incrementing the pcpu_refcnt. It then calls
   sysfs_break_active_protection(), which explicitly calls kobject_get(),
   elevating the kobj memory refcount.
2. The Unregister Race: In parallel, another CPU initiates device
   teardown (e.g., echo 1 > del_device on netdevsim). The sysfs entries
   are deleted, and device_del() drops its primary kobj reference. The
   teardown thread hits netdev_run_todo() and blocks, waiting for
   sysfs_rtnl_lock to release its dev_hold.
3. The Lock Fails: sysfs_rtnl_lock fails to get the rtnl lock and jumps
   to unbreak.
4. The Fatal Drop (Line 121): sysfs_unbreak_active_protection(kn)
   executes. It calls kobject_put(). Because the unregistration path
   already dropped the standard kobject references, this is the absolute
   last kobject reference. device_release() is synchronously invoked, and
   the net_device memory is instantly freed to the slab.
5. The KASAN Trigger (Line 122): dev_put(ndev) executes. Under the hood,
   it attempts to call netdev_tracker_free(&ndev->ref_tracker, ...). It
   reads from the ndev structure that was just destroyed microseconds
   earlier. KASAN screams.

To fix this, the teardown convention must be deliberately violated to
respect the underlying memory dependencies.

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index a1b2c3d4e5f6..7f8e9d0c1b2a 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -118,8 +118,8 @@ static int sysfs_rtnl_lock(struct kobject *kobj, struct attribute *attr,
  return 0;

 unbreak:
- sysfs_unbreak_active_protection(kn);
  dev_put(ndev);
+ sysfs_unbreak_active_protection(kn);
  return ret;
 }
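
The ordering argument above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: struct toy_dev,
toy_dev_put(), toy_kobject_put() and toy_unbreak_fixed() are hypothetical
names invented for illustration and are not kernel APIs. It models one
counter for operational references and one for the backing memory, and
shows the order the suggested diff uses: touch the object first, drop the
memory reference last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct net_device: two independent counters. */
struct toy_dev {
	int op_refs;	/* models the dev_hold()/dev_put() count */
	int mem_refs;	/* models the kobject_get()/kobject_put() count */
};

/* Set once the backing memory has been released. */
static bool toy_freed;

/* Models dev_put(): it writes into the object, so memory must be alive. */
static void toy_dev_put(struct toy_dev *d)
{
	assert(!toy_freed);
	d->op_refs--;
}

/* Models kobject_put(): the last put releases the backing memory. */
static void toy_kobject_put(struct toy_dev *d)
{
	assert(!toy_freed);
	if (--d->mem_refs == 0) {
		free(d);
		toy_freed = true;
	}
}

/* Error path in the order of the suggested fix: object access first,
 * then the put that may free the memory.
 */
static void toy_unbreak_fixed(struct toy_dev *d)
{
	toy_dev_put(d);
	toy_kobject_put(d);
}
```

Calling toy_unbreak_fixed() on an object that holds one reference of each
kind releases it without touching freed memory. Swapping the two calls
would trip the assert in toy_dev_put(), which is the model's analogue of
the KASAN report. Whether the real kernel paths behave exactly like this
model is not something the sketch proves.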


On Mon, Jun 15, 2026 at 4:18 AM Shuangpeng Bai
<shuangpeng.kernel@gmail.com> wrote:
>
> Hi netdev maintainers,
>
> I hit the following KASAN report while testing an upstream kernel.
>
> The issue was reproduced with netdevsim. I have not confirmed whether this is
> specific to netdevsim or whether other net devices can trigger a similar issue.
>
> The KASAN report shows a slab-use-after-free in ref_tracker_free(), reached from
> sysfs_rtnl_lock() while reading phys_port_name.
>
> I reproduced this on commit: e8c2f9fdadee7cbc75134dc463c1e0d856d6e5c7 (May 25 2026)
>
> To help trigger the bug more reliably, we applied a minimal diagnostic patch
> that only adds delays and print statements.
>
> The reproducer and .config files are here.
> https://gist.github.com/shuangpengbai/b49765d646ec4610917015371aa1c3ca
>
> I'm happy to test debug patches or provide additional information.
>
> Reported-by: Shuangpeng Bai <shuangpeng.kernel@gmail.com>
>
> [ 3145.449971][T17497] BUG: KASAN: slab-use-after-free in ref_tracker_free (lib/ref_tracker.c:295)
> [ 3145.452089][T17497] Read of size 1 at addr ffff888107678598 by task cat/17497
> [ 3145.454439][T17497]
> [ 3145.454977][T17497] Tainted: [W]=WARN
> [ 3145.454980][T17497] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 3145.454985][T17497] Call Trace:
> [ 3145.454991][T17497]  <TASK>
> [ 3145.454994][T17497]  dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120)
> [ 3145.455002][T17497]  print_report (mm/kasan/report.c:378 mm/kasan/report.c:482)
> [ 3145.455028][T17497]  kasan_report (mm/kasan/report.c:595)
> [ 3145.455046][T17497]  ref_tracker_free (lib/ref_tracker.c:295)
> [ 3145.455083][T17497]  sysfs_rtnl_lock (include/linux/netdevice.h:4491 include/linux/netdevice.h:4508 include/linux/netdevice.h:4534 net/core/net-sysfs.c:122)
> [ 3145.455091][T17497]  phys_port_name_show (net/core/net-sysfs.c:665)
> [ 3145.455118][T17497]  dev_attr_show (drivers/base/core.c:2421)
> [ 3145.455128][T17497]  sysfs_kf_seq_show (fs/sysfs/file.c:65)
> [ 3145.455135][T17497]  seq_read_iter (fs/seq_file.c:231)
> [ 3145.455144][T17497]  vfs_read (fs/read_write.c:493 fs/read_write.c:574)
> [ 3145.455169][T17497]  ksys_read (fs/read_write.c:717)
> [ 3145.455181][T17497]  do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
> [ 3145.455188][T17497]  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
> [ 3145.455193][T17497] RIP: 0033:0x7fcf098c43ce
> [ 3145.455200][T17497] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 08 0b 00 e8 69 01 02 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
> [ 3145.455204][T17497] RSP: 002b:00007ffd05e76b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 3145.455211][T17497] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fcf098c43ce
> [ 3145.455214][T17497] RDX: 0000000000020000 RSI: 00007fcf095e4000 RDI: 0000000000000003
> [ 3145.455217][T17497] RBP: 00007fcf095e4000 R08: 00007fcf095e3010 R09: 0000000000000000
> [ 3145.455219][T17497] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000
> [ 3145.455222][T17497] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
> [ 3145.455227][T17497]  </TASK>
> [ 3145.455229][T17497]
> [ 3145.479014][T17497] Freed by task 17497 on cpu 0 at 3145.447575s:
> [ 3145.479559][T17497]  kasan_save_track (mm/kasan/common.c:57 mm/kasan/common.c:78)
> [ 3145.479963][T17497]  kasan_save_free_info (mm/kasan/generic.c:584)
> [ 3145.480411][T17497]  __kasan_slab_free (mm/kasan/common.c:253 mm/kasan/common.c:285)
> [ 3145.480813][T17497]  kfree (include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6251 mm/slub.c:6566)
> [ 3145.481148][T17497]  device_release (drivers/base/core.c:2542)
> [ 3145.481567][T17497]  kobject_put (lib/kobject.c:689 lib/kobject.c:720 include/linux/kref.h:65 lib/kobject.c:737)
> [ 3145.481951][T17497]  sysfs_rtnl_lock (net/core/net-sysfs.c:121)
> [ 3145.482351][T17497]  phys_port_name_show (net/core/net-sysfs.c:665)
> [ 3145.482782][T17497]  dev_attr_show (drivers/base/core.c:2421)
> [ 3145.483154][T17497]  sysfs_kf_seq_show (fs/sysfs/file.c:65)
> [ 3145.483586][T17497]  seq_read_iter (fs/seq_file.c:231)
> [ 3145.483975][T17497]  vfs_read (fs/read_write.c:493 fs/read_write.c:574)
> [ 3145.484334][T17497]  ksys_read (fs/read_write.c:717)
> [ 3145.484701][T17497]  do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
> [ 3145.485092][T17497]  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
> [ 3145.485592][T17497]
> [ 3145.485794][T17497] The buggy address belongs to the object at ffff888107678000
> [ 3145.485794][T17497]  which belongs to the cache kmalloc-cg-8k of size 8192
> [ 3145.486991][T17497] The buggy address is located 1432 bytes inside of
> [ 3145.486991][T17497]  freed 8192-byte region [ffff888107678000, ffff88810767a000)
> [ 3145.488159][T17497]
> [ 3145.488367][T17497] The buggy address belongs to the physical page:
>
>
> Best,
> Shuangpeng
>

^ permalink raw reply related

* [PATCH bpf v2 2/2] selftests/bpf: Cover partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-16  9:31 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo, paul.chaignon
In-Reply-To: <20260616093103.471444-1-sun.jian.kdev@gmail.com>

prog_run_opts already verifies that BPF_PROG_TEST_RUN returns -ENOSPC
for a short data_out buffer while still reporting the full output size
through data_size_out.

Add the same coverage for non-linear test_run output. Use pass-through
TC and XDP programs with a 9000-byte packet, a 64-byte linear data area,
and a 100-byte data_out buffer. The expected output spans both the linear
data and the first fragment.

Verify that test_run returns -ENOSPC, reports the full packet length
through data_size_out, and copies the packet prefix into data_out for
both non-linear skb and XDP frags paths.

Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
---
 .../selftests/bpf/prog_tests/prog_run_opts.c  | 72 +++++++++++++++++++
 .../selftests/bpf/progs/test_pkt_access.c     | 12 ++++
 2 files changed, 84 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
index 01f1d1b6715a..71af1ff02023 100644
--- a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
+++ b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
@@ -4,6 +4,10 @@
 
 #include "test_pkt_access.skel.h"
 
+#define NONLINEAR_PKT_LEN 9000
+#define NONLINEAR_LINEAR_DATA_LEN 64
+#define SHORT_OUT_LEN 100
+
 static const __u32 duration;
 
 static void check_run_cnt(int prog_fd, __u64 run_cnt)
@@ -20,6 +24,71 @@ static void check_run_cnt(int prog_fd, __u64 run_cnt)
 	      "incorrect number of repetitions, want %llu have %llu\n", run_cnt, info.run_cnt);
 }
 
+static void init_pkt(__u8 *pkt, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		pkt[i] = i & 0xff;
+}
+
+static void test_skb_nonlinear_data_out_partial(struct test_pkt_access *skel)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+	__u8 pkt[NONLINEAR_PKT_LEN];
+	__u8 out[SHORT_OUT_LEN];
+	struct __sk_buff skb = {};
+	int prog_fd, err;
+
+	init_pkt(pkt, sizeof(pkt));
+	memset(out, 0xa5, sizeof(out));
+
+	skb.data_end = NONLINEAR_LINEAR_DATA_LEN;
+
+	topts.data_in = pkt;
+	topts.data_size_in = sizeof(pkt);
+	topts.data_out = out;
+	topts.data_size_out = sizeof(out);
+	topts.ctx_in = &skb;
+	topts.ctx_size_in = sizeof(skb);
+
+	prog_fd = bpf_program__fd(skel->progs.tc_pass_prog);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+	ASSERT_EQ(err, -ENOSPC, "skb_nonlinear_partial_err");
+	ASSERT_EQ(topts.data_size_out, sizeof(pkt), "skb_nonlinear_partial_data_size_out");
+	ASSERT_OK(memcmp(out, pkt, sizeof(out)), "skb_nonlinear_partial_data_out");
+}
+
+static void test_xdp_nonlinear_data_out_partial(struct test_pkt_access *skel)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+	__u8 pkt[NONLINEAR_PKT_LEN];
+	__u8 out[SHORT_OUT_LEN];
+	struct xdp_md ctx = {};
+	int prog_fd, err;
+
+	init_pkt(pkt, sizeof(pkt));
+	memset(out, 0xa5, sizeof(out));
+
+	ctx.data = 0;
+	ctx.data_end = NONLINEAR_LINEAR_DATA_LEN;
+
+	topts.data_in = pkt;
+	topts.data_size_in = sizeof(pkt);
+	topts.data_out = out;
+	topts.data_size_out = sizeof(out);
+	topts.ctx_in = &ctx;
+	topts.ctx_size_in = sizeof(ctx);
+
+	prog_fd = bpf_program__fd(skel->progs.xdp_frags_pass_prog);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+	ASSERT_EQ(err, -ENOSPC, "xdp_nonlinear_partial_err");
+	ASSERT_EQ(topts.data_size_out, sizeof(pkt), "xdp_nonlinear_partial_data_size_out");
+	ASSERT_OK(memcmp(out, pkt, sizeof(out)), "xdp_nonlinear_partial_data_out");
+}
+
 void test_prog_run_opts(void)
 {
 	struct test_pkt_access *skel;
@@ -69,6 +138,9 @@ void test_prog_run_opts(void)
 	run_cnt += topts.repeat;
 	check_run_cnt(prog_fd, run_cnt);
 
+	test_skb_nonlinear_data_out_partial(skel);
+	test_xdp_nonlinear_data_out_partial(skel);
+
 cleanup:
 	if (skel)
 		test_pkt_access__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/test_pkt_access.c b/tools/testing/selftests/bpf/progs/test_pkt_access.c
index bce7173152c6..cd284401eebd 100644
--- a/tools/testing/selftests/bpf/progs/test_pkt_access.c
+++ b/tools/testing/selftests/bpf/progs/test_pkt_access.c
@@ -150,3 +150,15 @@ int test_pkt_access(struct __sk_buff *skb)
 
 	return TC_ACT_UNSPEC;
 }
+
+SEC("tc")
+int tc_pass_prog(struct __sk_buff *skb)
+{
+	return TC_ACT_OK;
+}
+
+SEC("xdp.frags")
+int xdp_frags_pass_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
-- 
2.43.0


^ permalink raw reply related

* [PATCH bpf v2 1/2] bpf: Fix partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-16  9:31 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo, paul.chaignon
In-Reply-To: <20260616093103.471444-1-sun.jian.kdev@gmail.com>

For non-linear test_run output, bpf_test_finish() derives the linear
data copy length from copy_size - frag_size. This only matches the
linear data length when copy_size is the full packet size.

When userspace provides a short data_out buffer, copy_size is clamped to
that buffer size. If copy_size is smaller than frag_size, the computed
length becomes negative and bpf_test_finish() returns -ENOSPC before
copying the packet prefix or updating data_size_out.

Compute the linear data length from the packet layout instead, and clamp
the linear copy length to copy_size. This preserves the expected
partial-copy semantics: return -ENOSPC, copy the packet prefix that fits
in data_out, and report the full packet length through data_size_out.

Fixes: 7855e0db150ad ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature")
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
---
 net/bpf/test_run.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2bc04feadfab..976e8fa31bc9 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -453,19 +453,16 @@ static int bpf_test_finish(const union bpf_attr *kattr,
 	}

 	if (data_out) {
-		int len = sinfo ? copy_size - frag_size : copy_size;
-
-		if (len < 0) {
-			err = -ENOSPC;
-			goto out;
-		}
+		u32 head_len = size - frag_size;
+		u32 len = min(copy_size, head_len);

 		if (copy_to_user(data_out, data, len))
 			goto out;

 		if (sinfo) {
-			int i, offset = len;
+			u32 offset = len;
 			u32 data_len;
+			int i;

 			for (i = 0; i < sinfo->nr_frags; i++) {
 				skb_frag_t *frag = &sinfo->frags[i];
-- 
2.43.0
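
The length computation changed by the patch above can be sketched in isolation as follows. This is a minimal userspace model, not the kernel code: the names old_linear_len(), new_linear_len() and MODEL_ENOSPC are hypothetical, and only the arithmetic from the diff is reproduced.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_ENOSPC 28 /* stand-in for ENOSPC, used only as an error marker */

/*
 * Pre-patch behaviour: the linear length is derived from the already
 * clamped copy_size. A short buffer makes it negative, and the caller
 * bails out with -ENOSPC before copying anything.
 */
static int old_linear_len(uint32_t copy_size, uint32_t frag_size, int nonlinear)
{
	int len = nonlinear ? (int)copy_size - (int)frag_size : (int)copy_size;

	return len < 0 ? -MODEL_ENOSPC : len;
}

/*
 * Post-patch behaviour: the linear length comes from the packet layout
 * (size - frag_size) and is then clamped to what fits in data_out.
 */
static uint32_t new_linear_len(uint32_t size, uint32_t frag_size,
			       uint32_t copy_size)
{
	uint32_t head_len;

	assert(frag_size <= size); /* fragments are part of the packet */
	head_len = size - frag_size;

	return copy_size < head_len ? copy_size : head_len;
}
```

For a 4096-byte packet whose fragments hold 3072 bytes, a 512-byte data_out makes old_linear_len() return the error marker, while new_linear_len() yields a 512-byte linear copy. With a full-size buffer both give the 1024-byte linear length.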


* [PATCH bpf v2 0/2] Fix partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-16  9:31 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo, paul.chaignon

When BPF_PROG_TEST_RUN returns non-linear output and userspace provides a
short data_out buffer, bpf_test_finish() can return -ENOSPC before copying
the packet prefix or updating data_size_out.

Fix this by deriving the linear copy length from the packet layout rather
than from the already-clamped copy_size. Add selftest coverage for both
non-linear skb and XDP frags paths.

---

Changes in v2:

* Fix the Fixes tag to point to the commit that introduced the shared
  non-linear copy-out logic.
* Drop skb-specific wording from the fix commit.
* Move the selftest from skb_load_bytes.c to prog_run_opts.c.
* Add XDP frags coverage in addition to non-linear skb coverage.

v1:
https://lore.kernel.org/bpf/20260615073856.152479-1-sun.jian.kdev@gmail.com/

Tested with:
  ./test_progs -t prog_run_opts -v
  ./test_progs -t skb_load_bytes -v
  ./test_progs -t xdp_pull_data -v

Sun Jian (2):
  bpf: Fix partial copy of non-linear test_run output
  selftests/bpf: Cover partial copy of non-linear test_run output

 net/bpf/test_run.c                            | 11 ++-
 .../selftests/bpf/prog_tests/prog_run_opts.c  | 72 +++++++++++++++++++
 .../selftests/bpf/progs/test_pkt_access.c     | 12 ++++
 3 files changed, 88 insertions(+), 7 deletions(-)

Range-diff:
1:  3691b07aa440 ! 1:  e5a0c426d4cb bpf: Fix partial copy of non-linear skb test_run output
    @@ Metadata
     Author: Sun Jian <sun.jian.kdev@gmail.com>
     
      ## Commit message ##
    -    bpf: Fix partial copy of non-linear skb test_run output
    +    bpf: Fix partial copy of non-linear test_run output
     
    -    For non-linear skbs, bpf_test_finish() derives the linear head copy
    -    length from copy_size - frag_size. This only matches the skb head length
    -    when copy_size is the full packet size.
    +    For non-linear test_run output, bpf_test_finish() derives the linear
    +    data copy length from copy_size - frag_size. This only matches the
    +    linear data length when copy_size is the full packet size.
     
         When userspace provides a short data_out buffer, copy_size is clamped to
         that buffer size. If copy_size is smaller than frag_size, the computed
         length becomes negative and bpf_test_finish() returns -ENOSPC before
         copying the packet prefix or updating data_size_out.
     
    -    Compute the linear head length from the skb layout instead, and clamp the
    -    head copy length to copy_size. This preserves the expected partial-copy
    -    semantics: return -ENOSPC, copy the packet prefix that fits in data_out,
    -    and report the full packet length through data_size_out.
    +    Compute the linear data length from the packet layout instead, and clamp
    +    the linear copy length to copy_size. This preserves the expected
    +    partial-copy semantics: return -ENOSPC, copy the packet prefix that fits
    +    in data_out, and report the full packet length through data_size_out.
     
    -    Fixes: 838baa351cee ("bpf: Craft non-linear skbs in BPF_PROG_TEST_RUN")
    +    Fixes: 7855e0db150ad ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature")
         Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
     
      ## net/bpf/test_run.c ##
2:  663847520f0b < -:  ------------ selftests/bpf: Cover partial copy of non-linear skb test_run output
-:  ------------ > 2:  680506532d97 selftests/bpf: Cover partial copy of non-linear test_run output
-- 
2.43.0



* Re: [PATCH net 1/1] net: smc: fix splice entry lifetime imbalance in smc_rx_splice
From: Dust Li @ 2026-06-16  9:30 UTC (permalink / raw)
  To: Ren Wei, linux-rdma, linux-s390, netdev
  Cc: alibuda, sidraya, wenjia, mjambigi, tonylu, guwen, ubraun,
	stefan.raspl, davem, yuantan098, zcliangcn, bird, lx24,
	d4n.for.sec
In-Reply-To: <192d1b44ed358ca143f44ef167d14153bccc51e9.1781097957.git.d4n.for.sec@gmail.com>

On 2026-06-11 01:54:11, Ren Wei wrote:
>From: Daming Li <d4n.for.sec@gmail.com>
>
>smc_rx_splice() hands candidate pages to splice_to_pipe() without taking
>references for the lifetime of each splice entry first. That breaks the
>splice ownership contract in the VM-backed RMB path.
>
>splice_to_pipe() drops unqueued entries through spd_release(), while
>queued entries are later dropped through the pipe buffer release
>callback. The current code only tries to take page references after the
>splice succeeds, and it derives the number of queued VM pages from a
>mutated offset value. This can underflow page refcounts and trigger a
>use-after-free. It also leaves the socket lifetime imbalanced in the
>multi-page VM case, where one sock_hold() can be followed by multiple
>sock_put() calls.
>
>Fix this by taking the page and socket references for every candidate
>splice entry before calling splice_to_pipe(), and by releasing the
>matching private state, page reference, and socket reference from
>smc_rx_spd_release() for entries that never get queued. This makes the
>SMC splice path follow the normal splice lifetime rules and removes the
>broken post-splice VM page counting entirely.
>
>Fixes: 9014db202cb7 ("smc: add support for splice()")
>Cc: stable@vger.kernel.org
>Reported-by: Yuan Tan <yuantan098@gmail.com>
>Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
>Reported-by: Xin Liu <bird@lzu.edu.cn>
>Assisted-by: Codex:GPT-5.4
>Co-developed-by: Liu Xiao <lx24@stu.ynu.edu.cn>
>Signed-off-by: Liu Xiao <lx24@stu.ynu.edu.cn>
>Signed-off-by: Daming Li <d4n.for.sec@gmail.com>
>Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>

The patch looks good to me; one minor nit below.

Reviewed-by: Dust Li <dust.li@linux.alibaba.com>


>---
> net/smc/smc_rx.c | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
>diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
>index c1d9b923938d..88aee0d93597 100644
>--- a/net/smc/smc_rx.c
>+++ b/net/smc/smc_rx.c
>@@ -150,18 +150,23 @@ static const struct pipe_buf_operations smc_pipe_ops = {
> static void smc_rx_spd_release(struct splice_pipe_desc *spd,
> 			       unsigned int i)
> {
>+	struct smc_spd_priv *priv = (struct smc_spd_priv *)spd->partial[i].private;
>+	struct sock *sk = &priv->smc->sk;
>+
>+	kfree(priv);
> 	put_page(spd->pages[i]);
>+	sock_put(sk);
> }
> 
> static int smc_rx_splice(struct pipe_inode_info *pipe, char *src, size_t len,
> 			 struct smc_sock *smc)
> {
> 	struct smc_link_group *lgr = smc->conn.lgr;
>-	int offset = offset_in_page(src);
> 	struct partial_page *partial;
> 	struct splice_pipe_desc spd;
> 	struct smc_spd_priv **priv;
> 	struct page **pages;
>+	int offset = offset_in_page(src);

Minor nit:
moving int offset = offset_in_page(src) down breaks the existing
reverse-xmas-tree declaration ordering. We keep this style in SMC.

Best regards,
Dust
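
The ordering nit can be illustrated with a small check. This is a sketch that assumes reverse-xmas-tree means each declaration line is no longer than the one above it. The helper is_reverse_xmas_tree() is hypothetical, and the two arrays contain only the six declarations visible in the quoted hunk, before and after the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true when line lengths never increase going down. */
static bool is_reverse_xmas_tree(const char *const *decls, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++) {
		if (strlen(decls[i]) > strlen(decls[i - 1]))
			return false;
	}

	return true;
}

/* Declaration order before the patch. */
static const char *const decls_before[] = {
	"struct smc_link_group *lgr = smc->conn.lgr;",
	"int offset = offset_in_page(src);",
	"struct partial_page *partial;",
	"struct splice_pipe_desc spd;",
	"struct smc_spd_priv **priv;",
	"struct page **pages;",
};

/* Declaration order after the patch: offset moved to the end. */
static const char *const decls_after[] = {
	"struct smc_link_group *lgr = smc->conn.lgr;",
	"struct partial_page *partial;",
	"struct splice_pipe_desc spd;",
	"struct smc_spd_priv **priv;",
	"struct page **pages;",
	"int offset = offset_in_page(src);",
};
```

Under that assumption, the pre-patch order passes the check and the post-patch order fails it, because the offset line is longer than the struct page line it now follows.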



* Re: net: thunderbolt: tbnet_poll() can overflow skb_shinfo()->frags[]
From: Mika Westerberg @ 2026-06-16  9:25 UTC (permalink / raw)
  To: Maoyi Xie
  Cc: Mika Westerberg, Yehezkel Bernat, Andrew Lunn, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel
In-Reply-To: <178159529251.2170936.1136950368069628844@maoyixie.com>

Hi,

On Tue, Jun 16, 2026 at 03:34:52PM +0800, Maoyi Xie wrote:
> Hi all,
> 
> After the recent skb frags[] overflow fixes (t7xx, cdc-phonet, f_phonet), I
> went looking for the same pattern. I think tbnet_poll() in
> drivers/net/thunderbolt/main.c has it too. I would appreciate it if you could
> take a look.
> 
> tbnet_poll() reassembles a ThunderboltIP packet that spans several frames into
> one skb. It adds one rx fragment per frame.
> 
> 	skb = net->skb;
> 	if (!skb) {
> 		skb = build_skb(...);
> 		...
> 		net->skb = skb;
> 	} else {
> 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> 				page, hdr_size, frame_size,
> 				TBNET_RX_PAGE_SIZE - hdr_size);
> 	}
> 
> Nothing checks skb_shinfo(skb)->nr_frags against MAX_SKB_FRAGS here. The frame
> count comes from the peer, in the frame header. tbnet_check_frame() only bounds
> it at the start of a packet.
> 
> 	if (frame_count == 0 || frame_count > TBNET_RING_SIZE / 4) {
> 		net->stats.rx_length_errors++;
> 		return false;
> 	}
> 
> TBNET_RING_SIZE is 256, so frame_count can be as large as 64. MAX_SKB_FRAGS is 17
> by default. Frame 0 builds the skb and every frame after it adds a fragment, so
> nr_frags can reach 63. Once nr_frags hits MAX_SKB_FRAGS, skb_add_rx_frag() writes
> one entry past skb_shinfo()->frags[]. The frame_size and MTU checks do not stop
> this. With small frames, 64 fragments stay well under TBNET_MAX_MTU.
> 
> So a malicious or buggy peer can send a packet with frame_count between 19 and
> 64. The frames only need to increment the way tbnet_check_frame() wants. That
> drives nr_frags past frags[] and overruns skb_shared_info.

I agree this can happen.

> The fix I had in mind mirrors f0813bcd2d9d ("net: wwan: t7xx: fix potential
> skb->frags overflow in RX path") and 600dc40554dc ("net: usb: cdc-phonet: fix
> skb frags[] overflow in rx_complete()"). Add the fragment only while there is
> room, and drop the packet otherwise.
> 
> 	-	} else {
> 	+	} else if (skb_shinfo(skb)->nr_frags < MAX_SKB_FRAGS) {
> 			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> 					page, hdr_size, frame_size,
> 					TBNET_RX_PAGE_SIZE - hdr_size);
> 	+	} else {
> 	+		net->stats.rx_length_errors++;
> 	+		__free_pages(page, TBNET_RX_PAGE_ORDER);
> 	+		dev_kfree_skb_any(net->skb);
> 	+		net->skb = NULL;
> 	+		continue;
> 		}
> 
> I do not have two Thunderbolt hosts, so this is from reading the code. I can put
> together a focused reproducer if that helps.
> 
> Does this look like a real overflow? And is the MAX_SKB_FRAGS guard the right
> place, or would you rather tighten the frame_count bound in tbnet_check_frame()?
> It has been there since the driver was added (e69b6c02b4c3), so it is a stable
> candidate. Happy to send a proper patch once you confirm.

I would prefer to do this in tbnet_check_frame(). Thanks!
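
One way the tighter bound in tbnet_check_frame() might look is sketched below as a standalone model. It is not a proposed patch: model_frame_count_ok() and the MODEL_* constants are hypothetical, with values taken from the report above. It also assumes that the later frames of a packet are still validated against the frame_count accepted at the start.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_RING_SIZE     256u /* TBNET_RING_SIZE as quoted in the report */
#define MODEL_MAX_SKB_FRAGS 17u  /* default MAX_SKB_FRAGS as quoted in the report */

/* Hypothetical stand-in for the frame_count part of tbnet_check_frame(). */
static bool model_frame_count_ok(uint32_t frame_count)
{
	/* Existing bound quoted in the report. */
	if (frame_count == 0 || frame_count > MODEL_RING_SIZE / 4)
		return false;

	/*
	 * Assumed extra bound: frame 0 builds the skb and every later
	 * frame adds one fragment, so at most MAX_SKB_FRAGS + 1 frames
	 * fit without running past frags[].
	 */
	if (frame_count > MODEL_MAX_SKB_FRAGS + 1)
		return false;

	return true;
}
```

With these values a packet of 18 frames is accepted and one of 19 is rejected, which matches the overflow range of 19 to 64 described in the report.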

