Netdev List

* Re: [PATCH net] tipc: free bearer discoverer via RCU to fix tipc_disc_rcv UAF
From: Tung Quang Nguyen @ 2026-06-16 11:34 UTC (permalink / raw)
  To: Sam P
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev@vger.kernel.org,
	tipc-discussion@lists.sourceforge.net,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org, Jon Maloy,
	bestswngs@gmail.com
In-Reply-To: <fa2e0cfb-9d60-4295-8a46-f69ce1229094@bynar.io>

Subject: Re: [PATCH net] tipc: free bearer discoverer via RCU to fix tipc_disc_rcv UAF

> Oops, I missed that patch! I'm not sure what the etiquette
> is in this case, but I'm happy to defer to the original
> submitter (CCd) if they're working on a new patch and/or
> add any appropriate trailers to my v2.

> I've prepared a v2 to submit after the ~24h period,
> addressing your changes and taking into account Eric's
> feedback from the earlier submission as well
> (adding an rcu_barrier() in tipc_exit()).
Eric's concern is correct, but it needs to be addressed in a separate patch
because it is a pre-existing issue. It requires a different reproduction
(loading/unloading the TIPC kernel module) and other considerations
(calling call_rcu() from a timer, etc.).
For now, I think you just need to address my comment.



* [PATCH v3] net: mvneta_bm: add suspend/resume support to prevent crash after resume
From: Yun Zhou @ 2026-06-16 11:25 UTC (permalink / raw)
  To: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: netdev, linux-kernel, yun.zhou

The mvneta driver uses the hardware Buffer Manager (BM) for RX buffer
allocation. During suspend, mvneta disables its clock, causing BM to
lose all buffer address state. On resume, mvneta_bm_port_init() re-
attaches the BM pool to the NIC, but BM hardware returns stale/garbage
buffer addresses. When NAPI poll processes these buffers, DMA cache
sync hits an invalid virtual address causing a kernel panic:

 Unable to handle kernel paging request at virtual address b0000080
 PC is at v7_dma_inv_range
 Call trace:
  v7_dma_inv_range from arch_sync_dma_for_cpu+0x94/0x158
  arch_sync_dma_for_cpu from __dma_sync_single_for_cpu+0xc4/0x15c
  __dma_sync_single_for_cpu from mvneta_rx_swbm+0x6c8/0xf48
  mvneta_rx_swbm from mvneta_poll+0x6fc/0x70c
  mvneta_poll from __napi_poll.constprop.0+0x2c/0x1e0
  __napi_poll.constprop.0 from net_rx_action+0x160/0x2c4
  net_rx_action from handle_softirqs+0xd8/0x2b8
  handle_softirqs from run_ksoftirqd+0x30/0x94
  run_ksoftirqd from smpboot_thread_fn+0x100/0x204
  smpboot_thread_fn from kthread+0xf4/0x110
  kthread from ret_from_fork+0x14/0x28

Fix by adding suspend/resume callbacks to the BM driver:

- suspend: drain all buffers (with DMA unmapping), free the BPPE
  regions, and reset pool state to FREE before stopping BM and gating
  the clock.

- resume: enable the clock, reinitialize BM defaults, and restore pool
  read/write pointers and size registers. Pool allocation and buffer
  refill are handled by mvneta_resume() through the normal
  mvneta_bm_port_init() path, which sees pools as FREE and performs
  full initialization identical to probe.

Add a device_link (DL_FLAG_AUTOREMOVE_CONSUMER) in mvneta_probe to
guarantee BM resumes before mvneta and suspends after mvneta.

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3:
  - Restore per-pool POOL_SIZE_REG, POOL_READ_PTR_REG, and
    POOL_WRITE_PTR_REG in resume, since clock gating loses all BM
    register state.
  - Check device_link_add() return value and emit dev_warn on failure.
  - Replace SIMPLE_DEV_PM_OPS (deprecated) with
    DEFINE_SIMPLE_DEV_PM_OPS and pm_sleep_ptr(), removing the
    #ifdef CONFIG_PM_SLEEP guard.
  - Add dev_warn in suspend if not all buffers could be freed.

v2:
  - Drain buffers via mvneta_bm_bufs_free() in suspend instead of only
    stopping BM and gating the clock. This ensures proper DMA unmapping
    and avoids buffer leaks.
  - Free the BPPE DMA-coherent region in suspend so that resume takes
    the full probe-time initialization path (alloc + fill), eliminating
    the need to modify mvneta_bm_pool_create().
  - Reset pool type to MVNETA_BM_FREE in suspend so mvneta_bm_pool_use()
    correctly re-creates and refills pools on resume.
  - Check clk_prepare_enable() return value in resume.
  - Add device_link between mvneta (consumer) and mvneta_bm (supplier)
    to guarantee correct suspend/resume ordering.

 drivers/net/ethernet/marvell/mvneta.c    |  7 +++
 drivers/net/ethernet/marvell/mvneta_bm.c | 58 ++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 0c061fb0ed07..b4a845f04c05 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -5678,6 +5678,13 @@ static int mvneta_probe(struct platform_device *pdev)
 					 "use SW buffer management\n");
 				mvneta_bm_put(pp->bm_priv);
 				pp->bm_priv = NULL;
+			} else {
+				/* Ensure BM suspends after us, resumes before us */
+				if (!device_link_add(&pdev->dev,
+						     &pp->bm_priv->pdev->dev,
+						     DL_FLAG_AUTOREMOVE_CONSUMER))
+					dev_warn(&pdev->dev,
+						 "failed to create device link to BM\n");
 			}
 		}
 		/* Set RX packet offset correction for platforms, whose
diff --git a/drivers/net/ethernet/marvell/mvneta_bm.c b/drivers/net/ethernet/marvell/mvneta_bm.c
index 6bb380494919..85162a43eaf6 100644
--- a/drivers/net/ethernet/marvell/mvneta_bm.c
+++ b/drivers/net/ethernet/marvell/mvneta_bm.c
@@ -477,6 +477,63 @@ static void mvneta_bm_remove(struct platform_device *pdev)
 	clk_disable_unprepare(priv->clk);
 }
 
+static int mvneta_bm_suspend(struct device *dev)
+{
+	struct mvneta_bm *priv = dev_get_drvdata(dev);
+	int i;
+
+	/* Drain buffers and free pool resources while BM is still clocked */
+	for (i = 0; i < MVNETA_BM_POOLS_NUM; i++) {
+		struct mvneta_bm_pool *bm_pool = &priv->bm_pools[i];
+		int size_bytes;
+
+		if (bm_pool->type == MVNETA_BM_FREE)
+			continue;
+
+		mvneta_bm_bufs_free(priv, bm_pool, bm_pool->port_map);
+		if (bm_pool->hwbm_pool.buf_num)
+			dev_warn(&priv->pdev->dev,
+				 "pool %d: %d buffers not freed\n",
+				 bm_pool->id, bm_pool->hwbm_pool.buf_num);
+
+		size_bytes = sizeof(u32) * bm_pool->hwbm_pool.size;
+		dma_free_coherent(&priv->pdev->dev, size_bytes,
+				  bm_pool->virt_addr, bm_pool->phys_addr);
+		bm_pool->virt_addr = NULL;
+		bm_pool->type = MVNETA_BM_FREE;
+	}
+
+	mvneta_bm_write(priv, MVNETA_BM_COMMAND_REG, MVNETA_BM_STOP_MASK);
+	clk_disable_unprepare(priv->clk);
+	return 0;
+}
+
+static int mvneta_bm_resume(struct device *dev)
+{
+	struct mvneta_bm *priv = dev_get_drvdata(dev);
+	int i, err;
+
+	err = clk_prepare_enable(priv->clk);
+	if (err)
+		return err;
+
+	/* Reinitialize BM hardware; pools are refilled by mvneta_resume() */
+	mvneta_bm_default_set(priv);
+
+	/* Restore pool registers lost during clock gating */
+	for (i = 0; i < MVNETA_BM_POOLS_NUM; i++) {
+		mvneta_bm_write(priv, MVNETA_BM_POOL_READ_PTR_REG(i), 0);
+		mvneta_bm_write(priv, MVNETA_BM_POOL_WRITE_PTR_REG(i), 0);
+		mvneta_bm_write(priv, MVNETA_BM_POOL_SIZE_REG(i),
+				priv->bm_pools[i].hwbm_pool.size);
+	}
+
+	mvneta_bm_write(priv, MVNETA_BM_COMMAND_REG, MVNETA_BM_START_MASK);
+	return 0;
+}
+
+static DEFINE_SIMPLE_DEV_PM_OPS(mvneta_bm_pm_ops, mvneta_bm_suspend, mvneta_bm_resume);
+
 static const struct of_device_id mvneta_bm_match[] = {
 	{ .compatible = "marvell,armada-380-neta-bm" },
 	{ }
@@ -489,6 +546,7 @@ static struct platform_driver mvneta_bm_driver = {
 	.driver = {
 		.name = MVNETA_BM_DRIVER_NAME,
 		.of_match_table = mvneta_bm_match,
+		.pm = pm_sleep_ptr(&mvneta_bm_pm_ops),
 	},
 };
 
-- 
2.43.0



* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: kernel test robot @ 2026-06-16 11:21 UTC (permalink / raw)
  To: Luigi Rizzo, rizzo.unipi, m.szyprowski, robin.murphy, willemb,
	kuniyu, davem, edumazet, kuba, pabeni
  Cc: oe-kbuild-all, gregkh, rafael, akpm, david, netdev, linux-mm,
	iommu, driver-core, linux-kernel
In-Reply-To: <20260615234220.3946885-1-lrizzo@google.com>

Hi Luigi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v7.1 next-20260615]
[cannot apply to driver-core/driver-core-testing driver-core/driver-core-next driver-core/driver-core-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luigi-Rizzo/swiotlb-avoid-double-copy-with-swiotlb-on-tx-socket/20260616-074655
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260615234220.3946885-1-lrizzo%40google.com
patch subject: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
config: arm-randconfig-r122-20260616 (https://download.01.org/0day-ci/archive/20260616/202606161921.OPkgBApm-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 16.1.0
sparse: v0.6.5-rc1
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260616/202606161921.OPkgBApm-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606161921.OPkgBApm-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/dma/swiotlb.c: note: in included file (through include/linux/dma-direct.h):
>> include/linux/swiotlb.h:229:65: sparse: sparse: Using plain integer as NULL pointer
>> include/linux/swiotlb.h:229:65: sparse: sparse: Using plain integer as NULL pointer

vim +229 include/linux/swiotlb.h

   224	
   225	static inline bool is_zerocopy_swiotlb_folio(struct page *page)
   226	{
   227		struct folio *folio = page_folio(page);
   228	
 > 229		return folio_test_zcswiotlb(folio) && folio->private != 0;
   230	}
   231	
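
For reference, sparse is flagging the comparison of the pointer
folio->private against the integer literal 0 on line 229. A minimal
userspace sketch of the usual fix is below. Note that struct fake_folio
and fake_folio_test_zcswiotlb() are stand-ins invented for illustration;
they are not the kernel's struct folio or the helper from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct folio; only the fields used here. */
struct fake_folio {
	bool zcswiotlb;	/* models the flag tested by folio_test_zcswiotlb() */
	void *private;
};

static inline bool fake_folio_test_zcswiotlb(const struct fake_folio *folio)
{
	return folio->zcswiotlb;
}

/*
 * Same logic as line 229, but the pointer is compared against NULL.
 * A plain truth test ("&& folio->private") would also avoid the warning.
 */
static inline bool is_zerocopy_swiotlb_fake_folio(const struct fake_folio *folio)
{
	return fake_folio_test_zcswiotlb(folio) && folio->private != NULL;
}
```

The behaviour is unchanged; only the spelling of the null check differs.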

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH net 4/4] net: ti: icssg: Fix XSK zero copy TX during application wakeup
From: Meghana Malladi @ 2026-06-16 11:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: diogo.ivo, haokexin, vadim.fedorenko, devnexen, horms,
	jacob.e.keller, sdf, john.fastabend, hawk, daniel, ast, pabeni,
	edumazet, davem, andrew+netdev, bpf, linux-kernel, netdev,
	linux-arm-kernel, srk, Vignesh Raghavendra, Roger Quadros,
	danishanwar
In-Reply-To: <20260615162157.3748bcda@kernel.org>

Hi Jakub,

On 6/16/26 04:51, Jakub Kicinski wrote:
> On Fri, 12 Jun 2026 00:27:44 +0530 Meghana Malladi wrote:
>> @@ -169,9 +169,6 @@ static int emac_xsk_xmit_zc(struct prueth_emac *emac,
>>   
>>   		num_tx++;
>>   	}
>> -
>> -	xsk_tx_release(tx_chn->xsk_pool);
>> -	return num_tx;
> 
> Why are you deleting this?
> 

xsk_sendmsg() also calls this, without an RCU lock, when transmitting the
packets if the xmit was successful, so I assumed it was not required here
and removed it.

>>   }
>>   
>>   void prueth_xmit_free(struct prueth_tx_chn *tx_chn,
>> @@ -279,9 +276,6 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
>>   		num_tx++;
>>   	}
>>   
>> -	if (!num_tx)
>> -		return 0;
> 
> Does something prevent us from running all this code if budget is 0?
> If budget is 0 we can complete normal Tx with skbs but we must
> not touch any AF-XDP related state.
> 

Could you elaborate? I couldn't interpret your comment here.

>>   	netif_txq = netdev_get_tx_queue(ndev, chn);
>>   	netdev_tx_completed_queue(netif_txq, num_tx, total_bytes);
>>   
>> @@ -306,7 +300,9 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
>>   
>>   		netif_txq = netdev_get_tx_queue(ndev, chn);
>>   		txq_trans_cond_update(netif_txq);
> 
> This looks misplaced, now we will hit it even if we didn't complete
> or submit any Tx.
> 

This code needs to be hit for packet transmission in zero copy mode.
emac_xsk_xmit_zc() submits the packets to the DMA in NAPI context,
when the application wakes up the driver and triggers NAPI. Once the DMA
transfer is done, the irq is triggered and NAPI is called, which handles
the Tx packet completion and submits the next batch of Tx packets to the
DMA.

The if (tx_chn->xsk_pool) check ensures this runs for zero copy only.
Also, the (!num_tx) check above returns early during the application
wakeup (where budget is zero), hence it is removed.

>> +		__netif_tx_lock(netif_txq, smp_processor_id());
>>   		emac_xsk_xmit_zc(emac, chn);
>> +		__netif_tx_unlock(netif_txq);
>>   	}



* [PATCH net 2/2] devlink: Fix parent ref leak on tc-bw failure
From: Cosmin Ratiu @ 2026-06-16 11:06 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Michal Wilczynski, Carolina Jubran,
	Cosmin Ratiu, Mark Bloch, Tariq Toukan
In-Reply-To: <20260616110633.1449432-1-cratiu@nvidia.com>

When a node is created via rate-new with tc-bw and a parent node,
devlink_nl_rate_set() executes the sequence of ops. It bails out on the
first failure and doesn't rollback anything. For most things that is
fine (setting some numbers), but the parent set can leak if there's
another failure after that.

That is precisely what happens when parent setting isn't the last block
in the function. After the referenced "Fixes" commit, when tc-bw fails
to be set the function bails out after having set the parent and
incremented its refcount.
There are two callers:
- devlink_nl_rate_set_doit() is fine, it just reports the error.
- but devlink_nl_rate_new_doit() frees the newly created node and leaks
  the parent refcnt.

Fix that by reordering the blocks so parent setting is last and adding a
comment explaining this so future modifications preserve the ordering
(hopefully).

Fixes: 566e8f108fc7 ("devlink: Extend devlink rate API with traffic classes bandwidth management")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
---
 net/devlink/rate.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 210e26c6cfa0..533d21b028a7 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -486,16 +486,19 @@ static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
 		devlink_rate->tx_weight = weight;
 	}
 
-	nla_parent = attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME];
-	if (nla_parent) {
-		err = devlink_nl_rate_parent_node_set(devlink_rate, info,
-						      nla_parent);
+	if (attrs[DEVLINK_ATTR_RATE_TC_BWS]) {
+		err = devlink_nl_rate_tc_bw_set(devlink_rate, info);
 		if (err)
 			return err;
 	}
 
-	if (attrs[DEVLINK_ATTR_RATE_TC_BWS]) {
-		err = devlink_nl_rate_tc_bw_set(devlink_rate, info);
+	/* Keep parent setting last because it takes a reference. This function
+	 * has no rollback, so failing after taking the ref would leak it.
+	 */
+	nla_parent = attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME];
+	if (nla_parent) {
+		err = devlink_nl_rate_parent_node_set(devlink_rate, info,
+						      nla_parent);
 		if (err)
 			return err;
 	}
-- 
2.53.0
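
The ordering argument in the commit message can be modeled in userspace.
This is only a sketch: struct node, set_tc_bw(), set_parent() and
node_set() are names made up for illustration and are not the devlink
code. The point is just that, in a function without rollback, the step
that takes a reference runs after every step that can fail.

```c
/* Hypothetical stand-in for a rate node; not struct devlink_rate. */
struct node {
	int refcnt;
	struct node *parent;
	int tc_bw;
};

/* Fallible step with nothing to undo on failure (models tc-bw setting). */
static int set_tc_bw(struct node *n, int bw)
{
	if (bw < 0)
		return -1;
	n->tc_bw = bw;
	return 0;
}

/* Step that takes a reference on the parent (models parent setting). */
static void set_parent(struct node *n, struct node *parent)
{
	n->parent = parent;
	parent->refcnt++;
}

/*
 * No rollback here, so the reference-taking step is kept last:
 * an earlier failure returns before the parent refcount is touched.
 */
static int node_set(struct node *n, struct node *parent, int bw)
{
	int err;

	err = set_tc_bw(n, bw);
	if (err)
		return err;

	if (parent)
		set_parent(n, parent);
	return 0;
}
```

With the two steps swapped, a failing set_tc_bw() would return with the
parent refcount already incremented, which is the leak described above
once the caller frees the new node.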



* [PATCH net 0/2] devlink: Fix a couple parent ref leaks
From: Cosmin Ratiu @ 2026-06-16 11:06 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Michal Wilczynski, Carolina Jubran,
	Cosmin Ratiu, Mark Bloch, Tariq Toukan

These two patches fix parent ref leaks on errors.

Cosmin Ratiu (2):
  devlink: Fix parent ref leak in devl_rate_node_create()
  devlink: Fix parent ref leak on tc-bw failure

 net/devlink/rate.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

-- 
2.53.0



* [PATCH net 1/2] devlink: Fix parent ref leak in devl_rate_node_create()
From: Cosmin Ratiu @ 2026-06-16 11:06 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Michal Wilczynski, Carolina Jubran,
	Cosmin Ratiu, Mark Bloch, Tariq Toukan
In-Reply-To: <20260616110633.1449432-1-cratiu@nvidia.com>

In the original commit the function bails out on kstrdup failure,
forgetting to decrement the refcnt of the parent.

Fix that by moving the parent refcnt setting after kstrdup.

Fixes: caba177d7f4d ("devlink: Enable creation of the devlink-rate nodes from the driver")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
---
 net/devlink/rate.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 41be2d6c2954..210e26c6cfa0 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -725,11 +725,6 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
 	if (!rate_node)
 		return ERR_PTR(-ENOMEM);
 
-	if (parent) {
-		rate_node->parent = parent;
-		refcount_inc(&rate_node->parent->refcnt);
-	}
-
 	rate_node->type = DEVLINK_RATE_TYPE_NODE;
 	rate_node->devlink = devlink;
 	rate_node->priv = priv;
@@ -740,6 +735,11 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
 		return ERR_PTR(-ENOMEM);
 	}
 
+	if (parent) {
+		rate_node->parent = parent;
+		refcount_inc(&rate_node->parent->refcnt);
+	}
+
 	refcount_set(&rate_node->refcnt, 1);
 	list_add(&rate_node->list, &devlink->rate_list);
 	devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
-- 
2.53.0
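
A userspace sketch of the constructor pattern this patch moves to,
assuming a simplified node type: struct rate_node, dup_name() and
rate_node_create() below are hypothetical stand-ins, not the devlink
structures. The parent reference is taken only once the name duplication
has succeeded, so the error path has nothing to put back.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a rate node. */
struct rate_node {
	int refcnt;
	struct rate_node *parent;
	char *name;
};

/* Like kstrdup(): returns NULL for a NULL source or on allocation failure. */
static char *dup_name(const char *name)
{
	size_t len;
	char *copy;

	if (!name)
		return NULL;
	len = strlen(name) + 1;
	copy = malloc(len);
	if (copy)
		memcpy(copy, name, len);
	return copy;
}

static struct rate_node *rate_node_create(const char *name,
					  struct rate_node *parent)
{
	struct rate_node *node = calloc(1, sizeof(*node));

	if (!node)
		return NULL;

	node->name = dup_name(name);
	if (!node->name) {
		free(node);	/* no reference taken yet, nothing else to undo */
		return NULL;
	}

	/* Take the parent reference only after the fallible steps succeeded. */
	if (parent) {
		node->parent = parent;
		parent->refcnt++;
	}

	node->refcnt = 1;
	return node;
}
```

Passing a NULL name exercises the failure path; the parent's refcount
stays where it was.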



* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Mostafa Saleh @ 2026-06-16 11:06 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: Jakub Kicinski, rizzo.unipi, m.szyprowski, robin.murphy, willemb,
	kuniyu, davem, edumazet, pabeni, gregkh, rafael, akpm, david,
	netdev, linux-mm, iommu, driver-core, linux-kernel
In-Reply-To: <CAMOZA0KAHKsvA9yRcdrjG13S+=rJhw-Cvnw2WdLjGGY0azG0kw@mail.gmail.com>

On Tue, Jun 16, 2026 at 02:33:52AM +0200, Luigi Rizzo wrote:
> On Tue, Jun 16, 2026 at 2:25 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 15 Jun 2026 23:42:20 +0000 Luigi Rizzo wrote:
> > > The use of swiotlb causes an extra data copy on I/O.  For tx sockets,
> > > especially with greedy senders, this has a high chance of happening in
> > > the softirq handler for tx network interrupts, creating a significant
> > > performance bottleneck.
> >
> > What's the use case? I associate swiotlb with debug / testing mostly,
> > so it'd be useful for people like me to explain why you care.
> 
> Ah sorry, I forgot to mention.
> swiotlb is used in guest kernels for confidential computing VMs.
> Ordinary memory pages are encrypted and the host or devices
> have no way to decrypt them, so the kernel must use
> unencrypted bounce buffers to exchange data with I/O devices.

I started looking into the same problem recently, to reduce the
bouncing in protected KVM (pKVM) confidential guests.
My first attempt was to update dma_direct_map_phys() to skip
bouncing and do inline memory decryption (for pKVM that is a hypercall
which updates the stage-2 page tables), however, that was really slow
compared to the memcpy in bouncing even for massive pages.
My conclusion was similar: we need to solve this at construction
by allocating this memory from a pre-decrypted pool (which
does not have to be part of the SWIOTLB).
My initial idea was to teach some of the kernel subsystems (SKB,
BLK, SLAB) about "CoCo allocators" that allocate decrypted memory,
as this is not a net specific problem.

I am still looking into this and was planning to bring it up at the
upcoming LPC.
I will give this patch a try. However, I believe that we need a more
generalised concept for CoCo pre-decrypted allocators in the kernel.

Thanks,
Mostafa

> 
> cheers
> luigi
> 


* Re: [PATCH net] tipc: free bearer discoverer via RCU to fix tipc_disc_rcv UAF
From: Sam P @ 2026-06-16 11:04 UTC (permalink / raw)
  To: Tung Quang Nguyen
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev@vger.kernel.org,
	tipc-discussion@lists.sourceforge.net,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org, Jon Maloy,
	bestswngs
In-Reply-To: <GV1P189MB19887A9A37B5B170C112DF8EC6E52@GV1P189MB1988.EURP189.PROD.OUTLOOK.COM>

On 16/06/2026 08:50, Tung Quang Nguyen wrote: 
> A similar patch was submitted 6 days ago: https://patchwork.kernel.org/project/netdevbpf/patch/20260610153349.2546041-2-bestswngs@gmail.com/
> 
> I have not received an updated patch from the submitter yet.
> Your patch has the same coding style issue (long line, over 80 columns), see linux/Documentation/process/coding-style.rst
> 
> If you break the long line into 2 lines and submit again, I think I can acknowledge your patch.

Oops, I missed that patch! I'm not sure what the etiquette
is in this case, but I'm happy to defer to the original
submitter (CCd) if they're working on a new patch and/or
add any appropriate trailers to my v2.

I've prepared a v2 to submit after the ~24h period,
addressing your changes and taking into account Eric's
feedback from the earlier submission as well
(adding an rcu_barrier() in tipc_exit()).



* Re: [PATCH net-next] i40e: add devlink parameter for Flow Director ATR sample rate
From: mohammad heib @ 2026-06-16 11:03 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, jiri, davem, edumazet, kuba, pabeni, horms, corbet,
	anthony.l.nguyen, przemyslaw.kitszel, andrew+netdev
In-Reply-To: <20260614161131.192068-1-mheib@redhat.com>



On 6/14/26 7:11 PM, mheib@redhat.com wrote:
> From: Mohammad Heib <mheib@redhat.com>
> 
> The i40e driver uses Flow Director ATR to periodically update flow
> steering information for active TCP flows. The update frequency is
> currently controlled by I40E_DEFAULT_ATR_SAMPLE_RATE and is fixed at
> driver build time.
> 
> On systems with a large number of queues and high-rate TCP workloads,
> the default sampling interval can result in frequent Flow Director
> reprogramming for long-lived flows.
> 
> The amount of TCP packet reordering observed on some systems is
> sensitive to the ATR sampling interval. Increasing the interval reduces
> Flow Director programming activity and can significantly reduce the
> associated reordering.
> 
> Since the optimal sampling interval depends on the workload and system
> configuration, a single fixed value is not suitable for all deployments.
> 
> Add a devlink parameter to allow administrators to tune the ATR sample
> rate at runtime without rebuilding the driver or disabling ATR
> functionality entirely.
> 
> Signed-off-by: Mohammad Heib <mheib@redhat.com>
> ---
>   Documentation/networking/devlink/i40e.rst     | 19 ++++++
>   drivers/net/ethernet/intel/i40e/i40e.h        |  1 +
>   .../net/ethernet/intel/i40e/i40e_devlink.c    | 65 +++++++++++++++++++
>   drivers/net/ethernet/intel/i40e/i40e_main.c   |  4 +-
>   drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  4 +-
>   5 files changed, 90 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/networking/devlink/i40e.rst b/Documentation/networking/devlink/i40e.rst
> index 51c887f0dc83..704469aa9acf 100644
> --- a/Documentation/networking/devlink/i40e.rst
> +++ b/Documentation/networking/devlink/i40e.rst
> @@ -40,6 +40,25 @@ Parameters
>   
>           The default value is ``0`` (internal calculation is used).
>   
> +.. list-table:: Driver specific parameters implemented
> +    :widths: 5 5 90
> +
> +    * - Name
> +      - Mode
> +      - Description
> +    * - ``atr_sample_rate``
> +      - runtime
> +      - Controls how frequently Flow Director ATR updates flow steering
> +        information for active TCP flows.
> +
> +        ATR programs Flow Director entries based on sampled transmitted
> +        packets. The sampling interval is specified as the number of
> +        transmitted packets between ATR updates.
> +
> +        Lower values increase Flow Director programming activity, while
> +        higher values reduce the update frequency.
> +
> +        The default value is ``20``.
>   
>   Info versions
>   =============
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
> index 1b6a8fbaa648..88eb40ee45f0 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -487,6 +487,7 @@ struct i40e_pf {
>   	u16 rss_size_max;          /* HW defined max RSS queues */
>   	u16 fdir_pf_filter_count;  /* num of guaranteed filters for this PF */
>   	u16 num_alloc_vsi;         /* num VSIs this driver supports */
> +	u32 atr_sample_rate;
>   	bool wol_en;
>   
>   	struct hlist_head fdir_filter_list;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_devlink.c b/drivers/net/ethernet/intel/i40e/i40e_devlink.c
> index 229179ccc131..16e51762db45 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_devlink.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_devlink.c
> @@ -33,12 +33,77 @@ static int i40e_max_mac_per_vf_get(struct devlink *devlink,
>   	return 0;
>   }
>   
> +static int i40e_atr_sample_rate_set(struct devlink *devlink,
> +				    u32 id,
> +				    struct devlink_param_gset_ctx *ctx,
> +				    struct netlink_ext_ack *extack)
> +{
> +	struct i40e_pf *pf = devlink_priv(devlink);
> +	struct i40e_vsi *vsi;
> +	u32 sample_rate = ctx->val.vu32;
> +	int i;
> +
> +	pf->atr_sample_rate = sample_rate;
> +
> +	if (!test_bit(I40E_FLAG_FD_ATR_ENA, pf->flags))
> +		return 0;
> +
> +	vsi = i40e_pf_get_main_vsi(pf);
> +	if (!vsi)
> +		return 0;
> +
> +	for (i = 0; i < vsi->num_queue_pairs; i++) {
> +		if (!vsi->tx_rings[i])
> +			continue;
> +		vsi->tx_rings[i]->atr_sample_rate = sample_rate;
> +		vsi->tx_rings[i]->atr_count = 0;
> +	}
> +
> +	return 0;
> +}
> +
> +static int i40e_atr_sample_rate_get(struct devlink *devlink,
> +				    u32 id,
> +				    struct devlink_param_gset_ctx *ctx,
> +				    struct netlink_ext_ack *extack)
> +{
> +	struct i40e_pf *pf = devlink_priv(devlink);
> +
> +	ctx->val.vu32 = pf->atr_sample_rate;
> +
> +	return 0;
> +}
> +
> +static int i40e_atr_sample_rate_validate(struct devlink *devlink, u32 id,
> +					 union devlink_param_value val,
> +					 struct netlink_ext_ack *extack)
> +{
> +	if (!val.vu32) {
> +		NL_SET_ERR_MSG_MOD(extack,
> +				   "ATR sample rate must be greater than 0");
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +enum i40e_dl_param_id {
> +	I40E_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
> +	I40E_DEVLINK_PARAM_ID_ATR_SAMPLE_RATE,
> +};
> +
>   static const struct devlink_param i40e_dl_params[] = {
>   	DEVLINK_PARAM_GENERIC(MAX_MAC_PER_VF,
>   			      BIT(DEVLINK_PARAM_CMODE_RUNTIME),
>   			      i40e_max_mac_per_vf_get,
>   			      i40e_max_mac_per_vf_set,
>   			      NULL),
> +	DEVLINK_PARAM_DRIVER(I40E_DEVLINK_PARAM_ID_ATR_SAMPLE_RATE,
> +			     "atr_sample_rate",
> +			     DEVLINK_PARAM_TYPE_U32,
> +			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
> +			     i40e_atr_sample_rate_get,
> +			     i40e_atr_sample_rate_set,
> +			     i40e_atr_sample_rate_validate),
>   };
>   
>   static void i40e_info_get_dsn(struct i40e_pf *pf, char *buf, size_t len)
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index d59750c490f4..9c8144970a34 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -3458,7 +3458,7 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
>   
>   	/* some ATR related tx ring init */
>   	if (test_bit(I40E_FLAG_FD_ATR_ENA, vsi->back->flags)) {
> -		ring->atr_sample_rate = I40E_DEFAULT_ATR_SAMPLE_RATE;
> +		ring->atr_sample_rate = vsi->back->atr_sample_rate;
>   		ring->atr_count = 0;
>   	} else {
>   		ring->atr_sample_rate = 0;
> @@ -12745,6 +12745,8 @@ static int i40e_sw_init(struct i40e_pf *pf)
>   		}
>   	}
>   
> +	pf->atr_sample_rate = I40E_DEFAULT_ATR_SAMPLE_RATE;
> +
>   	if ((pf->hw.func_caps.fd_filters_guaranteed > 0) ||
>   	    (pf->hw.func_caps.fd_filters_best_effort > 0)) {
>   		set_bit(I40E_FLAG_FD_ATR_ENA, pf->flags);
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index bb741ff3e5f2..7e29e9244c3a 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -372,8 +372,8 @@ struct i40e_ring {
>   	u16 next_to_clean;
>   	u16 xdp_tx_active;
>   
> -	u8 atr_sample_rate;
> -	u8 atr_count;
> +	u32 atr_sample_rate;
> +	u32 atr_count;
>   
>   	bool ring_active;		/* is ring online or not */
>   	bool arm_wb;		/* do something to arm write back */

Hi Aleksandr,

Your concern is indeed valid. I'm not 100% sure whether devlink 
callbacks are still protected by rtnl_lock after the large locking 
changes that recently went into net/core.

That said, I'm wondering whether we need to store the ATR sample rate 
per ring at all. As far as I can tell, there is no option to configure 
the sample rate independently for individual rings, so maintaining a 
copy in every ring may not be necessary.

Would it make sense to remove the per-ring copy entirely and keep the 
sample rate only at the PF level? That would avoid the need to walk the 
rings from the devlink callback and would eliminate the race you pointed 
out.
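
To make the idea concrete, here is a rough userspace model of keeping a
single PF-level copy. Everything in it is hypothetical: struct fake_pf,
struct fake_ring and both functions are invented for illustration, and
C11 relaxed atomics stand in for what would presumably be
WRITE_ONCE()/READ_ONCE() in the driver. It is a sketch of the approach,
not the planned v2.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the i40e structures. */
struct fake_pf {
	_Atomic uint32_t atr_sample_rate;	/* the only copy of the rate */
};

struct fake_ring {
	struct fake_pf *pf;
	uint32_t atr_count;	/* touched only by this ring's Tx path */
};

/* Setter side: a single store, no walk over the rings. */
static void fake_atr_sample_rate_set(struct fake_pf *pf, uint32_t rate)
{
	atomic_store_explicit(&pf->atr_sample_rate, rate,
			      memory_order_relaxed);
}

/* Tx side: returns true when this packet should trigger an ATR update. */
static bool fake_atr_should_sample(struct fake_ring *ring)
{
	uint32_t rate = atomic_load_explicit(&ring->pf->atr_sample_rate,
					     memory_order_relaxed);

	if (!rate)
		return false;	/* a rate of 0 means sampling is off */
	if (++ring->atr_count < rate)
		return false;
	ring->atr_count = 0;
	return true;
}
```

The per-ring counter stays where it is, since only the ring's own Tx
path updates it; the setter never has to touch ring state.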

Thanks Piotr for the review. I'll address your comment in v2.



* Re: [PATCH net v3] ip_tunnel: drop stale dst from generated PMTU ICMP replies
From: Ido Schimmel @ 2026-06-16 11:02 UTC (permalink / raw)
  To: laikabcprice
  Cc: David Ahern, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Shuah Khan, netdev, linux-kernel,
	linux-kselftest
In-Reply-To: <20260614-master-v3-1-9f5060ba1ed1@gmail.com>

On Sun, Jun 14, 2026 at 12:13:57AM +0100, Laika Price via B4 Relay wrote:
> From: Laika Price <laikabcprice@gmail.com>
> 
> iptunnel_pmtud_build_icmp(...) and iptunnel_pmtud_build_icmpv6(...) take
> in an sk_buff, modify it to create a PMTU ICMP error reply, and return it.
> As part of these modifications, the source/destination ethernet and IP
> addresses are swapped around which makes the sk_buff's current dst invalid.
> 
> If the stale dst is left, the packet can skip input routing and be
> forwarded using the original output device. This was observed when sending
> packets to a VXLAN over a WireGuard tunnel - the ICMP reply was generated
> but it was sent over the VXLAN instead of to the WireGuard tunnel.
> 
> This patch drops the stale dst after building the PMTU reply so that the
> packet is routed using its new headers when it is reinjected.
> 
> The pmtu_ipv4_br_vxlan4_exception test generates PMTU exceptions by
> pinging an IP on the other side of a tunnel. This was incorrect as it
> would return upon the first ICMP Fragmentation Needed due to the -w flag
> being used in conjunction with || return 1.
> 
> This patch updates pmtu_ipv4_br_vxlan4_exception to be in line with how
> PMTU exceptions are generated in other tests such as in test_pmtu_ipvX
> 
>     run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s 1800 ${dst1}
>     run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s 1800 ${dst2}

1. Please split the selftest fix into a separate patch (patch #1) and explain
why the test is currently passing and why it's going to break with the
subsequent code change.

2. Use the appropriate Fixes tag for each patch.

3. Go over this doc:

https://docs.kernel.org/process/maintainer-netdev.html

4. Use ingest_mdir.py to test your patches:

https://github.com/linux-netdev/nipa#running-locally

> 
> Signed-off-by: Laika Price <laikabcprice@gmail.com>
> ---
> Changes in v3:
> - Squashed the selftest update into the ip_tunnel fix so the patch remains
>   bisectable.
> - Link to v2: https://patch.msgid.link/20260613-master-v2-0-061b70fd45dd@gmail.com
> 
> Changes in v2:
> - Fixed incorrect PMTU exception generation in the selftest.
> - Link to v1: https://patch.msgid.link/20260613-master-v1-1-df796e8e2d74@gmail.com
> ---
>  net/ipv4/ip_tunnel_core.c           | 2 ++
>  tools/testing/selftests/net/pmtu.sh | 4 ++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
> index d3c677e9b..949150e43 100644
> --- a/net/ipv4/ip_tunnel_core.c
> +++ b/net/ipv4/ip_tunnel_core.c
> @@ -267,6 +267,7 @@ static int iptunnel_pmtud_build_icmp(struct sk_buff *skb, int mtu)
>  
>  	eth_header(skb, skb->dev, ntohs(eh.h_proto), eh.h_source, eh.h_dest, 0);
>  	skb_reset_mac_header(skb);
> +	skb_dst_drop(skb);

This probably needs to be:

if (skb_valid_dst(skb))
	skb_dst_drop(skb);

Both VXLAN and GENEVE use the dst after skb_tunnel_check_pmtu() when in
external mode, so you can't drop it unconditionally. This shouldn't be a
problem because both IPv4 and IPv6 will resolve a new dst if the
current one isn't valid (i.e., it's a dst metadata one).
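
A minimal userspace sketch of the conditional drop suggested above. It
only models the difference between a regular route dst and a metadata
dst; struct pkt_model, enum dst_kind and both helpers are hypothetical
stand-ins for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: what kind of dst a packet currently carries. */
enum dst_kind {
	DST_KIND_NONE,     /* no dst attached */
	DST_KIND_ROUTE,    /* regular, "valid" route dst */
	DST_KIND_METADATA, /* tunnel metadata dst (external mode) */
};

struct pkt_model {
	enum dst_kind dst;
};

/* Models the check: a dst is present and is not a metadata dst. */
static bool model_valid_dst(const struct pkt_model *p)
{
	assert(p != NULL);
	return p->dst == DST_KIND_ROUTE;
}

/* Drop only a regular dst; a metadata dst stays for the tunnel driver. */
static void model_drop_stale_dst(struct pkt_model *p)
{
	if (model_valid_dst(p))
		p->dst = DST_KIND_NONE;
}
```

In this model a packet carrying a route dst ends up with no dst and
would be routed again, while a packet carrying a metadata dst is left
untouched.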

>  
>  	return skb->len;
>  }
> @@ -370,6 +371,7 @@ static int iptunnel_pmtud_build_icmpv6(struct sk_buff *skb, int mtu)
>  
>  	eth_header(skb, skb->dev, ntohs(eh.h_proto), eh.h_source, eh.h_dest, 0);
>  	skb_reset_mac_header(skb);
> +	skb_dst_drop(skb);
>  
>  	return skb->len;
>  }
> diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
> index a3323c21f..9498d9f53 100755
> --- a/tools/testing/selftests/net/pmtu.sh
> +++ b/tools/testing/selftests/net/pmtu.sh
> @@ -1456,8 +1456,8 @@ test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception() {
>  	mtu "${ns_a}" ${type}_a $((${ll_mtu} + 1000))
>  	mtu "${ns_b}" ${type}_b $((${ll_mtu} + 1000))
>  
> -	run_cmd ${ns_c} ${ping} -q -M want -i 0.1 -c 10 -s $((${ll_mtu} + 500)) ${dst} || return 1
> -	run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1  -s $((${ll_mtu} + 500)) ${dst} || return 1
> +	run_cmd ${ns_c} ${ping} -q -M want -i 0.1 -w 1 -s $((${ll_mtu} + 500)) ${dst}
> +	run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s $((${ll_mtu} + 500)) ${dst}
>  
>  	# Check that exceptions were created
>  	pmtu="$(route_get_dst_pmtu_from_exception "${ns_c}" ${dst})"
> 
> ---
> base-commit: 2a2974b5145cdf2f4db134be1a2157e9ca4a1cf0
> change-id: 20260613-master-b749dfae5ecc
> 
> Best regards,
> --  
> Laika Price <laikabcprice@gmail.com>
> 
> 

^ permalink raw reply

* [PATCH] net: airoha: Clean up RX queues in airoha_dev_stop
From: Wayen Yan @ 2026-06-16 10:50 UTC (permalink / raw)
  To: netdev
  Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek

When the last port is stopped, airoha_dev_stop() clears TX queues
but neglects to clean up RX queues. This can lead to:
- RX ring buffer descriptors remaining valid after device close
- Potential DMA synchronization issues on device reopen
- Risk of use-after-free if pages are freed while DMA is still active

Add cleanup loop for RX queues to mirror the TX queue cleanup,
ensuring symmetric resource management.

Fixes: 20bf7d07c956 ("net: airoha: add QDMA support for Airoha EN7581 Ethernet")
Signed-off-by: Wayen Yan <win847@gmail.com>
---
 drivers/net/ethernet/airoha/airoha_eth.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 31cdb11cd7..9ca5bbf64d 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -1771,6 +1771,13 @@ static int airoha_dev_stop(struct net_device *dev)
 
 			airoha_qdma_cleanup_tx_queue(&qdma->q_tx[i]);
 		}
+
+		for (i = 0; i < ARRAY_SIZE(qdma->q_rx); i++) {
+			if (!qdma->q_rx[i].ndesc)
+				continue;
+
+			airoha_qdma_cleanup_rx_queue(&qdma->q_rx[i]);
+		}
 	}
 
 	return 0;
-- 
2.51.0



^ permalink raw reply related

* [PATCH] net: airoha: Stop TX queues on error path in airoha_dev_open
From: Wayen Yan @ 2026-06-16 10:50 UTC (permalink / raw)
  To: netdev
  Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek

In airoha_dev_open(), if airoha_set_vip_for_gdm_port() fails after
netif_tx_start_all_queues() has been called, the TX queues remain
started while the device configuration is incomplete. This leaves
the device in an inconsistent state where packets could be
transmitted before the VIP/IFC port configuration is complete.

Add netif_tx_stop_all_queues() call on the error path to properly
roll back the TX queue state.

Fixes: 20bf7d07c956 ("net: airoha: add QDMA support for Airoha EN7581 Ethernet")
Signed-off-by: Wayen Yan <win847@gmail.com>
---
 drivers/net/ethernet/airoha/airoha_eth.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 31cdb11cd7..cf9c366907 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -1715,8 +1715,10 @@ static int airoha_dev_open(struct net_device *dev)
 
 	netif_tx_start_all_queues(dev);
 	err = airoha_set_vip_for_gdm_port(port, true);
-	if (err)
+	if (err) {
+		netif_tx_stop_all_queues(dev);
 		return err;
+	}
 
 	if (netdev_uses_dsa(dev))
 		airoha_fe_set(qdma->eth, REG_GDM_INGRESS_CFG(port->id),
-- 
2.51.0



^ permalink raw reply related

* [PATCH net] dpaa2-switch: fix VLAN upper check not rejecting bridge join
From: Ioana Ciornei @ 2026-06-16 10:54 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, netdev
  Cc: f.fainelli, vladimir.oltean, linux-kernel

The blamed commit refactored the prechangeupper event handling but
failed to actually return an error in case
dpaa2_switch_prevent_bridging_with_8021q_upper() detected a 802.1q upper
on a port which tries to join a bridge. Fix this by returning err
instead of 0.

Fixes: 45035febc495 ("net: dpaa2-switch: refactor prechangeupper sanity checks")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
---
 drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
index 52c1cb9cb7e0..46ae81c2fa01 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
@@ -2177,7 +2177,7 @@ dpaa2_switch_prechangeupper_sanity_checks(struct net_device *netdev,
 	if (err) {
 		NL_SET_ERR_MSG_MOD(extack,
 				   "Cannot join a bridge while VLAN uppers are present");
-		return 0;
+		return err;
 	}
 
 	netdev_for_each_lower_dev(upper_dev, other_dev, iter) {
-- 
2.25.1


^ permalink raw reply related

* [PATCH] net: airoha: Fix QoS counter configuration for Tx-fwd channels
From: Wayen Yan @ 2026-06-16 10:50 UTC (permalink / raw)
  To: netdev
  Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek

In airoha_qdma_init_qos_stats(), the Tx-fwd counter was incorrectly
using register index (i << 1) instead of ((i << 1) + 1). This caused
the Tx-fwd configuration to overwrite the Tx-cpu configuration for
each QoS channel, resulting in incorrect QoS statistics.

Fix by using the correct register index ((i << 1) + 1) for Tx-fwd
counter configuration.
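
The indexing problem can be sketched in userspace as follows. This is
only a model under stated assumptions: the cntr_cfg[] array,
NUM_CHANNELS and the CFG_* marker values are made up for illustration
and do not correspond to real register contents.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_CHANNELS 4    /* hypothetical channel count for this model */
#define CFG_TX_CPU   0x1u /* made-up marker for a Tx-cpu counter config */
#define CFG_TX_FWD   0x2u /* made-up marker for a Tx-fwd counter config */

/* Two config slots per channel: even index = Tx-cpu, odd index = Tx-fwd. */
static uint32_t cntr_cfg[NUM_CHANNELS * 2];

static unsigned int tx_cpu_idx(unsigned int chan)
{
	assert(chan < NUM_CHANNELS);
	return chan << 1;
}

static unsigned int tx_fwd_idx(unsigned int chan)
{
	assert(chan < NUM_CHANNELS);
	return (chan << 1) + 1;
}

/* buggy != 0 reproduces the old indexing, buggy == 0 the fixed one. */
static void init_qos_stats_model(int buggy)
{
	unsigned int i;

	memset(cntr_cfg, 0, sizeof(cntr_cfg));
	for (i = 0; i < NUM_CHANNELS; i++) {
		cntr_cfg[tx_cpu_idx(i)] = CFG_TX_CPU;
		/* The old code wrote the Tx-fwd config to the Tx-cpu slot. */
		cntr_cfg[buggy ? tx_cpu_idx(i) : tx_fwd_idx(i)] = CFG_TX_FWD;
	}
}
```

With the old indexing every even slot ends up holding the Tx-fwd
config and every odd slot stays unconfigured; with the fixed indexing
each slot holds its own config.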

Fixes: 20bf7d07c956 ("net: airoha: add QDMA support for Airoha EN7581 Ethernet")
Signed-off-by: Wayen Yan <win847@gmail.com>
---
 drivers/net/ethernet/airoha/airoha_eth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 31cdb11cd7..329988a840 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -1256,7 +1256,7 @@ static void airoha_qdma_init_qos_stats(struct airoha_qdma *qdma)
 			       FIELD_PREP(CNTR_CHAN_MASK, i));
 		/* Tx-fwd transferred count */
 		airoha_qdma_wr(qdma, REG_CNTR_VAL((i << 1) + 1), 0);
-		airoha_qdma_wr(qdma, REG_CNTR_CFG(i << 1),
+		airoha_qdma_wr(qdma, REG_CNTR_CFG((i << 1) + 1),
 			       CNTR_EN_MASK | CNTR_ALL_QUEUE_EN_MASK |
 			       CNTR_ALL_DSCP_RING_EN_MASK |
 			       FIELD_PREP(CNTR_SRC_MASK, 1) |
-- 
2.51.0



^ permalink raw reply related

* [PATCH 5.10/5.15/6.1/6.6/6.12/6.18] tap: free page on error paths in tap_get_user_xdp()
From: Nazar Kalashnikov @ 2026-06-16  9:02 UTC (permalink / raw)
  To: stable, Greg Kroah-Hartman
  Cc: Nazar Kalashnikov, Willem de Bruijn, Jason Wang, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Dongli Zhang, netdev,
	linux-kernel, bpf, Si-Wei Liu, Willem de Bruijn, lvc-project,
	Xiang Mei, Weiming Shi

From: Weiming Shi <bestswngs@gmail.com>

commit 3bcf7aec6a9d16438f2cec29f5d7c8d5b8edf9b2 upstream.

tap_get_user_xdp() rejects a frame shorter than ETH_HLEN with -EINVAL,
and returns -ENOMEM when build_skb() fails. Both paths jump to the err
label without freeing the page that vhost_net_build_xdp() allocated for
the frame. tap_sendmsg() discards the per-buffer return value and always
returns 0, so vhost_tx_batch() takes the success path and never frees
the page; each rejected frame in a batch leaks one page-frag chunk.

Free the page on both error paths, before the skb is built. This is the
tap counterpart of the same leak in tun_xdp_one().

Fixes: 0efac27791ee ("tap: accept an array of XDP buffs through sendmsg()")
Fixes: ed7f2afdd0e0 ("tap: add missing verification for short frame")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260521163230.1478627-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Nazar Kalashnikov <nazarkalashnikov0@gmail.com>
---
Backport fix for CVE-2026-46320
 drivers/net/tap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 6fd3b14273b3..b51ce7af1b20 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1052,6 +1052,7 @@ static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 	int err, depth;
 
 	if (unlikely(xdp->data_end - xdp->data < ETH_HLEN)) {
+		put_page(virt_to_head_page(xdp->data));
 		err = -EINVAL;
 		goto err;
 	}
@@ -1061,6 +1062,7 @@ static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
 
 	skb = build_skb(xdp->data_hard_start, buflen);
 	if (!skb) {
+		put_page(virt_to_head_page(xdp->data));
 		err = -ENOMEM;
 		goto err;
 	}
-- 
2.47.3

^ permalink raw reply related

* [PATCH] ice: retry reading NVM if admin queue returns EBUSY
From: Robert Malz @ 2026-06-16 10:45 UTC (permalink / raw)
  To: anthony.l.nguyen, przemyslaw.kitszel; +Cc: intel-wired-lan, netdev

When the admin queue command to read NVM returns EBUSY, the driver
currently treats it as a fatal error and aborts the entire read
operation. This can cause spurious NVM read failures during periods of
high firmware activity.

Add retry logic to ice_read_flat_nvm() that handles EBUSY responses
from the admin queue. When an EBUSY error is encountered, release the
NVM resource lock, wait for ICE_SQ_SEND_DELAY_TIME_MS, re-acquire it,
and retry the failed read. The retry is attempted up to
ICE_SQ_SEND_MAX_EXECUTE times before giving up.
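
A rough userspace sketch of the bounded retry described above. It is
not the driver code: struct fake_nvm, fake_read() and the max_retries
parameter are hypothetical, and the release/sleep/re-acquire step is
reduced to a counter.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical device model: returns -EBUSY for the first busy_left reads. */
struct fake_nvm {
	int busy_left; /* number of reads that will still report busy */
	int reads;     /* total read attempts issued */
	int releases;  /* how often the lock was released and re-taken */
};

static int fake_read(struct fake_nvm *nvm)
{
	nvm->reads++;
	if (nvm->busy_left > 0) {
		nvm->busy_left--;
		return -EBUSY;
	}
	return 0;
}

/* Retry one read while it reports -EBUSY, at most max_retries times. */
static int read_with_retry(struct fake_nvm *nvm, int max_retries)
{
	int retry_cnt = 0;
	int status;

	assert(max_retries >= 0);
	for (;;) {
		status = fake_read(nvm);
		if (status != -EBUSY || retry_cnt >= max_retries)
			return status;
		/* A real driver would release the lock, sleep, re-acquire. */
		nvm->releases++;
		retry_cnt++;
	}
}
```

Any error other than -EBUSY is returned immediately, and a device
that stays busy still terminates the loop after max_retries retries.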

Code was extracted from OOT ice driver 1.15.4 release. Additional
change was made to reset last_cmd in case of retry to make sure that
all commands are retried properly.

Fixes: e94509906d6b ("ice: create function to read a section of the NVM and Shadow RAM")
Signed-off-by: Robert Malz <robert.malz@canonical.com>
---
 drivers/net/ethernet/intel/ice/ice_nvm.c | 25 +++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_nvm.c b/drivers/net/ethernet/intel/ice/ice_nvm.c
index 7e187a804dfa..cbe21ef9d18e 100644
--- a/drivers/net/ethernet/intel/ice/ice_nvm.c
+++ b/drivers/net/ethernet/intel/ice/ice_nvm.c
@@ -67,6 +67,7 @@ ice_read_flat_nvm(struct ice_hw *hw, u32 offset, u32 *length, u8 *data,
 {
 	u32 inlen = *length;
 	u32 bytes_read = 0;
+	int retry_cnt = 0;
 	bool last_cmd;
 	int status;
 
@@ -96,11 +97,25 @@ ice_read_flat_nvm(struct ice_hw *hw, u32 offset, u32 *length, u8 *data,
 					 offset, read_size,
 					 data + bytes_read, last_cmd,
 					 read_shadow_ram, NULL);
-		if (status)
-			break;
-
-		bytes_read += read_size;
-		offset += read_size;
+		if (status) {
+			if (hw->adminq.sq_last_status != ICE_AQ_RC_EBUSY ||
+			    retry_cnt > ICE_SQ_SEND_MAX_EXECUTE)
+				break;
+			ice_debug(hw, ICE_DBG_NVM,
+				  "NVM read EBUSY error, retry %d\n",
+				  retry_cnt + 1);
+			last_cmd = false;
+			ice_release_nvm(hw);
+			msleep(ICE_SQ_SEND_DELAY_TIME_MS);
+			status = ice_acquire_nvm(hw, ICE_RES_READ);
+			if (status)
+				break;
+			retry_cnt++;
+		} else {
+			bytes_read += read_size;
+			offset += read_size;
+			retry_cnt = 0;
+		}
 	} while (!last_cmd);
 
 	*length = bytes_read;
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Sebastian Andrzej Siewior @ 2026-06-16 10:35 UTC (permalink / raw)
  To: Jakub Kicinski, Petr Mladek, John Ogness, Sergey Senozhatsky,
	Peter Zijlstra
  Cc: Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260611191114.5bc43a59@kernel.org>

On 2026-06-11 19:11:14 [-0700], Jakub Kicinski wrote:
> On Wed, 10 Jun 2026 11:36:21 -0700 Vlad Poenaru wrote:
> > @@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
> > +	local_bh_disable();
> > + 	poll_napi(dev);
> > +	_local_bh_enable();
> 
> tglx, Sebastian, are you okay with using _local_bh_enable() to trick
> softirq into not waking ksoftirqd? The problematic path is:
> 
>   scheduler -> printk -> netconsole -> raise softirq -> scheduler (deadlock)
> 
> so the softirq may never get serviced.
> 
> In netcons we try to avoid touching the network driver if the Tx path
> locks are already held. Ideally we'd do something similar with the
> scheduler. Try to do bare minimum if we may be in the scheduler.
> Failing that - don't poll the driver if we were called with irqs
> already disabled.
> 
> Or maybe we only poll from console->write_thread ?

So this is not an issue since commit 7eab73b18630e ("netconsole: convert
to NBCON console infrastructure"), because from then on writes are
deferred to the nbcon thread. So this is purely about -stable in this case.

Looking at the patch, the amount of comments vs. code changes looks
somewhat hackish. The ifdef for PREEMPT_RT is not needed because on
PREEMPT_RT we have either nbcon or the legacy console (including
netconsole before the mentioned commit) wrapped in a dedicated thread
(via force_legacy_kthread()).
That means in both cases the flow never ends up there and the problem is
limited to !PREEMPT_RT.
Now. The scheduler usually does printk_deferred() because of the rq lock
so it does not deadlock for various reasons. It is kind of a pity that
the various WARN macros don't do that.
I don't think that patch is enough. It works around the problem in this
scenario but should the NIC driver invoke schedule_work() then we are
back here again.
Should the network driver acquire a lock then lockdep might observe
rq -> driver-lock and then driver-lock -> rq and report a deadlock (CPU1
doing AB and CPU2 doing BA). This also includes other console drivers, so
it is not limited to netconsole.

Point being made is that we should avoid the callchain:

|  console_unlock
|  vprintk_emit
|  __warn
|  __enqueue_entity                // WARN_ON_ONCE() here -- rq->lock held
|  put_prev_entity
|  put_prev_task_fair
|  __schedule

basically a printk under the rq lock.

We could add printk_deferred_enter/exit() to all the rq_lock() variants.
I think PeterZ loves this the most. And Greg will appreciate it too
while backporting because of all the context changes.

We could also introduce WARN_ON_DEFERRED + variants which do the
printk_deferred_enter/exit() thingy around the printk and replace
all the WARNs in kernel/sched/.
I *think* the tty/console layer has also a deadlock problem where it
holds locks and then the WARN(), that never triggers, asks for the same
locks again so we might have a second user…

Adding sched and printk folks for opinions while eyeballing
WARN_ON_DEFERRED().

Sebastian

^ permalink raw reply

* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Pedro Falcato @ 2026-06-16 10:28 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: rizzo.unipi, m.szyprowski, robin.murphy, willemb, kuniyu, davem,
	edumazet, kuba, pabeni, gregkh, rafael, akpm, david, netdev,
	linux-mm, iommu, driver-core, linux-kernel,
	Jesper Dangaard Brouer, Ilias Apalodimas
In-Reply-To: <CAMOZA0+L2+=FEQ5ORvv07JaJix0R+6Q6u01CyMKCbd842To9nA@mail.gmail.com>

On Tue, Jun 16, 2026 at 11:48:36AM +0200, Luigi Rizzo wrote:
> On Tue, Jun 16, 2026 at 11:20 AM Pedro Falcato <pfalcato@suse.de> wrote:
> >
> > (+cc page pool maintainers)
> > On Mon, Jun 15, 2026 at 11:42:20PM +0000, Luigi Rizzo wrote:
> > > The use of swiotlb causes an extra data copy on I/O.  For tx sockets,
> > > especially with greedy senders, this has a high chance of happening in
> > > the softirq handler for tx network interrupts, creating a significant
> > > performance bottleneck.
> > >
> > > Allow tx sockets to allocate socket buffers directly from the bounce
> > > buffers. This avoids the second copy and removes the above bottleneck.
> > > The fraction of swiotlb buffers allowed for this feature is set with
> > >    /sys/module/swiotlb/parameters/zerocopy_tx_percent
> > > (0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
> > >
> > > Implementation:
> > > - define a new page type to unambiguously identify bounce buffers used
> > >   as backing storage for socket buffers
> > > - modify skb_page_frag_refill to perform the modified allocation
> > > - modify the destructors __free_frozen_pages(), free_unref_folio() to
> > >   handle those pages and return them to the pool.
> > >
> > > The savings are especially visible with fewer queues. In synthetic
> > > benchmarks, senders with 1-2 queues would cap around 50Gbps with
> > > conventional swiotlb, and reach over 170Gbps with the feature enabled.
> >
> > I could be wrong, but I genuinely think that the way to go about this is
> > using page_pool for regular TX as well. page_pool pages are all dma-mapped
> > (so whatever swiotlb optimization you want can be done there), and the net
> > stack already has awareness of these special pages and special skbs, so it
> > won't Just Return Them back to the page allocator.
> 
> I am not sure I follow your comment above, can you expand/clarify?
> 
> The problem I am dealing with is that the copy from the socket buffer
> to the bounce buffer is done in the device xmit function. Under high
> load it is almost always done by the tx softirq.
> This means that even if we move the copy outside the HARD_TX_LOCK(),
> it would still be almost completely serialized.
> Hence the proposed method to make skb_page_frag_refill() allocate
> directly a bounce buffer (under specific conditions) so there is a single copy
> done directly to the dma-able buffer, and it is done in the user threads/CPUs
> and is not serialized in the softirq thread.
> 
> I am not sure how page_pool on tx could help here.

Page pool would provide both the means of passing around an iommu-mapped page,
and a concrete "this is where we allocate these pages" spot. Then introducing
a "zero-copy" swiotlb allocation would be a simple matter of introducing this
on page pool's side. In pseudo-code, something like:

static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
						 gfp_t gfp)
{
	struct page *page;

	gfp |= __GFP_COMP;
	
	if (pool->dma_map && /* is_swiotlb */) {
		page = swiotlb_alloc_pages(pool->p.nid, gfp, pool->p.order, ...);
		if (!page)
			return NULL;
		/* page is implicitly swiotlb mapped (well, _actually_ it's
		 * not that simple, because of the dma_mapped tracking that
		 * was introduced, but PoC anyway..). */
	} else {
		page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
		if (unlikely(!page))
			return NULL;

		if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) {
			put_page(page);
			return NULL;
		}
	}

	return page;
}

(plus other spots, obviously). No copying should be required, and the
netmem desc will keep the dma_addr around. The network stack will notice
pp_recycle on all of these skbs and simply refuse to throw the pages away to
the page allocator.

In any case, it might be that this is not feasible for XYZ reasons, but I've
thought about this (making net use and reuse page pool pre-iommu-mapped pages
exclusively) for a while and I definitely see a lot of similarities with your
problem (that more or less reduces down to "I want to get an iommu-mapped page
from the get-go").

-- 
Pedro

^ permalink raw reply

* RE: [PATCH net-next v2] net: dsa: Fix skb ownership in taggers
From: Wei Fang @ 2026-06-16 10:19 UTC (permalink / raw)
  To: Linus Walleij
  Cc: netdev@vger.kernel.org, Sashiko AI Review, Andrew Lunn,
	Vladimir Oltean, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Florian Fainelli, Jonas Gorski,
	Hauke Mehrtens, Kurt Kanzenbach, Woojung Huh,
	UNGLinuxDriver@microchip.com, Chester A. Unal, Daniel Golle,
	Matthias Brugger, AngeloGioacchino Del Regno, Clark Wang,
	Clément Léger, George McCollister, David Yang
In-Reply-To: <20260616-dsa-fix-free-skb-v2-1-9dbda6a19e97@kernel.org>

> The tag_8021q.c tagger calls vlan_insert_tag() in dsa_8021q_xmit().
> vlan_insert_tag() will consume the skb with kfree_skb() on failure
> and return NULL.
> 
> When NULL is returned as error code to ->xmit() in dsa_user_xmit()
> it will free the same skb again leading to a double-free.
> 
> The idea of dsa_user_xmit() and dsa_switch_rcv() dropping the skb
> they held before the call to ->xmit() and ->rcv() is conceptually
> wrong: the pattern elsewhere in the networking code is that consumers
> drop their skb:s on failure.
> 
> Modify the ->xmit() and ->rcv() call sites to not drop the SKB if
> the taggers return NULL from any of these calls. Move those drops into
> the taggers so every callback error path that retains ownership consumes
> the skb before returning NULL.
> 
> Keep the existing helper ownership rules: VLAN insertion helpers already
> free on failure (this is the case in tag_8021q.c), while deferred
> transmit paths either transfer the skb reference to worker context or
> hold a worker reference with skb_get() and drop the caller's reference.
> 
> For SJA1105 meta RX, transfer the buffered stampable skb under the meta
> lock and return NULL while the skb is waiting for its meta frame: the
> skb is not dropped in this case.

Reviewed-by: Wei Fang <wei.fang@nxp.com> # netc


^ permalink raw reply

* Re: [PATCH bpf] bpf, sockmap: fix lock inversion between stab->lock and sk_callback_lock
From: Jiayuan Chen @ 2026-06-16 10:17 UTC (permalink / raw)
  To: Sechang Lim, John Fastabend, Jakub Sitnicki
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Simon Horman, netdev, bpf,
	linux-kernel
In-Reply-To: <20260616091153.2966617-1-rhkrqnwk98@gmail.com>


On 6/16/26 5:11 PM, Sechang Lim wrote:
> sock_map_update_common() and __sock_map_delete() hold stab->lock and call
> sock_map_unref() -> sock_map_del_link() under it. sock_map_del_link() takes
> sk_callback_lock for write to stop the strparser and verdict, giving the
> lock order stab->lock -> sk_callback_lock.
>
> The opposite order comes from an SK_SKB stream parser. On RX,
> sk_psock_strp_data_ready() holds sk_callback_lock for read while running
> the parser. The verdict redirects the skb to egress, where a sched_cls


The commit message is wrong. A verdict does not redirect to egress
synchronously — sk_psock_skb_redirect() only queues the skb and
schedule_delayed_work()s sk_psock_backlog, so egress runs in workqueue
context, not under sk_callback_lock.


> program calls bpf_map_delete_elem() on a sockmap, which takes stab->lock:
>
>    WARNING: possible circular locking dependency detected
>    7.1.0-rc6 Not tainted
>    ------------------------------------------------------
>    syz.9.8824 is trying to acquire lock:
>    (&stab->lock){+.-.}-{3:3}, at: __sock_map_delete net/core/sock_map.c:421
>    but task is already holding lock:
>    (clock-AF_INET){++.-}-{3:3}, at: sk_psock_strp_data_ready net/core/skmsg.c:1173
>
>    -> #1 (clock-AF_INET){++.-}-{3:3}:
>           _raw_write_lock_bh
>           sock_map_del_link net/core/sock_map.c:167
>           sock_map_unref net/core/sock_map.c:184
>           sock_map_update_common net/core/sock_map.c:509
>           sock_map_update_elem_sys net/core/sock_map.c:588
>           map_update_elem kernel/bpf/syscall.c:1805
>
>    -> #0 (&stab->lock){+.-.}-{3:3}:
>           _raw_spin_lock_bh
>           __sock_map_delete net/core/sock_map.c:421
>           sock_map_delete_elem net/core/sock_map.c:452
>           bpf_prog_06044d24140080b6
>           tcx_run net/core/dev.c:4451
>           sch_handle_egress net/core/dev.c:4541
>           __dev_queue_xmit net/core/dev.c:4808
>           ...
>           tcp_bpf_strp_read_sock net/ipv4/tcp_bpf.c:701


I guess it is an ACK. What is the actual purpose of a sched_cls program
calling sockmap delete on the TX path of an ACK? If there is no real use
case for it, this is just broken BPF usage, not a kernel bug worth this
change.



^ permalink raw reply

* [PATCH] net: airoha: fix foe_check_time allocation size
From: Wayen Yan @ 2026-06-16  9:49 UTC (permalink / raw)
  To: netdev
  Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek

foe_check_time is declared as u16 pointer but was allocated with
only ppe_num_entries bytes instead of ppe_num_entries * sizeof(u16).

When airoha_ppe_foe_verify_entry() is called with hash >= ppe_num_entries/2,
it writes beyond the allocated buffer, causing heap buffer overflow and
potential kernel crash.
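
The size arithmetic can be checked with a small userspace sketch. The
helper names below are hypothetical, and the model only computes byte
counts; it does not allocate or touch any driver state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old sizing: one byte per entry, although each entry is a u16. */
static size_t buggy_alloc_size(size_t num_entries)
{
	return num_entries;
}

/* Fixed sizing: one u16 per entry. */
static size_t fixed_alloc_size(size_t num_entries)
{
	assert(num_entries <= SIZE_MAX / sizeof(uint16_t));
	return num_entries * sizeof(uint16_t);
}

/* Does writing the u16 at index 'hash' stay inside 'alloc_bytes'? */
static int write_in_bounds(size_t alloc_bytes, size_t hash)
{
	return (hash + 1) * sizeof(uint16_t) <= alloc_bytes;
}
```

For example, with 1024 entries the old sizing covers indexes 0..511
only, so index 512 is the first out-of-bounds write, which matches the
hash >= ppe_num_entries/2 condition above.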

Signed-off-by: Wayen Yan <win847@gmail.com>
---
 drivers/net/ethernet/airoha/airoha_ppe.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/airoha/airoha_ppe.c b/drivers/net/ethernet/airoha/airoha_ppe.c
index 5c9dff6bcc..8fb8ecf909 100644
--- a/drivers/net/ethernet/airoha/airoha_ppe.c
+++ b/drivers/net/ethernet/airoha/airoha_ppe.c
@@ -1578,7 +1578,8 @@ int airoha_ppe_init(struct airoha_eth *eth)
 			return -ENOMEM;
 	}
 
-	ppe->foe_check_time = devm_kzalloc(eth->dev, ppe_num_entries,
+	ppe->foe_check_time = devm_kzalloc(eth->dev,
+					   ppe_num_entries * sizeof(*ppe->foe_check_time),
 					   GFP_KERNEL);
 	if (!ppe->foe_check_time)
 		return -ENOMEM;
-- 
2.51.0



^ permalink raw reply related

* Re: [PATCH net-next 3/5] selftests/bpf: remove sockmap + ktls tests
From: Jakub Sitnicki @ 2026-06-16 10:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, bpf,
	john.fastabend, sd
In-Reply-To: <20260614014102.461064-4-kuba@kernel.org>

On Sat, Jun 13, 2026 at 06:40 PM -07, Jakub Kicinski wrote:
> The combination of sockmap and TLS is no longer supported - installing
> the TLS ULP on a sockmap socket (and vice versa) is now rejected. Remove
> the tests that exercise the combination along with their BPF program;
> the file covered nothing but sockmap sockets holding kTLS contexts.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  .../selftests/bpf/prog_tests/sockmap_ktls.c   | 355 ------------------
>  .../selftests/bpf/progs/test_sockmap_ktls.c   |  61 ---
>  tools/testing/selftests/bpf/test_sockmap.c    | 227 +----------
>  3 files changed, 1 insertion(+), 642 deletions(-)
>  delete mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_ktls.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
> index 6ed8e149e3d5..cda6b22cf759 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c

[...]

>  static void run_ktls_test(int family, int sotype)
>  {
>  	if (test__start_subtest("tls simple offload"))
>  		test_sockmap_ktls_offload(family, sotype);

Nit: We probably don't need to keep this one test around.
It tests pure kTLS and overlaps with selftests/net/tls.c.

> -	if (test__start_subtest("tls tx cork"))
> -		test_sockmap_ktls_tx_cork(family, sotype, false);
> -	if (test__start_subtest("tls tx cork with push"))
> -		test_sockmap_ktls_tx_cork(family, sotype, true);
> -	if (test__start_subtest("tls tx egress with no buf"))
> -		test_sockmap_ktls_tx_no_buf(family, sotype, true);
> -	if (test__start_subtest("tls tx with pop"))
> -		test_sockmap_ktls_tx_pop(family, sotype);
> -	if (test__start_subtest("tls verdict with tls rx"))
> -		test_sockmap_ktls_verdict_with_tls_rx(family, sotype);
>  }

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>

^ permalink raw reply

* [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Ilya Maximets @ 2026-06-16 10:03 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kees Cook, Gustavo A. R. Silva, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt, linux-kernel,
	linux-hardening, llvm, Ilya Maximets, Johan Thomsen

kmalloc_flex() in metadata_dst_alloc() sets __counted_by for the
structure to the options_len, which is then initialized to zero.
Later, we're initializing the structure by copying the tunnel info
together with the options, and this triggers a warning for a potential
memcpy overflow, since the compiler estimates that the options can't
fit into the structure, even though the memory for them is actually
allocated.

 memcpy: detected buffer overflow: 104 byte write of buffer size 96
 WARNING: CPU: X PID: Y at lib/string_helpers.c:1036 __fortify_report
  skb_tunnel_info_unclone+0x179/0x190
  geneve_xmit+0x7fe/0xe00

The issue is triggered when built with clang and source fortification.

Fix that by doing the copy in two stages: first - the main data with
the options_len, then the options.  This way the correct length should
be known at the time of the copy.
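
A minimal userspace sketch of the two-stage copy idea. It assumes a
made-up struct tun_info_model with a flexible array member and leaves
out __counted_by so that it builds with any C99 compiler; it therefore
illustrates the copy order only and does not reproduce the fortify
warning.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a tunnel info struct with trailing options. */
struct tun_info_model {
	uint32_t key;
	uint8_t options_len; /* number of valid bytes in options[] */
	uint8_t options[];   /* flexible array member */
};

/* Copy src into a new allocation: fixed part first, then the options. */
static struct tun_info_model *unclone_model(const struct tun_info_model *src)
{
	struct tun_info_model *dst;

	assert(src != NULL);
	dst = calloc(1, sizeof(*dst) + src->options_len);
	if (!dst)
		return NULL;

	/* Stage 1: the fixed part, which also sets options_len. */
	*dst = *src;
	/* Stage 2: the options, now that the length is already in place. */
	memcpy(dst->options, src->options, src->options_len);

	return dst;
}
```

The caller owns the returned copy and releases it with free().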

It would be better if the options_len never changed after allocation,
but the allocation code is a little separate from the initialization
and it would be awkward and potentially dangerous to return a struct
with options_len set to a non-zero value from the metadata_dst_alloc().

Another option would be to use ip_tunnel_info_opts_set(), but it is
doing too many unnecessary operations for the use case here.

Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Reported-by: Johan Thomsen <write@ownrisk.dk>
Closes: https://lore.kernel.org/netdev/CAKv6aAM8_EWgXScnKmKYm_4SwGDVBK++dzfP+Y6msUXbp99QUw@mail.gmail.com/
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---

Johan, if you can test this one in your setup as well, that would
be great.  Thanks.

 include/net/dst_metadata.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 1fc2fb03ce3f..f45d1e3163f0 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
 	if (!new_md)
 		return ERR_PTR(-ENOMEM);

-	memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
-	       sizeof(struct ip_tunnel_info) + md_size);
+	/* Copy in two stages to keep the __counted_by happy. */
+	new_md->u.tun_info = md_dst->u.tun_info;
+	memcpy(ip_tunnel_info_opts(&new_md->u.tun_info),
+	       ip_tunnel_info_opts(&md_dst->u.tun_info), md_size);
+
 #ifdef CONFIG_DST_CACHE
 	/* Unclone the dst cache if there is one */
 	if (new_md->u.tun_info.dst_cache.cache) {
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH 07/23] driver core: platform: provide platform_device_set_fwnode()
From: Bartosz Golaszewski @ 2026-06-16  9:51 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Lee Jones, Mark Brown, Thierry Reding, Sebastian Hesselbarth,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Srinivas Kandagatla, Greg Kroah-Hartman, Vinod Koul,
	Rafael J. Wysocki, Danilo Krummrich, Rob Herring, Saravana Kannan,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Andi Shyti, Joerg Roedel,
	Will Deacon, Robin Murphy, Doug Berger, Florian Fainelli,
	Broadcom internal kernel review list, Ulf Hansson, Frank Li,
	Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Matthew Brost, Thomas Hellström, Rodrigo Vivi, David Airlie,
	Simona Vetter, Peter Chen, Paul Cercueil, Bin Liu, Philipp Zabel,
	Maximilian Luz, Hans de Goede, Ilpo Järvinen,
	Krzysztof Kozlowski, Benjamin Herrenschmidt, linux-kernel, netdev,
	linux-arm-msm, linux-sound, driver-core, devicetree, linuxppc-dev,
	linux-i2c, iommu, linux-pm, imx, linux-arm-kernel, intel-xe,
	dri-devel, linux-usb, linux-mips, platform-driver-x86,
	Bartosz Golaszewski, Bartosz Golaszewski
In-Reply-To: <ajEcDq0S067wMFaK@black.igk.intel.com>

On Tue, 16 Jun 2026 11:49:02 +0200, Andy Shevchenko
<andriy.shevchenko@linux.intel.com> said:
> On Thu, Jun 04, 2026 at 05:32:27AM -0700, Bartosz Golaszewski wrote:
>> On Tue, 2 Jun 2026 23:41:53 +0200, Andy Shevchenko
>> <andriy.shevchenko@linux.intel.com> said:
>> > On Thu, May 21, 2026 at 10:36:30AM +0200, Bartosz Golaszewski wrote:
>> >> Provide a helper function encapsulating the logic of assigning firmware
>> >> nodes to platform devices created with platform_device_alloc(). Make the
>> >> kerneldoc state that this is the proper interface for assigning firmware
>> >> nodes to dynamically allocated platform devices. This will allow us to
>> >> switch to counting the references of the device's firmware nodes in the
>> >> future, not only the OF nodes.
>> >
>> > But why different for of_node and fwnode to begin with?!
>>
>> I'm not following. What are you suggesting?
>
> After re-reading this thread, I think I'm suggesting the same thing you plan
> to do in the future, which you put as "This will allow us to switch to
> counting the references of the device's firmware nodes in the future, not only
> the OF nodes."
>
> // Offtopic
> I haven't heard from you for more than a month on this:
> https://lore.kernel.org/r/af18zdP5HF3_P9Vo@black.igk.intel.com
> Is there anything I should do? Please reply in that thread.
>

Eek, sorry, must have flown under the radar.

I'll pull it now, I will do a second PR for this merge window anyway.

Bart

^ permalink raw reply
