Netdev List
 help / color / mirror / Atom feed
* [PATCH net v2 1/2] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable
In-Reply-To: <20260624222605.1794719-1-decui@microsoft.com>

MANA allocates RX buffers from page pool fragments when frag_count is
greater than 1. In that case the buffers remain DMA mapped by page pool
and the RX completion path does not call dma_unmap_single(). As a result,
the implicit sync-for-CPU normally performed by dma_unmap_single() is
missing before the packet data is passed to the networking stack.

This breaks RX on configurations which require explicit DMA syncing, for
example when booted with swiotlb=force.

Fix this by recording the page pool page and DMA sync offset when the RX
buffer is allocated, and syncing the received packet range for CPU access
before handing the RX buffer to the stack.

Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Changes since v1:
    v1 is split into two patches in the v2.
    Add Haiyang's Reviewed-by.

 drivers/net/ethernet/microsoft/mana/mana_en.c | 39 +++++++++++++++----
 include/net/mana/mana.h                       |  8 ++++
 2 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c9b1df1ed109..1875bffd82b7 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2044,12 +2044,16 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
 }
 
 static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
-			     dma_addr_t *da, bool *from_pool)
+			     dma_addr_t *da, bool *from_pool,
+			     struct page **pp_page, u32 *dma_sync_offset)
 {
 	struct page *page;
 	u32 offset;
 	void *va;
+
 	*from_pool = false;
+	*pp_page = NULL;
+	*dma_sync_offset = 0;
 
 	/* Don't use fragments for jumbo frames or XDP where it's 1 fragment
 	 * per page.
@@ -2087,31 +2091,47 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 	va  = page_to_virt(page) + offset;
 	*da = page_pool_get_dma_addr(page) + offset + rxq->headroom;
 	*from_pool = true;
+	*pp_page = page;
+	*dma_sync_offset = offset + rxq->headroom;
 
 	return va;
 }
 
 /* Allocate frag for rx buffer, and save the old buf */
 static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
-			       struct mana_recv_buf_oob *rxoob, void **old_buf,
-			       bool *old_fp)
+			       struct mana_recv_buf_oob *rxoob, u32 pktlen,
+			       void **old_buf, bool *old_fp)
 {
+	struct page *pp_page;
+	u32 dma_sync_offset;
 	bool from_pool;
 	dma_addr_t da;
 	void *va;
 
-	va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+	va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+			     &dma_sync_offset);
 	if (!va)
 		return;
-	if (!rxoob->from_pool || rxq->frag_count == 1)
+	if (!rxoob->from_pool || rxq->frag_count == 1) {
 		dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
 				 DMA_FROM_DEVICE);
+	} else {
+		/* The page pool maps the whole page and only syncs for device
+		 * automatically (PP_FLAG_DMA_SYNC_DEV). Sync the received bytes
+		 * for the CPU before they are read: this is required if DMA
+		 * is incoherent or bounce buffers are used.
+		 */
+		page_pool_dma_sync_for_cpu(rxq->page_pool, rxoob->pp_page,
+					   rxoob->dma_sync_offset, pktlen);
+	}
 	*old_buf = rxoob->buf_va;
 	*old_fp = rxoob->from_pool;
 
 	rxoob->buf_va = va;
 	rxoob->sgl[0].address = da;
 	rxoob->from_pool = from_pool;
+	rxoob->pp_page = pp_page;
+	rxoob->dma_sync_offset = dma_sync_offset;
 }
 
 static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -2170,7 +2190,7 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		rxbuf_oob = &rxq->rx_oobs[curr];
 		WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-		mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+		mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
 
 		/* Unsuccessful refill will have old_buf == NULL.
 		 * In this case, mana_rx_skb() will drop the packet.
@@ -2566,6 +2586,8 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 			    struct mana_rxq *rxq, struct device *dev)
 {
 	struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+	struct page *pp_page = NULL;
+	u32 dma_sync_offset = 0;
 	bool from_pool = false;
 	dma_addr_t da;
 	void *va;
@@ -2573,13 +2595,16 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 	if (mpc->rxbufs_pre)
 		va = mana_get_rxbuf_pre(rxq, &da);
 	else
-		va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+		va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+				     &dma_sync_offset);
 
 	if (!va)
 		return -ENOMEM;
 
 	rx_oob->buf_va = va;
 	rx_oob->from_pool = from_pool;
+	rx_oob->pp_page = pp_page;
+	rx_oob->dma_sync_offset = dma_sync_offset;
 
 	rx_oob->sgl[0].address = da;
 	rx_oob->sgl[0].size = rxq->datasize;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..4111b93169d2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -305,6 +305,14 @@ struct mana_recv_buf_oob {
 
 	void *buf_va;
 	bool from_pool; /* allocated from a page pool */
+	/* head page of the page_pool fragment; valid only when
+	 * from_pool && frag_count > 1.
+	 */
+	struct page *pp_page;
+	/* Fragment offset plus rxq->headroom, passed to
+	 * page_pool_dma_sync_for_cpu().
+	 */
+	u32 dma_sync_offset;
 
 	/* SGL of the buffer going to be sent as part of the work request. */
 	u32 num_sge;
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v2 2/2] net: mana: Validate the packet length reported by the NIC
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable
In-Reply-To: <20260624222605.1794719-1-decui@microsoft.com>

Validate the packet length reported in the RX CQE before using it as a DMA
sync length or passing it to skb processing. The CQE is supplied by the
NIC device and should not be blindly trusted.

Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Changes since v1:
    v1 is split into two patches in the v2.
    Add Haiyang's Reviewed-by.

 drivers/net/ethernet/microsoft/mana/mana_en.c | 24 +++++++++++++++----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1875bffd82b7..0b44c51ae6ec 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2190,12 +2190,26 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		rxbuf_oob = &rxq->rx_oobs[curr];
 		WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-		mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
+		if (unlikely(pktlen > rxq->datasize)) {
+			/* Increase it even if mana_rx_skb() isn't called. */
+			rxq->rx_cq.work_done++;
 
-		/* Unsuccessful refill will have old_buf == NULL.
-		 * In this case, mana_rx_skb() will drop the packet.
-		 */
-		mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+			++ndev->stats.rx_dropped;
+			netdev_warn_once(ndev,
+				"Dropped oversized RX packet: len=%u, datasize=%u\n",
+				pktlen, rxq->datasize);
+
+			/* Reuse the RX buffer since rxbuf_oob is unchanged. */
+		} else {
+
+			mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen,
+					   &old_buf, &old_fp);
+
+			/* Unsuccessful refill will have old_buf == NULL.
+			 * In this case, mana_rx_skb() will drop the packet.
+			 */
+			mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+		}
 
 		mana_move_wq_tail(rxq->gdma_rq,
 				  rxbuf_oob->wqe_inf.wqe_size_in_bu);
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v2 0/2] Fix MANA RX with bounce buffering
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma

With swiotlb=force, the MANA NIC fails to work properly due to commit
730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead
of full pages to improve memory efficiency.")

Dipayaan tried to fix this by avoiding page pool frags when bounce
buffering is in use [1][2]. However, that is not a clean solution: no
other NIC drivers need to explicitly check whether bounce buffering is
in use. It is also not good for throughput, since
dma_map_single()/dma_unmap_single() are then called for each incoming
packet.

In fact, page pool frags can still be used with the standard MTU of
1500: all we need is to add page_pool_dma_sync_for_cpu() before the CPU
reads the incoming packet, so I implemented that in v1 [3].

As Simon suggested [4], this version splits v1 into two patches:
Patch 1 adds page_pool_dma_sync_for_cpu().
Patch 2 validates the packet length reported by the NIC.

There is no functional difference between v1 and v2, so I am keeping
Haiyang's Reviewed-by tag in v2.

Please review. Thanks!

Note that, with jumbo MTU and XDP, page pool frags are not used, and
dma_map_single()/dma_unmap_single() are still called for each incoming
packet, causing poor throughput with swiotlb=force; see
mana_get_rxbuf_cfg() and mana_refill_rx_oob() -> mana_get_rxfrag().
The jumbo MTU/XDP issue will be addressed later since that needs more
consideration if we want to use page pool with PP_FLAG_DMA_MAP there:
e.g., for XDP, the received packet can be transmitted in place, i.e. the
same RX buffer can be used as a TX buffer:
mana_rx_skb() -> mana_xdp_tx() -> mana_start_xmit() -> mana_map_skb().

In mana_create_page_pool(), we may have to set pprm.dma_dir to
DMA_BIDIRECTIONAL if XDP is in use:
pprm.dma_dir = mana_xdp_get(mpc) ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;

In the case of XDP, the next issue is that mana_rx_skb() -> ... ->
mana_map_skb() appears to call dma_map_single() on an RX buffer allocated
from a page pool created with PP_FLAG_DMA_MAP, which seems incorrect.
Any thoughts?

[1] https://lore.kernel.org/all/ae91hyrLf4n23XE6@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/#r
[2] https://lore.kernel.org/all/ae9pxvJfkAZYfKMf@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
[3] https://lore.kernel.org/all/20260618035029.249361-1-decui@microsoft.com/
[4] https://lore.kernel.org/all/20260619090514.GT827683@horms.kernel.org/

Dexuan Cui (2):
  net: mana: Sync page pool RX frags for CPU
  net: mana: Validate the packet length reported by the NIC

 drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
 include/net/mana/mana.h                       |  8 +++
 2 files changed, 58 insertions(+), 11 deletions(-)

-- 
2.34.1


^ permalink raw reply

* Re: [PATCH net] eth: fbnic: fix race between concurrent hwmon sensor reads
From: Alexander Duyck @ 2026-06-24 22:22 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Zinc Lim, Alexander Duyck, Jakub Kicinski, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, kernel-team, netdev,
	linux-kernel, zinclim
In-Reply-To: <f2ecb668-ec56-4c84-ac97-474a2d761005@lunn.ch>

On Wed, Jun 24, 2026 at 2:56 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Wed, Jun 24, 2026 at 11:51:29PM +0200, Andrew Lunn wrote:
> > On Wed, Jun 24, 2026 at 01:05:14PM -0700, Zinc Lim wrote:
> > > From: Zinc Lim <limzhineng2@gmail.com>
> > >
> > > Reading an hwmon sensor
> >
> > Since this is a hwmon related patch, it is a good idea to Cc: the
> > hwmon Maintainer.
> >
> > I don't know hwmon too well, i've never had to look at locking, but...
> >
> > static int hwmon_thermal_get_temp(struct thermal_zone_device *tz, int *temp)
> > {
> >       struct hwmon_thermal_data *tdata = thermal_zone_device_priv(tz);
> >       struct hwmon_device *hwdev = to_hwmon_device(tdata->dev);
> >       int ret;
> >       long t;
> >
> >       guard(mutex)(&hwdev->lock);
> >
> >       ret = hwdev->chip->ops->read(tdata->dev, hwmon_temp, hwmon_temp_input,
> >                                    tdata->index, &t);
> >
> > Could you explain why this lock in the hwmon core is not sufficient?
> > It could be i'm reading it wrong, but it looks like the lock is shared
> > by all hwmon instanced registered by a
> > hwmon_device_register_with_info() call.
>
> Documentation/hwmon/hwmon-kernel-api.rst
>
> When using ``[devm_]hwmon_device_register_with_info()`` to register the
> hardware monitoring device, accesses using the associated access functions
> are serialised by the hardware monitoring core. If a driver needs locking
> for other functions such as interrupt handlers or for attributes which are
> fully implemented in the driver, hwmon_lock() and hwmon_unlock() can be used
> to ensure that calls to those functions are serialized.

I did some digging and it looks like 3ad2a7b9b15d ("hwmon: Serialize
accesses in hwmon core") landed into 6.18. This was a patch put
together based on issues reported in an earlier kernel version and
explains how this got missed. So what we have is a redundant fix. We
can drop this for upstream and probably look at pulling the upstream
fix into our kernels that were having the issues.

Thanks for the review feedback.

- Alex

^ permalink raw reply

* Re: [PATCH v29 4/5] sfc: obtain and map cxl range using devm_cxl_probe_mem
From: Dan Williams (nvidia) @ 2026-06-24 22:10 UTC (permalink / raw)
  To: alejandro.lucero-palau, linux-cxl, netdev, dan.j.williams,
	edward.cree, davem, kuba, pabeni, edumazet, dave.jiang
  Cc: Alejandro Lucero, Edward Cree
In-Reply-To: <20260622124010.2192888-5-alejandro.lucero-palau@amd.com>

alejandro.lucero-palau@ wrote:
> From: Alejandro Lucero <alucerop@amd.com>
> 
> Use core API for safely obtain the CXL range linked to an HDM committed
> by the BIOS. Map such a range for being used as the ctpio buffer.
> 
> A potential user space action through sysfs unbinding or core cxl
> modules remove will trigger sfc driver device detachment, with that case
> not racing with this mapping as this is done during driver probe and
> therefore protected with device lock against those user space actions.
> 
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Acked-by: Edward Cree <ecree.xilinx@gmail.com>
> ---
>  drivers/net/ethernet/sfc/efx.c     |  2 ++
>  drivers/net/ethernet/sfc/efx_cxl.c | 23 +++++++++++++++++++++++
>  drivers/net/ethernet/sfc/efx_cxl.h |  3 +++
>  3 files changed, 28 insertions(+)
> 
> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
> index 61cbb6cfc360..3806cd3dd7f4 100644
> --- a/drivers/net/ethernet/sfc/efx.c
> +++ b/drivers/net/ethernet/sfc/efx.c
> @@ -984,6 +984,7 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
>  	efx_fini_io(efx);
>  
>  	probe_data = container_of(efx, struct efx_probe_data, efx);
> +	efx_cxl_exit(probe_data);
>  
>  	pci_dbg(efx->pci_dev, "shutdown successful\n");
>  
> @@ -1242,6 +1243,7 @@ static int efx_pci_probe(struct pci_dev *pci_dev,
>  	return 0;
>  
>   fail3:
> +	efx_cxl_exit(probe_data);
>  	efx_fini_io(efx);
>   fail2:
>  	efx_fini_struct(efx);
> diff --git a/drivers/net/ethernet/sfc/efx_cxl.c b/drivers/net/ethernet/sfc/efx_cxl.c
> index 18b535b3ea40..3e7c950f83e9 100644
> --- a/drivers/net/ethernet/sfc/efx_cxl.c
> +++ b/drivers/net/ethernet/sfc/efx_cxl.c
> @@ -18,6 +18,7 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
>  {
>  	struct efx_nic *efx = &probe_data->efx;
>  	struct pci_dev *pci_dev = efx->pci_dev;
> +	struct range cxl_pio_range;
>  	struct efx_cxl *cxl;
>  	u16 dvsec;
>  	int rc;
> @@ -73,9 +74,31 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
>  		return -ENODEV;
>  	}
>  
> +	cxl->cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &cxl_pio_range);
> +	if (IS_ERR(cxl->cxlmd)) {
> +		pci_err(pci_dev, "CXL accel memdev creation failed\n");
> +		return PTR_ERR(cxl->cxlmd);
> +	}
> +
> +	cxl->ctpio_cxl = ioremap_wc(cxl_pio_range.start,
> +				    range_len(&cxl_pio_range));
> +	if (!cxl->ctpio_cxl) {
> +		pci_err(pci_dev, "CXL ioremap region (%pra) failed\n",
> +			&cxl_pio_range);
> +		return -ENOMEM;
> +	}
> +
>  	probe_data->cxl = cxl;
>  
>  	return 0;
>  }
>  
> +void efx_cxl_exit(struct efx_probe_data *probe_data)
> +{

If you are going to have an explicit efx_cxl_exit() then I would also
add an explicit unregistration of the memdev. This would also fix the
Sashiko report about pci_disable_device() running while the cxl_memdev
is still registered. Unfortunately, mixing devm and explicit unwind is
always fraught.

Let me know if this passes your testing, and I can send it out as a
standalone patch. You could also use it to unwind if the ioremap()
fails.

-- 8< --
From 47940f7d6de6dd1039d0e445eb944ca96ad5eb3a Mon Sep 17 00:00:00 2001
From: Dan Williams <djbw@kernel.org>
Date: Wed, 24 Jun 2026 11:47:24 -0700
Subject: [PATCH] cxl: Add devm_cxl_remove_mem()

Undo devm_cxl_probe_mem() without causing the host driver to unload.

The expectation of devm_cxl_probe_mem() is that upon attach the caller
depends on the CXL topology not being torn down. If it is then it is
escalated to a 'remove' event on the caller. However, if the caller wants
to cancel their interest in CXL operation, they should be able to do so
without that side effect.

This is useful for setup scenarios where CXL successfully attaches, but
some other event precludes CXL operation.

Cc: Alejandro Lucero <alucerop@amd.com>
Signed-off-by: Dan Williams <djbw@kernel.org>
---
 include/cxl/cxl.h         |  1 +
 drivers/cxl/core/memdev.c | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 016c74fb747c..cff2922b0d4a 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -226,4 +226,5 @@ struct cxl_dev_state *_devm_cxl_dev_state_create(struct device *dev,
 
 struct cxl_memdev *devm_cxl_probe_mem(struct cxl_dev_state *cxlds,
 				      struct range *range);
+void devm_cxl_remove_mem(struct cxl_memdev *cxlmd);
 #endif /* __CXL_CXL_H__ */
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 33a3d2e7b13a..aa35876f0067 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -1185,6 +1185,31 @@ struct cxl_memdev *__devm_cxl_add_memdev(struct cxl_dev_state *cxlds,
 }
 EXPORT_SYMBOL_FOR_MODULES(__devm_cxl_add_memdev, "cxl_mem");
 
+static struct cxl_attach_region *cancel_probe_mem_teardown(struct cxl_memdev *cxlmd)
+{
+	struct cxl_attach_region *attach;
+
+	guard(device)(&cxlmd->dev);
+	attach = container_of(cxlmd->attach, typeof(*attach), attach);
+	cxlmd->attach = NULL;
+	return attach;
+}
+
+void devm_cxl_remove_mem(struct cxl_memdev *cxlmd)
+{
+	struct cxl_attach_region *attach;
+	struct cxl_dev_state *cxlds;
+
+	if (IS_ERR(cxlmd))
+		return;
+
+	attach = cancel_probe_mem_teardown(cxlmd);
+	cxlds = cxlmd->cxlds;
+	devm_release_action(cxlds->dev, cxl_memdev_unregister, cxlmd);
+	devm_kfree(cxlds->dev, attach);
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_remove_mem, "CXL");
+
 static void sanitize_teardown_notifier(void *data)
 {
 	struct cxl_memdev_state *mds = data;
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH v4] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Michael S. Tsirkin @ 2026-06-24 22:10 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <CAMuQ4bX-iDvcUOPPY+NLz95tkRJYwWqvzAr=U48uNaub_HZLGw@mail.gmail.com>

On Wed, Jun 24, 2026 at 03:02:02PM -0700, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. On 32-bit systems,
> where unsigned long is narrower than u64, that addition can overflow and
> the code can pin and map fewer pages than the requested IOTLB range.
> 
> Reject sizes that overflow the unsigned long page-count calculation.
> 
> Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
> vhost_vdpa_pa_unmap()")

still 2 lines?

> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> ---
> Changes in v4:
> - Keep the Fixes tag on one line.
> - Add Michael's Acked-by tag.
> 
> Changes in v3:
> - Add the Fixes tag.
> 
> Changes in v2:
> - Clarify that the overflow is on 32-bit systems.
> - Drop the unrelated memlock check change.
> 
>  drivers/vhost/vdpa.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..38b28ed3d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	unsigned int gup_flags = FOLL_LONGTERM;
>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>  	unsigned long lock_limit, sz2pin, nchunks, i;
> +	unsigned long page_offset;
>  	u64 start = iova;
>  	long pinned;
>  	int ret = 0;
> @@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	if (perm & VHOST_ACCESS_WO)
>  		gup_flags |= FOLL_WRITE;
> 
> -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> +	page_offset = iova & ~PAGE_MASK;
> +	if (size > ULONG_MAX - page_offset) {
> +		ret = -EINVAL;
> +		goto free;
> +	}
> +
> +	npages = PFN_UP(size + page_offset);
>  	if (!npages) {
>  		ret = -EINVAL;
>  		goto free;
> -- 
> 2.54.0


^ permalink raw reply

* Re: [PATCH v12 07/12] static_call: Define EXPORT_STATIC_CALL_FOR_MODULES()
From: Sean Christopherson @ 2026-06-24 22:03 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, KP Singh,
	Jiri Olsa, David S. Miller, David Laight, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, David Ahern, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	Stanislav Fomichev, Hao Luo, Paolo Bonzini, Jonathan Corbet,
	Jason Baron, Alice Ryhl, Steven Rostedt, Ard Biesheuvel,
	Shuah Khan, linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf,
	netdev, linux-doc
In-Reply-To: <20260624214955.6kkivefeuapcocib@desk>

On Wed, Jun 24, 2026, Pawan Gupta wrote:
> On Wed, Jun 24, 2026 at 05:59:19AM -0700, Sean Christopherson wrote:
> > On Tue, Jun 23, 2026, Pawan Gupta wrote:
> > > There is EXPORT_STATIC_CALL_TRAMP() that hides the static key from all
> > > modules. But there is no equivalent of EXPORT_SYMBOL_FOR_MODULES() to
> > > restrict symbol visibility to only certain modules.
> > > 
> > > Add EXPORT_STATIC_CALL_FOR_MODULES(name, mods) that wraps both the key and
> > > the trampoline with EXPORT_SYMBOL_FOR_MODULES(), allowing only a limited
> > > set of modules to see and update the static key.
> > > 
> > > The immediate user is KVM, in the following commit.
> > > 
> > > checkpatch reported below warnings with this change that I believe don't
> > > apply in this case:
> > > 
> > >   include/linux/static_call.h:219: WARNING: Non-declarative macros with multiple statements should be enclosed in a do - while loop
> > >   include/linux/static_call.h:220: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
> > > 
> > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > > ---

...

> > Drat, I forgot about this.  Exporting static call trampolines for KVM came up in
> > another conversation[*].  I had already put together patches to effectively default
> > to exporting only the trampoline, and also to deduplicate this code so that the
> > CONFIG_HAVE_STATIC_CALL_INLINE=y / CONFIG_HAVE_STATIC_CALL=y / CONFIG_HAVE_STATIC_CALL=n
> > implementations don't need to copy+paste the same lines of code.
> > 
> > The attached patches touch a lot more code, and will conflict mightily with KVM
> > changes I want to land in 7.3 (more use of a static_call in KVM).  But if we get
> > them applied (to tip tree) shortly after 7.2-rc1 and provide a topic branch/tag,
> > then there shouldn't be too much juggling needed?
> > 
> > If we want to go with the more aggressive cleanup, I'll formally post the patches.
> > 
> > [*] https://lore.kernel.org/all/ahhoDGUz39KSGZ6o@google.com
> 
> Thanks for the context.
> 
> Earlier making the key ro-after-init came up as an option in a thread with
> Peter. Does it look like a good option to you?

No, it won't work for KVM.  kvm.ko (owner of the keys) updates the keys only when
a vendor module (kvm-intel.ko or kvm-amd.ko) is loaded, and updates keys *every*
time a vendor module is loaded.  So for KVM, the static calls need to be __read_mostly,
not __ro_after_init.

> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> index b610afd1ed55..ea56da8fb446 100644
> --- a/include/linux/static_call.h
> +++ b/include/linux/static_call.h
> @@ -200,6 +200,14 @@ extern long __static_call_return0(void);
>  	};								\
>  	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
>  
> +#define DEFINE_STATIC_CALL_NULL_RO_AFTER_INIT(name, _func)		\
> +	DECLARE_STATIC_CALL(name, _func);				\
> +	struct static_call_key STATIC_CALL_KEY(name) __ro_after_init = {\
> +		.func = _func,						\
> +		.type = 1,						\
> +	};								\
> +	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
> +
>  #define DEFINE_STATIC_CALL_RET0(name, _func)				\
>  	DECLARE_STATIC_CALL(name, _func);				\
>  	struct static_call_key STATIC_CALL_KEY(name) = {		\

^ permalink raw reply

* [PATCH v4] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Yousef Alhouseen @ 2026-06-24 22:02 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez
  Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen

vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. On 32-bit systems,
where unsigned long is narrower than u64, that addition can overflow and
the code can pin and map fewer pages than the requested IOTLB range.

Reject sizes that overflow the unsigned long page-count calculation.

Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
vhost_vdpa_pa_unmap()")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
Changes in v4:
- Keep the Fixes tag on one line.
- Add Michael's Acked-by tag.

Changes in v3:
- Add the Fixes tag.

Changes in v2:
- Clarify that the overflow is on 32-bit systems.
- Drop the unrelated memlock check change.

 drivers/vhost/vdpa.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..38b28ed3d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
+	unsigned long page_offset;
 	u64 start = iova;
 	long pinned;
 	int ret = 0;
@@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;

-	npages = PFN_UP(size + (iova & ~PAGE_MASK));
+	page_offset = iova & ~PAGE_MASK;
+	if (size > ULONG_MAX - page_offset) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	npages = PFN_UP(size + page_offset);
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH v2 bpf-next 2/2] selftests/bpf: Add tests for bpf_redirect_peer with BPF_F_EGRESS
From: Paul Chaignon @ 2026-06-24 21:59 UTC (permalink / raw)
  To: Jordan Rife
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Stanislav Fomichev, Jiayuan Chen
In-Reply-To: <20260618182035.43811-3-jordan@jrife.io>

On Thu, Jun 18, 2026 at 11:20:33AM -0700, Jordan Rife wrote:
> Extend redirect tests to cover bpf_redirect_peer(BPF_F_EGRESS). SRC
> redirects to DST using bpf_redirect_peer(BPF_F_EGRESS) then traffic is
> hairpinned into DST using bpf_redirect.
> 
> Signed-off-by: Jordan Rife <jordan@jrife.io>

Acked-by: Paul Chaignon <paul.chaignon@gmail.com>


^ permalink raw reply

* Re: [PATCH v3] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Michael S. Tsirkin @ 2026-06-24 21:59 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <CAMuQ4bV8OeSTOVnAPRh6ygKdogFjqEiDNj1Vbh623KBBkZgxiw@mail.gmail.com>

On Wed, Jun 24, 2026 at 02:56:20PM -0700, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. On 32-bit systems,
> where unsigned long is narrower than u64, that addition can overflow and
> the code can pin and map fewer pages than the requested IOTLB range.
> 
> Reject sizes that overflow the unsigned long page-count calculation.
> 
> Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
> vhost_vdpa_pa_unmap()")

weirdly wrapped. will likely break some tools.

> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> Changes in v3:
> - Add the Fixes tag.
> 
> Changes in v2:
> - Clarify that the overflow is on 32-bit systems.
> - Drop the unrelated memlock check change.
> 
>  drivers/vhost/vdpa.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..38b28ed3d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	unsigned int gup_flags = FOLL_LONGTERM;
>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>  	unsigned long lock_limit, sz2pin, nchunks, i;
> +	unsigned long page_offset;
>  	u64 start = iova;
>  	long pinned;
>  	int ret = 0;
> @@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	if (perm & VHOST_ACCESS_WO)
>  		gup_flags |= FOLL_WRITE;
> 
> -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> +	page_offset = iova & ~PAGE_MASK;
> +	if (size > ULONG_MAX - page_offset) {
> +		ret = -EINVAL;
> +		goto free;
> +	}
> +
> +	npages = PFN_UP(size + page_offset);
>  	if (!npages) {
>  		ret = -EINVAL;
>  		goto free;
> -- 
> 2.54.0


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/2] bpf: Support BPF_F_EGRESS with bpf_redirect_peer
From: Paul Chaignon @ 2026-06-24 21:58 UTC (permalink / raw)
  To: Jordan Rife
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Stanislav Fomichev, Jiayuan Chen
In-Reply-To: <20260618182035.43811-2-jordan@jrife.io>

On Thu, Jun 18, 2026 at 11:20:32AM -0700, Jordan Rife wrote:
> We have several use cases where a pod injects traffic into the datapath
> of another so that the traffic appears to have originated from that
> pod. One such use case is a synthetic flow generator which injects
> synthetic traffic into a pod's datapath to enable dynamic probing and
> debugging. Another is a transparent proxy where connections originating
> from one pod are redirected towards another which proxies that
> connection. The new connection is bound to the IP of the original pod
> using IP_TRANSPARENT and its traffic is injected into that pod's
> datapath and handled as if it had originated there. This can be used for
> mTLS, etc.
> 
> We use bpf_redirect(BPF_F_INGRESS) to direct traffic leaving the proxy,
> flow generator, etc. towards the target pod, ensuring that eBPF programs
> that are meant to intercept traffic leaving that pod are executed.
> However, this doesn't work with netkit.
> 
> With netkit, an ingress redirection from proxy to workload skips eBPF
> programs that are meant to intercept traffic leaving the pod, since they
> reside on the netkit peer device. One workaround is to attach the
> same program to both the netkit peer device and the TCX ingress hook for
> the netkit pair's primary interface, but
> 
> a) This seems hacky and we need to be careful not to run the same
>    program twice for the same skb in cases where we want to pass that
>    traffic to the host stack.
> b) We're trying to keep the proxy redirection / traffic injection
>    systems as modular and separated from Cilium as possible, the system
>    that manages netkit setup and core eBPF programming.
> 
> It would be handy if instead we could redirect traffic directly from
> one netkit peer device to another. This patch proposes an extension
> to bpf_redirect_peer to allow us to do just that.
> 
> With this patch, the BPF_F_EGRESS flag tells bpf_redirect_peer to emit
> the skb in the egress direction of the target interface's peer device
> While the main use case is netkit, I suppose you could also use this
> mode with veth as well if, e.g., there were some eBPF programs attached
> to that side of the veth pair that needed to intercept traffic.
> 
>  +---------------------------------------------------------------------+
>  | +-------------------------+         6. bpf_redirect_neigh(eth0)     |
>  | | pod (10.244.0.10)       |           ------------------------      |
>  | |                         |          |                        |     |
>  | |              +--------+ |          |      +---------+       |     |
>  | | 1. packet -->|        | |          |      |         |       |     |
>  | |    leaves ^  | netkit |<===========|======| netkit  |       |     |
>  | |           |  | peer   |=======(eBPF)=====>| primary |       |     |
>  | |           |  |        | |          |      |         |       |     |
>  | |           |  +--------+ |          |      +---------+       |     |
>  | |           |             |          | 2. bpf_redirect        v     |
>  | +-----------|-------------+          |___________________   +-------|
>  |             |                                            |  | eth0  |
>  |             | 5. bpf_redirect_peer(BPF_F_EGRESS)         |  +-------|
>  |             |________________________                    |          |
>  | +-------------------------+          |                   |          |
>  | | proxy (10.244.0.11)     |          |                   |          |
>  | | IP_TRANSPARENT          |          |                   |          |
>  | |              +--------+ |          |      +---------+  |          |
>  | | 3. packet <--|        | |          |      |         |<--          |
>  | |    enters    | netkit |<===========|======| netkit  |             |
>  | |    [proxy]   | peer   |=======(eBPF)=====>| primary |             |
>  | | 4. packet -->|        | |                 |         |             |
>  | |    leaves    +--------+ |                 +---------+             |
>  | |    sip=10.244.0.10      |                                         |
>  | +-------------------------+                                         |
>  +---------------------------------------------------------------------+
> 
> Using the proxy use case as an example, in step 5 we would redirect
> traffic leaving the proxy towards the pod's peer device using
> bpf_redirect_peer(BPF_F_EGRESS).
> 
> As a bonus, since the skb doesn't have to go through the backlog queue
> it can take full advantage of netkit's performance benefits. I set up a
> test where outgoing iperf3 traffic is injected into the datapath of
> another pod using either bpf_redirect_peer(BPF_F_EGRESS) or
> bpf_redirect(BPF_F_INGRESS). I used Cilium's eBPF host routing mode
> which skips the host stack and uses BPF redirect helpers to do all the
> routing.
> 
>   (net.ipv4.tcp_congestion_control=cubic,mtu=1500,100GiB link,Cilium
>    eBPF host routing mode)
> 
> BASELINE [bpf_redirect(BPF_F_INGRESS)]
>   1. [iperf pod] ==bpf_redirect([pod b], BPF_F_INGRESS)==> [pod b]
>   2. [pod b]     ==bpf_redirect_neigh([eth0])==>           eth0
>   3. eth0        ==over network==>                         [host b]
> 
>   [ ID] Interval           Transfer     Bitrate         Retr
>   [  5]   0.00-60.00  sec   231 GBytes  33.0 Gbits/sec  12060     sender
>   [  5]   0.00-60.00  sec   230 GBytes  33.0 Gbits/sec            receiver
> 
> TEST [bpf_redirect_peer(BPF_F_EGRESS)]
>   1. [iperf pod] ==bpf_redirect_peer([pod b], BPF_F_EGRESS)==> [pod b]
>   2. [pod b]     ==bpf_redirect_neigh([eth0])==>               eth0
>   3. eth0        ==over network==>                             [host b]
> 
>   [ ID] Interval           Transfer     Bitrate         Retr
>   [  5]   0.00-60.00  sec   272 GBytes  38.9 Gbits/sec    0       sender
>   [  5]   0.00-60.00  sec   272 GBytes  38.9 Gbits/sec            receiver
> 
> In this test, using bpf_redirect_peer(BPF_F_EGRESS) for the hop from
> [iperf pod] to [pod b] led to ~18% more throughput compared to
> bpf_redirect(BPF_F_INGRESS).
> 
> Signed-off-by: Jordan Rife <jordan@jrife.io>

Acked-by: Paul Chaignon <paul.chaignon@gmail.com>


^ permalink raw reply

* Re: [PATCH net] net: udp_tunnel: fix use-after-free by refcounting udp_tunnel_nic
From: Jakub Kicinski @ 2026-06-24 21:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Paolo Abeni, Simon Horman, Ido Schimmel,
	David Ahern, netdev, eric.dumazet, Yue Sun
In-Reply-To: <20260624171034.4117423-1-edumazet@google.com>

On Wed, 24 Jun 2026 17:10:34 +0000 Eric Dumazet wrote:
> Yue Sun reported a use-after-free and debugobjects warning in
> udp_tunnel_nic_device_sync_work() during concurrent device operations.
> 
> The state flags of struct udp_tunnel_nic were originally bitfields
> sharing a byte, modified concurrently without locking (RCU vs worker).

Can you clarify the path where the bits are modified without locks??
My mental model is that this is basically all under rtnl_lock, and 
Stan added _another_ lock so that drivers can call "sync" / reply
without needing rtnl lock, but any changes are still under rtnl_lock.

The gap seems to be that we don't check pending under Stan's new lock,
since commit 1ead7501094c6 ("udp_tunnel: remove rtnl_lock dependency")
did:

+++ b/drivers/net/netdevsim/udp_tunnels.c
@@ -112,12 +112,10 @@ nsim_udp_tunnels_info_reset_write(struct file *file, const char __user *data,
        struct net_device *dev = file->private_data;
        struct netdevsim *ns = netdev_priv(dev);
 
-       rtnl_lock();
        if (dev->reg_state == NETREG_REGISTERED) {
                memset(ns->udp_ports.ports, 0, sizeof(ns->udp_ports.__ports));
                udp_tunnel_nic_reset_ntf(dev);
        }
-       rtnl_unlock();


so we just need:

diff --git a/net/ipv4/udp_tunnel_nic.c b/net/ipv4/udp_tunnel_nic.c
index 9944ed923ddf..d7db89a222f8 100644
--- a/net/ipv4/udp_tunnel_nic.c
+++ b/net/ipv4/udp_tunnel_nic.c
@@ -863,6 +863,7 @@ static void
 udp_tunnel_nic_unregister(struct net_device *dev, struct udp_tunnel_nic *utn)
 {
        const struct udp_tunnel_nic_info *info = dev->udp_tunnel_nic_info;
+       bool pending;
 
        udp_tunnel_nic_lock(dev);
 
@@ -899,12 +900,14 @@ udp_tunnel_nic_unregister(struct net_device *dev, struct udp_tunnel_nic *utn)
         * from the work which we will boot immediately.
         */
        udp_tunnel_nic_flush(dev, utn);
+
+       pending = utn->work_pending;
        udp_tunnel_nic_unlock(dev);
 
        /* Wait for the work to be done using the state, netdev core will
         * retry unregister until we give up our reference on this device.
         */
-       if (utn->work_pending)
+       if (pending)
                return;
 
        udp_tunnel_nic_free(utn);



> Even after converting to atomic bitops, a single WORK_PENDING flag
> races: the workqueue core clears the pending bit before running the
> worker. A concurrent queueing sets the flag, but the running worker
> clears it, leading to premature freeing in unregister() while the
> re-queued work is still active.

> +	if (utn->dev)
> +		dev_put(utn->dev);

nit: cocci complains that null check is not needed here.

^ permalink raw reply related

* [PATCH v3] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Yousef Alhouseen @ 2026-06-24 21:56 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez
  Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen

vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. On 32-bit systems,
where unsigned long is narrower than u64, that addition can overflow and
the code can pin and map fewer pages than the requested IOTLB range.

Reject sizes that overflow the unsigned long page-count calculation.

Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
vhost_vdpa_pa_unmap()")
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
Changes in v3:
- Add the Fixes tag.

Changes in v2:
- Clarify that the overflow is on 32-bit systems.
- Drop the unrelated memlock check change.

 drivers/vhost/vdpa.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..38b28ed3d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
+	unsigned long page_offset;
 	u64 start = iova;
 	long pinned;
 	int ret = 0;
@@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;

-	npages = PFN_UP(size + (iova & ~PAGE_MASK));
+	page_offset = iova & ~PAGE_MASK;
+	if (size > ULONG_MAX - page_offset) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	npages = PFN_UP(size + page_offset);
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH net] eth: fbnic: fix race between concurrent hwmon sensor reads
From: Andrew Lunn @ 2026-06-24 21:56 UTC (permalink / raw)
  To: Zinc Lim
  Cc: Alexander Duyck, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, alexander.duyck, kernel-team, netdev,
	linux-kernel, zinclim
In-Reply-To: <0673af7d-51a6-484c-be38-3f46849ebb30@lunn.ch>

On Wed, Jun 24, 2026 at 11:51:29PM +0200, Andrew Lunn wrote:
> On Wed, Jun 24, 2026 at 01:05:14PM -0700, Zinc Lim wrote:
> > From: Zinc Lim <limzhineng2@gmail.com>
> > 
> > Reading an hwmon sensor
> 
> Since this is a hwmon related patch, it is a good idea to Cc: the
> hwmon Maintainer.
> 
> I don't know hwmon too well, i've never had to look at locking, but...
> 
> static int hwmon_thermal_get_temp(struct thermal_zone_device *tz, int *temp)
> {
> 	struct hwmon_thermal_data *tdata = thermal_zone_device_priv(tz);
> 	struct hwmon_device *hwdev = to_hwmon_device(tdata->dev);
> 	int ret;
> 	long t;
> 
> 	guard(mutex)(&hwdev->lock);
> 
> 	ret = hwdev->chip->ops->read(tdata->dev, hwmon_temp, hwmon_temp_input,
> 				     tdata->index, &t);
> 
> Could you explain why this lock in the hwmon core is not sufficient?
> It could be i'm reading it wrong, but it looks like the lock is shared
> by all hwmon instanced registered by a
> hwmon_device_register_with_info() call.

Documentation/hwmon/hwmon-kernel-api.rst 

When using ``[devm_]hwmon_device_register_with_info()`` to register the
hardware monitoring device, accesses using the associated access functions
are serialised by the hardware monitoring core. If a driver needs locking
for other functions such as interrupt handlers or for attributes which are
fully implemented in the driver, hwmon_lock() and hwmon_unlock() can be used
to ensure that calls to those functions are serialized.

	Andrew

^ permalink raw reply

* Re: [PATCH v2] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Michael S. Tsirkin @ 2026-06-24 21:54 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <CAMuQ4bURwyoJtKUskzXpbUtrxXT1vBZZxpKpnnJY6qWsLtTBMA@mail.gmail.com>

On Wed, Jun 24, 2026 at 02:51:53PM -0700, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. On 32-bit systems,
> where unsigned long is narrower than u64, that addition can overflow and
> the code can pin and map fewer pages than the requested IOTLB range.
> 
> Reject sizes that overflow the unsigned long page-count calculation.
> 


And a Fixes: tag, please.

> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> ---
> Changes in v2:
> - Clarify that the overflow is on 32-bit systems.
> - Drop the unrelated memlock check change.
> 
>  drivers/vhost/vdpa.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..38b28ed3d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	unsigned int gup_flags = FOLL_LONGTERM;
>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>  	unsigned long lock_limit, sz2pin, nchunks, i;
> +	unsigned long page_offset;
>  	u64 start = iova;
>  	long pinned;
>  	int ret = 0;
> @@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	if (perm & VHOST_ACCESS_WO)
>  		gup_flags |= FOLL_WRITE;
> 
> -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> +	page_offset = iova & ~PAGE_MASK;
> +	if (size > ULONG_MAX - page_offset) {
> +		ret = -EINVAL;
> +		goto free;
> +	}
> +
> +	npages = PFN_UP(size + page_offset);
>  	if (!npages) {
>  		ret = -EINVAL;
>  		goto free;
> -- 
> 2.54.0


^ permalink raw reply

* [PATCH v2] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Yousef Alhouseen @ 2026-06-24 21:51 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez
  Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen

vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. On 32-bit systems,
where unsigned long is narrower than u64, that addition can overflow and
the code can pin and map fewer pages than the requested IOTLB range.

Reject sizes that overflow the unsigned long page-count calculation.

Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
Changes in v2:
- Clarify that the overflow is on 32-bit systems.
- Drop the unrelated memlock check change.

 drivers/vhost/vdpa.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..38b28ed3d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
+	unsigned long page_offset;
 	u64 start = iova;
 	long pinned;
 	int ret = 0;
@@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;

-	npages = PFN_UP(size + (iova & ~PAGE_MASK));
+	page_offset = iova & ~PAGE_MASK;
+	if (size > ULONG_MAX - page_offset) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	npages = PFN_UP(size + page_offset);
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH net] eth: fbnic: fix race between concurrent hwmon sensor reads
From: Andrew Lunn @ 2026-06-24 21:51 UTC (permalink / raw)
  To: Zinc Lim
  Cc: Alexander Duyck, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, alexander.duyck, kernel-team, netdev,
	linux-kernel, zinclim
In-Reply-To: <20260624200514.1332788-1-zinclim@meta.com>

On Wed, Jun 24, 2026 at 01:05:14PM -0700, Zinc Lim wrote:
> From: Zinc Lim <limzhineng2@gmail.com>
> 
> Reading an hwmon sensor

Since this is a hwmon related patch, it is a good idea to Cc: the
hwmon Maintainer.

I don't know hwmon too well, i've never had to look at locking, but...

static int hwmon_thermal_get_temp(struct thermal_zone_device *tz, int *temp)
{
	struct hwmon_thermal_data *tdata = thermal_zone_device_priv(tz);
	struct hwmon_device *hwdev = to_hwmon_device(tdata->dev);
	int ret;
	long t;

	guard(mutex)(&hwdev->lock);

	ret = hwdev->chip->ops->read(tdata->dev, hwmon_temp, hwmon_temp_input,
				     tdata->index, &t);

Could you explain why this lock in the hwmon core is not sufficient?
It could be i'm reading it wrong, but it looks like the lock is shared
by all hwmon instanced registered by a
hwmon_device_register_with_info() call.

	Andrew

^ permalink raw reply

* Re: [PATCH v12 07/12] static_call: Define EXPORT_STATIC_CALL_FOR_MODULES()
From: Pawan Gupta @ 2026-06-24 21:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
	David Kaplan, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, KP Singh,
	Jiri Olsa, David S. Miller, David Laight, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, David Ahern, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	Stanislav Fomichev, Hao Luo, Paolo Bonzini, Jonathan Corbet,
	Jason Baron, Alice Ryhl, Steven Rostedt, Ard Biesheuvel,
	Shuah Khan, linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf,
	netdev, linux-doc
In-Reply-To: <ajvUp_kPJBRZ7k_p@google.com>

On Wed, Jun 24, 2026 at 05:59:19AM -0700, Sean Christopherson wrote:
> On Tue, Jun 23, 2026, Pawan Gupta wrote:
> > There is EXPORT_STATIC_CALL_TRAMP() that hides the static key from all
> > modules. But there is no equivalent of EXPORT_SYMBOL_FOR_MODULES() to
> > restrict symbol visibility to only certain modules.
> > 
> > Add EXPORT_STATIC_CALL_FOR_MODULES(name, mods) that wraps both the key and
> > the trampoline with EXPORT_SYMBOL_FOR_MODULES(), allowing only a limited
> > set of modules to see and update the static key.
> > 
> > The immediate user is KVM, in the following commit.
> > 
> > checkpatch reported below warnings with this change that I believe don't
> > apply in this case:
> > 
> >   include/linux/static_call.h:219: WARNING: Non-declarative macros with multiple statements should be enclosed in a do - while loop
> >   include/linux/static_call.h:220: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
> > 
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> > ---
> >  include/linux/static_call.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> > index 78a77a4ae0ea..b610afd1ed55 100644
> > --- a/include/linux/static_call.h
> > +++ b/include/linux/static_call.h
> > @@ -216,6 +216,9 @@ extern long __static_call_return0(void);
> >  #define EXPORT_STATIC_CALL_GPL(name)					\
> >  	EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name));			\
> >  	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(name))
> > +#define EXPORT_STATIC_CALL_FOR_MODULES(name, mods)			\
> > +	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_KEY(name), mods);		\
> > +	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_TRAMP(name), mods)
> >  
> >  /* Leave the key unexported, so modules can't change static call targets: */
> >  #define EXPORT_STATIC_CALL_TRAMP(name)					\
> > @@ -276,6 +279,9 @@ extern long __static_call_return0(void);
> >  #define EXPORT_STATIC_CALL_GPL(name)					\
> >  	EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name));			\
> >  	EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(name))
> > +#define EXPORT_STATIC_CALL_FOR_MODULES(name, mods)			\
> > +	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_KEY(name), mods);		\
> > +	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_TRAMP(name), mods)
> >  
> >  /* Leave the key unexported, so modules can't change static call targets: */
> >  #define EXPORT_STATIC_CALL_TRAMP(name)					\
> > @@ -346,6 +352,8 @@ static inline int static_call_text_reserved(void *start, void *end)
> >  
> >  #define EXPORT_STATIC_CALL(name)	EXPORT_SYMBOL(STATIC_CALL_KEY(name))
> >  #define EXPORT_STATIC_CALL_GPL(name)	EXPORT_SYMBOL_GPL(STATIC_CALL_KEY(name))
> > +#define EXPORT_STATIC_CALL_FOR_MODULES(name, mods)			\
> > +	EXPORT_SYMBOL_FOR_MODULES(STATIC_CALL_KEY(name), mods)
> >  
> >  #endif /* CONFIG_HAVE_STATIC_CALL */
> 
> Drat, I forgot about this.  Exporting static call trampolines for KVM came up in
> another conversation[*].  I had already put together patches to effectively default
> to exporting only the trampoline, and also to deduplicate this code so that the
> CONFIG_HAVE_STATIC_CALL_INLINE=y / CONFIG_HAVE_STATIC_CALL=y / CONFIG_HAVE_STATIC_CALL=n
> implementations don't need to copy+paste the same lines of code.
> 
> The attached patches touch a lot more code, and will conflict mightily with KVM
> changes I want to land in 7.3 (more use of a static_call in KVM).  But if we get
> them applied (to tip tree) shortly after 7.2-rc1 and provide a topic branch/tag,
> then there shouldn't be too much juggling needed?
> 
> If we want to go with the more aggressive cleanup, I'll formally post the patches.
> 
> [*] https://lore.kernel.org/all/ahhoDGUz39KSGZ6o@google.com

Thanks for the context.

Earlier making the key ro-after-init came up as an option in a thread with
Peter. Does it look like a good option to you?

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index b610afd1ed55..ea56da8fb446 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -200,6 +200,14 @@ extern long __static_call_return0(void);
 	};								\
 	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
 
+#define DEFINE_STATIC_CALL_NULL_RO_AFTER_INIT(name, _func)		\
+	DECLARE_STATIC_CALL(name, _func);				\
+	struct static_call_key STATIC_CALL_KEY(name) __ro_after_init = {\
+		.func = _func,						\
+		.type = 1,						\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
+
 #define DEFINE_STATIC_CALL_RET0(name, _func)				\
 	DECLARE_STATIC_CALL(name, _func);				\
 	struct static_call_key STATIC_CALL_KEY(name) = {		\

^ permalink raw reply related

* Re: [PATCH] vhost/vdpa: reject overflowing PA map page counts
From: Yousef Alhouseen @ 2026-06-24 21:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <20260624153850-mutt-send-email-mst@kernel.org>

On Wed, Jun 24, 2026 at 01:53:38PM -0400, Michael S. Tsirkin wrote:
> You should add "on 32 bit systems" - I do not see how it can
> overflow on 64 bit.

Right, the overflow I was trying to cover is the unsigned long
page-count calculation on 32-bit systems, where size can be wider than
unsigned long and the page offset is added before PFN_UP(). I should
have made that scope explicit in the changelog.

> I don't see how this can happen at all - pinned_vm is in units of pages.

Agreed, that part is not needed for this fix. I'll drop the memlock
check change and send a v2 with the changelog clarified to say this is
for 32-bit systems.

Thanks,
Yousef


On Wed, 24 Jun 2026 15:53:38 -0400, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Jun 24, 2026 at 09:06:53PM +0200, Yousef Alhouseen wrote:
> > vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> > size before computing the number of pages to pin. If that addition wraps,
> > the code can pin and map fewer pages than the requested IOTLB range.
> >
> > Reject sizes that overflow the page-count calculation.
>
> You should add "on 32 bit systems" - I do not see how it can
> overflow on 64 bit.
>
> > Also make the
> > memlock check subtraction-based so a large page count cannot wrap the
> > pinned page total.
>
> I don't see how this can happen at all - pinned_vm is in units of pages.
>
> > Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> > ---
> > drivers/vhost/vdpa.c | 12 ++++++++++--
> > 1 file changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index ac55275fa..090cb8693 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -1102,6 +1102,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> > unsigned int gup_flags = FOLL_LONGTERM;
> > unsigned long npages, cur_base, map_pfn, last_pfn = 0;
> > unsigned long lock_limit, sz2pin, nchunks, i;
> > + unsigned long page_offset;
> > + u64 pinned_vm;
> > u64 start = iova;
> > long pinned;
> > int ret = 0;
> > @@ -1114,7 +1116,12 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> > if (perm & VHOST_ACCESS_WO)
> > gup_flags |= FOLL_WRITE;
> >
> > - npages = PFN_UP(size + (iova & ~PAGE_MASK));
> > + page_offset = iova & ~PAGE_MASK;
> > + if (size > ULONG_MAX - page_offset) {
> > + ret = -EINVAL;
> > + goto free;
> > + }
> > + npages = PFN_UP(size + page_offset);
> > if (!npages) {
> > ret = -EINVAL;
> > goto free;
> > @@ -1123,7 +1130,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> > mmap_read_lock(dev->mm);
> >
> > lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> > - if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
> > + pinned_vm = atomic64_read(&dev->mm->pinned_vm);
> > + if (npages > lock_limit || pinned_vm > lock_limit - npages) {
> > ret = -ENOMEM;
> > goto unlock;
> > }
> > --
> > 2.54.0

^ permalink raw reply

* Re: [PATCH net v4 2/2] net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery
From: Aleksander Jan Bajkowski @ 2026-06-24 21:44 UTC (permalink / raw)
  To: Petr Wozniak, Russell King, Andrew Lunn, Heiner Kallweit
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	netdev, linux-kernel, linux-phy, Maxime Chevallier, Bjorn Mork,
	Marek Behun
In-Reply-To: <20260624084814.20972-3-petr.wozniak@gmail.com>

Hi Petr,

W dniu 24.06.2026 o 10:48, Petr Wozniak pisze:
> commit 8fe125892f40 ("net: phy: sfp: probe for RollBall I2C-to-MDIO
> bridge in mdio-i2c") introduced a regression: the RollBall I2C-to-MDIO
> bridge is not yet ready to respond to CMD_READ/CMD_DONE cycles when
> sfp_sm_add_mdio_bus() runs in SFP_S_INIT.  The 200 ms probe times out,
> i2c_mii_probe_rollball() returns -ENODEV, and sfp_sm_add_mdio_bus()
> sets mdio_protocol = MDIO_I2C_NONE.  By the time sfp_sm_probe_for_phy()
> runs (up to ~17 s later on affected hardware), the bridge is fully
> initialized but PHY probing is skipped because the protocol has already
> been changed to NONE.
>
> This affects both modules inserted before boot and hotplugged modules on
> hardware where bridge initialization exceeds the 200 ms probe window
> (confirmed: FLYPRO SFP-10GT-CS-30M with Aquantia AQR113C, hotplugged).
>
> Move the probe from i2c_mii_init_rollball(), called at bus-creation time,
> to sfp_sm_probe_for_phy() in sfp.c, where it runs after the SFP state
> machine module initialization delays.  Export the probe function as
> mdio_i2c_probe_rollball() so sfp.c can call it.
>
> For RTL8261BE-based modules the probe correctly returns -ENODEV at PHY
> discovery time, causing sfp_sm_probe_for_phy() to destroy the MDIO bus
> and set MDIO_I2C_NONE, eliminating the 5+ minute PHY probe retry loop.
>
> For genuine RollBall modules (e.g. FLYPRO SFP-10GT-CS-30M with Aquantia
> AQR113C) the probe now runs after initialization is complete and
> correctly returns 0, so PHY detection proceeds normally.
The FLPRO SFP module still fails to detect the PHY. It is necessary to
increase `module_t_wait` to 20 seconds. Most likely, during this time
the module loads the PHY firmware from SPI memory or from the
microcontroller (rollball bridge) via MDIO. Same probably applies to
most SFP modules with a PHY that load firmware at start-up (AQR113,
RTL8261C etc.).
>
> Reported-by: Aleksander Bajkowski <olek2@wp.pl>
> Fixes: 8fe125892f40 ("net: phy: sfp: probe for RollBall I2C-to-MDIO bridge in mdio-i2c")
> Signed-off-by: Petr Wozniak <petr.wozniak@gmail.com>
> ---
>   drivers/net/mdio/mdio-i2c.c   | 15 +++++++++------
>   drivers/net/phy/sfp.c         | 22 ++++++++++++++--------
>   include/linux/mdio/mdio-i2c.h |  1 +
>   3 files changed, 24 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/mdio/mdio-i2c.c b/drivers/net/mdio/mdio-i2c.c
> index b88f63234b4e..2a3a418c1369 100644
> --- a/drivers/net/mdio/mdio-i2c.c
> +++ b/drivers/net/mdio/mdio-i2c.c
> @@ -419,7 +419,7 @@ static int i2c_mii_write_rollball(struct mii_bus *bus, int phy_id, int devad,
>   	return 0;
>   }
>   
> -static int i2c_mii_probe_rollball(struct i2c_adapter *i2c)
> +int mdio_i2c_probe_rollball(struct i2c_adapter *i2c)
>   {
>   	u8 data_buf[] = { ROLLBALL_DATA_ADDR, 0x01, 0x00, 0x00 };
>   	u8 cmd_buf[]  = { ROLLBALL_CMD_ADDR, ROLLBALL_CMD_READ };
> @@ -462,9 +462,13 @@ static int i2c_mii_probe_rollball(struct i2c_adapter *i2c)
>   
>   	return -ENODEV;
>   }
> +EXPORT_SYMBOL_GPL(mdio_i2c_probe_rollball);
>   
>   static int i2c_mii_init_rollball(struct i2c_adapter *i2c)
>   {
> +	/* Send the RollBall unlock password; bridge presence is verified
> +	 * later, in sfp_sm_probe_for_phy(), after module initialization.
> +	 */
>   	struct i2c_msg msg;
>   	u8 pw[5];
>   	int ret;
> @@ -486,7 +490,7 @@ static int i2c_mii_init_rollball(struct i2c_adapter *i2c)
>   	if (ret != 1)
>   		return -EIO;
>   
> -	return i2c_mii_probe_rollball(i2c);
> +	return 0;
>   }
>   
>   static bool mdio_i2c_check_functionality(struct i2c_adapter *i2c,
> @@ -531,10 +535,9 @@ struct mii_bus *mdio_i2c_alloc(struct device *parent, struct i2c_adapter *i2c,
>   	case MDIO_I2C_ROLLBALL:
>   		ret = i2c_mii_init_rollball(i2c);
>   		if (ret < 0) {
> -			if (ret != -ENODEV)
> -				dev_err(parent,
> -					"Cannot initialize RollBall MDIO I2C protocol: %d\n",
> -					ret);
> +			dev_err(parent,
> +				"Cannot initialize RollBall MDIO I2C protocol: %d\n",
> +				ret);
>   			mdiobus_free(mii);
>   			return ERR_PTR(ret);
>   		}
> diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
> index c4d274ab651e..01b941a38eed 100644
> --- a/drivers/net/phy/sfp.c
> +++ b/drivers/net/phy/sfp.c
> @@ -2174,17 +2174,10 @@ static void sfp_sm_fault(struct sfp *sfp, unsigned int next_state, bool warn)
>   
>   static int sfp_sm_add_mdio_bus(struct sfp *sfp)
>   {
> -	int ret;
> -
>   	if (sfp->mdio_protocol == MDIO_I2C_NONE)
>   		return 0;
>   
> -	ret = sfp_i2c_mdiobus_create(sfp);
> -	if (ret == -ENODEV) {
> -		sfp->mdio_protocol = MDIO_I2C_NONE;
> -		return 0;
> -	}
> -	return ret;
> +	return sfp_i2c_mdiobus_create(sfp);
>   }
>   
>   /* Probe a SFP for a PHY device if the module supports copper - the PHY
> @@ -2215,6 +2208,19 @@ static int sfp_sm_probe_for_phy(struct sfp *sfp)
>   		break;
>   
>   	case MDIO_I2C_ROLLBALL:
> +		/* Probe here, after module initialization delays, so that
> +		 * genuine RollBall bridges have had time to start up.
> +		 * Modules without a bridge (e.g. RTL8261BE) return -ENODEV.
> +		 */
> +		err = mdio_i2c_probe_rollball(sfp->i2c);
> +		if (err == -ENODEV) {
> +			sfp_i2c_mdiobus_destroy(sfp);
> +			sfp->mdio_protocol = MDIO_I2C_NONE;
> +			err = 0;
> +			break;
> +		}
> +		if (err)
> +			break;
>   		err = sfp_sm_probe_phy(sfp, SFP_PHY_ADDR_ROLLBALL, true);
>   		break;
>   	}
> diff --git a/include/linux/mdio/mdio-i2c.h b/include/linux/mdio/mdio-i2c.h
> index 65b550a6fc32..5cf14f45c94b 100644
> --- a/include/linux/mdio/mdio-i2c.h
> +++ b/include/linux/mdio/mdio-i2c.h
> @@ -20,5 +20,6 @@ enum mdio_i2c_proto {
>   
>   struct mii_bus *mdio_i2c_alloc(struct device *parent, struct i2c_adapter *i2c,
>   			       enum mdio_i2c_proto protocol);
> +int mdio_i2c_probe_rollball(struct i2c_adapter *i2c);
>   
>   #endif

Best regards,
Aleksander


^ permalink raw reply

* Re: [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: Kuniyuki Iwashima @ 2026-06-24 21:39 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: Willem de Bruijn, Jakub Sitnicki, John Fastabend, Jiayuan Chen,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Alexei Starovoitov, Cong Wang, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, netdev, bpf, linux-kernel,
	linux-kselftest
In-Reply-To: <CAAVpQUBHMrSiWqn0Zo_D7sUCLFdkKggduoijRdFtHx5CPiSpsg@mail.gmail.com>

On Wed, Jun 24, 2026 at 2:33 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Wed, Jun 24, 2026 at 2:26 PM Michal Luczaj <mhal@rbox.co> wrote:
> >
> > On 6/24/26 22:01, Willem de Bruijn wrote:
> > > Jakub Sitnicki wrote:
> > >> On Tue, Jun 23, 2026 at 08:03 PM +02, Michal Luczaj wrote:
> > >>> UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
> > >>> sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.
> > >>>
> > >>> Because sockmap accepts unbound UDP sockets, a BPF program can increment a
> > >>> socket's refcount via lookup. If the socket is subsequently bound, the
> > >>> transition from unbound to bound causes bpf_sk_release() to skip the
> > >>> decrement of the refcount, causing a memory leak.
> > >>>
> > >>> unreferenced object 0xffff88810bc2eb40 (size 1984):
> > >>>   comm "test_progs", pid 2451, jiffies 4295320596
> > >>>   hex dump (first 32 bytes):
> > >>>     7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00  ................
> > >>>     02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
> > >>>   backtrace (crc bdee079d):
> > >>>     kmem_cache_alloc_noprof+0x557/0x660
> > >>>     sk_prot_alloc+0x69/0x240
> > >>>     sk_alloc+0x30/0x460
> > >>>     inet_create+0x2ce/0xf80
> > >>>     __sock_create+0x25b/0x5c0
> > >>>     __sys_socket+0x119/0x1d0
> > >>>     __x64_sys_socket+0x72/0xd0
> > >>>     do_syscall_64+0xa1/0x5f0
> > >>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >>>
> > >>> Maintain balanced refcounts across sk lookup/release: (re-)set
> > >>> SOCK_RCU_FREE on proto update to treat the socket (whether bound or
> > >>> unbound) as not requiring a refcount increment on (a RCU protected) lookup.
> > >>>
> > >>> Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
> > >>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> > >>> ---
> > >>> Note: this issue is related to commit 67312adc96b5 ("bpf: reject unhashed
> > >>> sockets in bpf_sk_assign").
> > >>> ---
> > >>>  net/ipv4/udp_bpf.c | 3 +++
> > >>>  1 file changed, 3 insertions(+)
> > >>>
> > >>> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
> > >>> index ad57c4c9eaab..970327b59582 100644
> > >>> --- a/net/ipv4/udp_bpf.c
> > >>> +++ b/net/ipv4/udp_bpf.c
> > >>> @@ -173,6 +173,9 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
> > >>>     if (sk->sk_family == AF_INET6)
> > >>>             udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
> > >>>
> > >>> +   /* Treat all sockets as non-refcounted, regardless of binding state. */
> > >>> +   sock_set_flag(sk, SOCK_RCU_FREE);
> > >>> +
> > >>>     sock_replace_proto(sk, &udp_bpf_prots[family]);
> > >>>     return 0;
> > >>>  }
> > >>
> > >> There is a side effect that an unhashed (unbound) UDP socket can now be
> > >> selected in sk_lookup with bpf_sk_assign.
> > >
> > > The commit does mention a related fix, beneath the ---, commit
> > > 67312adc96b5 ("bpf: reject unhashed sockets in bpf_sk_assign").
> > > That fixes a similar issue by exactly disallowing this:
> > >
> > >     Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
> > >     This matches the behaviour of __inet_lookup_skb which is ultimately
> > >     the goal of bpf_sk_assign().
> > >
> > > So ..
> > >
> > >> Though perhaps that's for the
> > >> better because TC bpf_sk_assign doesn't reject non-refcounted UDP
> > >> sockets either, so we would have both socket dispatch sites behave the
> > >> same way.
> > >
> > > .. there are two conflicting types of consistency here? Consistent with
> > > __inet_lookup_skb or the TC bpf hook. Of those the first is the more
> > > canonical.
> > >
> > >> Also, with this patch, if we insert & remove an unhashed UDP socket
> > >> into/from a sockmap, we end up with an unhashed non-refcounted UDP
> > >> socket. Not entirely sure if that is actually a problem or not.
> > >>
> > >> Willem, what is your take on having unhashed non-refcoted UDP sockets?
> > >
> > > I don't immediately see a problem, but I'm not an expert on SOCK_RCU_FREE.
> >
> > Perhaps it's worth mentioning that unhashed non-refcounted UDP socket is
> > already possible: first auto-bind via connect(AF_INET) (which also sets
> > SOCK_RCU_FREE), then unhash via connect(AF_UNSPEC).
>
> Setting SOCK_RCU_FREE itself should not cause a problem, but I think
> we should take a step back.
>
> AFAIU, 0c48eefae712 was to allow putting AF_UNIX SOCK_DGRAM sockets
> into sockmap, not to allow using unconnected UDP sockets in sk_lookup etc.
>
> Actually, v4 of the patch was implemented as such but did not get any feedback,
> https://lore.kernel.org/bpf/20210508220835.53801-9-xiyou.wangcong@gmail.com/#t
>
> ... and v5 (the final commit) somehow removed the restriction for unconnected
> UDP socket as well.
> https://lore.kernel.org/bpf/20210704190252.11866-3-xiyou.wangcong@gmail.com/
>
> Given the initial use case, sockmap redirect, is still blocked by
> TCP_ESTABLISHED
> check in sock_map_redirect_allowed(), I feel there is no point in supporting
> unconnected UDP sockets in sockmap.  It cannot get any skb from anywhere
> (without buggy sk_lookup).

s/unconnected/unhashed/g :)

^ permalink raw reply

* Re: [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: Kuniyuki Iwashima @ 2026-06-24 21:33 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: Willem de Bruijn, Jakub Sitnicki, John Fastabend, Jiayuan Chen,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Alexei Starovoitov, Cong Wang, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, netdev, bpf, linux-kernel,
	linux-kselftest
In-Reply-To: <dd065bfb-52ce-48fd-b1ef-9c6166f714ed@rbox.co>

On Wed, Jun 24, 2026 at 2:26 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> On 6/24/26 22:01, Willem de Bruijn wrote:
> > Jakub Sitnicki wrote:
> >> On Tue, Jun 23, 2026 at 08:03 PM +02, Michal Luczaj wrote:
> >>> UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
> >>> sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.
> >>>
> >>> Because sockmap accepts unbound UDP sockets, a BPF program can increment a
> >>> socket's refcount via lookup. If the socket is subsequently bound, the
> >>> transition from unbound to bound causes bpf_sk_release() to skip the
> >>> decrement of the refcount, causing a memory leak.
> >>>
> >>> unreferenced object 0xffff88810bc2eb40 (size 1984):
> >>>   comm "test_progs", pid 2451, jiffies 4295320596
> >>>   hex dump (first 32 bytes):
> >>>     7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00  ................
> >>>     02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
> >>>   backtrace (crc bdee079d):
> >>>     kmem_cache_alloc_noprof+0x557/0x660
> >>>     sk_prot_alloc+0x69/0x240
> >>>     sk_alloc+0x30/0x460
> >>>     inet_create+0x2ce/0xf80
> >>>     __sock_create+0x25b/0x5c0
> >>>     __sys_socket+0x119/0x1d0
> >>>     __x64_sys_socket+0x72/0xd0
> >>>     do_syscall_64+0xa1/0x5f0
> >>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >>>
> >>> Maintain balanced refcounts across sk lookup/release: (re-)set
> >>> SOCK_RCU_FREE on proto update to treat the socket (whether bound or
> >>> unbound) as not requiring a refcount increment on (a RCU protected) lookup.
> >>>
> >>> Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
> >>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> >>> ---
> >>> Note: this issue is related to commit 67312adc96b5 ("bpf: reject unhashed
> >>> sockets in bpf_sk_assign").
> >>> ---
> >>>  net/ipv4/udp_bpf.c | 3 +++
> >>>  1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
> >>> index ad57c4c9eaab..970327b59582 100644
> >>> --- a/net/ipv4/udp_bpf.c
> >>> +++ b/net/ipv4/udp_bpf.c
> >>> @@ -173,6 +173,9 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
> >>>     if (sk->sk_family == AF_INET6)
> >>>             udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
> >>>
> >>> +   /* Treat all sockets as non-refcounted, regardless of binding state. */
> >>> +   sock_set_flag(sk, SOCK_RCU_FREE);
> >>> +
> >>>     sock_replace_proto(sk, &udp_bpf_prots[family]);
> >>>     return 0;
> >>>  }
> >>
> >> There is a side effect that an unhashed (unbound) UDP socket can now be
> >> selected in sk_lookup with bpf_sk_assign.
> >
> > The commit does mention a related fix, beneath the ---, commit
> > 67312adc96b5 ("bpf: reject unhashed sockets in bpf_sk_assign").
> > That fixes a similar issue by exactly disallowing this:
> >
> >     Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
> >     This matches the behaviour of __inet_lookup_skb which is ultimately
> >     the goal of bpf_sk_assign().
> >
> > So ..
> >
> >> Though perhaps that's for the
> >> better because TC bpf_sk_assign doesn't reject non-refcounted UDP
> >> sockets either, so we would have both socket dispatch sites behave the
> >> same way.
> >
> > .. there are two conflicting types of consistency here? Consistent with
> > __inet_lookup_skb or the TC bpf hook. Of those the first is the more
> > canonical.
> >
> >> Also, with this patch, if we insert & remove an unhashed UDP socket
> >> into/from a sockmap, we end up with an unhashed non-refcounted UDP
> >> socket. Not entirely sure if that is actually a problem or not.
> >>
> >> Willem, what is your take on having unhashed non-refcoted UDP sockets?
> >
> > I don't immediately see a problem, but I'm not an expert on SOCK_RCU_FREE.
>
> Perhaps it's worth mentioning that unhashed non-refcounted UDP socket is
> already possible: first auto-bind via connect(AF_INET) (which also sets
> SOCK_RCU_FREE), then unhash via connect(AF_UNSPEC).

Setting SOCK_RCU_FREE itself should not cause a problem, but I think
we should take a step back.

AFAIU, 0c48eefae712 was to allow putting AF_UNIX SOCK_DGRAM sockets
into sockmap, not to allow using unconnected UDP sockets in sk_lookup etc.

Actually, v4 of the patch was implemented as such but did not get any feedback,
https://lore.kernel.org/bpf/20210508220835.53801-9-xiyou.wangcong@gmail.com/#t

... and v5 (the final commit) somehow removed the restriction for unconnected
UDP socket as well.
https://lore.kernel.org/bpf/20210704190252.11866-3-xiyou.wangcong@gmail.com/

Given the initial use case, sockmap redirect, is still blocked by
TCP_ESTABLISHED
check in sock_map_redirect_allowed(), I feel there is no point in supporting
unconnected UDP sockets in sockmap.  It cannot get any skb from anywhere
(without buggy sk_lookup).

^ permalink raw reply

* Re: [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: Michal Luczaj @ 2026-06-24 21:25 UTC (permalink / raw)
  To: Willem de Bruijn, Jakub Sitnicki
  Cc: John Fastabend, Jiayuan Chen, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Alexei Starovoitov,
	Cong Wang, Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan, netdev,
	bpf, linux-kernel, linux-kselftest, kuniyu
In-Reply-To: <willemdebruijn.kernel.24d11e11d5dc0@gmail.com>

On 6/24/26 22:01, Willem de Bruijn wrote:
> Jakub Sitnicki wrote:
>> On Tue, Jun 23, 2026 at 08:03 PM +02, Michal Luczaj wrote:
>>> UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
>>> sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.
>>>
>>> Because sockmap accepts unbound UDP sockets, a BPF program can increment a
>>> socket's refcount via lookup. If the socket is subsequently bound, the
>>> transition from unbound to bound causes bpf_sk_release() to skip the
>>> decrement of the refcount, causing a memory leak.
>>>
>>> unreferenced object 0xffff88810bc2eb40 (size 1984):
>>>   comm "test_progs", pid 2451, jiffies 4295320596
>>>   hex dump (first 32 bytes):
>>>     7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00  ................
>>>     02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
>>>   backtrace (crc bdee079d):
>>>     kmem_cache_alloc_noprof+0x557/0x660
>>>     sk_prot_alloc+0x69/0x240
>>>     sk_alloc+0x30/0x460
>>>     inet_create+0x2ce/0xf80
>>>     __sock_create+0x25b/0x5c0
>>>     __sys_socket+0x119/0x1d0
>>>     __x64_sys_socket+0x72/0xd0
>>>     do_syscall_64+0xa1/0x5f0
>>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> Maintain balanced refcounts across sk lookup/release: (re-)set
>>> SOCK_RCU_FREE on proto update to treat the socket (whether bound or
>>> unbound) as not requiring a refcount increment on (a RCU protected) lookup.
>>>
>>> Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
>>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>>> ---
>>> Note: this issue is related to commit 67312adc96b5 ("bpf: reject unhashed
>>> sockets in bpf_sk_assign").
>>> ---
>>>  net/ipv4/udp_bpf.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
>>> index ad57c4c9eaab..970327b59582 100644
>>> --- a/net/ipv4/udp_bpf.c
>>> +++ b/net/ipv4/udp_bpf.c
>>> @@ -173,6 +173,9 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>>>  	if (sk->sk_family == AF_INET6)
>>>  		udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
>>>  
>>> +	/* Treat all sockets as non-refcounted, regardless of binding state. */
>>> +	sock_set_flag(sk, SOCK_RCU_FREE);
>>> +
>>>  	sock_replace_proto(sk, &udp_bpf_prots[family]);
>>>  	return 0;
>>>  }
>>
>> There is a side effect that an unhashed (unbound) UDP socket can now be
>> selected in sk_lookup with bpf_sk_assign.
> 
> The commit does mention a related fix, beneath the ---, commit
> 67312adc96b5 ("bpf: reject unhashed sockets in bpf_sk_assign").
> That fixes a similar issue by exactly disallowing this:
> 
>     Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
>     This matches the behaviour of __inet_lookup_skb which is ultimately
>     the goal of bpf_sk_assign().
> 
> So ..
> 
>> Though perhaps that's for the
>> better because TC bpf_sk_assign doesn't reject non-refcounted UDP
>> sockets either, so we would have both socket dispatch sites behave the
>> same way.
> 
> .. there are two conflicting types of consistency here? Consistent with
> __inet_lookup_skb or the TC bpf hook. Of those the first is the more
> canonical.
> 
>> Also, with this patch, if we insert & remove an unhashed UDP socket
>> into/from a sockmap, we end up with an unhashed non-refcounted UDP
>> socket. Not entirely sure if that is actually a problem or not.
>>
>> Willem, what is your take on having unhashed non-refcoted UDP sockets?
> 
> I don't immediately see a problem, but I'm not an expert on SOCK_RCU_FREE.

Perhaps it's worth mentioning that unhashed non-refcounted UDP socket is
already possible: first auto-bind via connect(AF_INET) (which also sets
SOCK_RCU_FREE), then unhash via connect(AF_UNSPEC).

^ permalink raw reply

* Re: [PATCH bpf 2/2] selftests/bpf: Add test for UDP sock leak on sockmap lookup-bind-release
From: Michal Luczaj @ 2026-06-24 21:24 UTC (permalink / raw)
  To: bot+bpf-ci, john.fastabend, jakub, jiayuan.chen, davem, edumazet,
	kuba, pabeni, horms, ast, cong.wang, daniel, andrii, eddyz87,
	memxor, martin.lau, song, yonghong.song, jolsa, emil, shuah
  Cc: netdev, bpf, linux-kernel, linux-kselftest, martin.lau, clm,
	ihor.solodrai
In-Reply-To: <bd4abd3f584c38f1fc512a9d3ae1f90c2a374588fa23529b32b7f1aed6ab2ead@mail.kernel.org>

On 6/23/26 21:32, bot+bpf-ci@kernel.org wrote:
>> selftests/bpf: Add test for UDP sock leak on sockmap lookup-bind-release
>>
>>     Setup and join a cgroup, then attach a cgroup/connect4 program that runs
>>
>>        sk = bpf_map_lookup_elem(sockmap, 0)
>>        bpf_bind(ctx, sa, sizeof(sa))
>>        bpf_sk_release(sk)
>>
>>     Unpatched kernel leaks the socket.
>>
>>     Signed-off-by: Michal Luczaj <mhal@rbox.co>
> 
> This test reproduces a UDP socket leak across sockmap lookup-bind-release,
> but there is no Fixes: tag.
> 
> The sibling fix commit a2510dc351c5 ("bpf, sockmap: Don't leak UDP socks on
> lookup-bind-release") carries a Fixes: tag pointing at the commit that lifted
> the socket-state restriction allowing unbound UDP sockets into sockmap.
> 
> Should this test carry the same tag so it stays backportable alongside the
> fix?
> 
>   Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")

I am skipping this suggestion as the test itself does not address any bugs
and requires no backporting.

Michal

^ permalink raw reply

* Re: [PATCH nf-next 1/4] netfilter: nf_conntrack_sane: replace u_int16_t with u16
From: Florian Westphal @ 2026-06-24 21:00 UTC (permalink / raw)
  To: Carlos Grillet
  Cc: Pablo Neira Ayuso, Phil Sutter, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, netfilter-devel,
	coreteam, netdev, linux-kernel
In-Reply-To: <20260624184036.71051-2-carlos@carlosgrillet.me>

Carlos Grillet <carlos@carlosgrillet.me> wrote:
> Use preferred kernel integer type u16 instead of the POSIX u_int16_t
> variant.
> 
> No functional change.
> 
> Signed-off-by: Carlos Grillet <carlos@carlosgrillet.me>
> ---
>  net/netfilter/nf_conntrack_sane.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/nf_conntrack_sane.c b/net/netfilter/nf_conntrack_sane.c
> index 39085acf7a71..130b3e68090e 100644
> --- a/net/netfilter/nf_conntrack_sane.c
> +++ b/net/netfilter/nf_conntrack_sane.c
> @@ -35,7 +35,7 @@ MODULE_DESCRIPTION("SANE connection tracking helper");
>  MODULE_ALIAS_NFCT_HELPER(HELPER_NAME);
>  
>  #define MAX_PORTS 8
> -static u_int16_t ports[MAX_PORTS];
> +static u16 ports[MAX_PORTS];

These port variables are useless and will be removed soon.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox