Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] net: gianfar: use of_irq_get()
From: Simon Horman @ 2026-06-25  9:36 UTC (permalink / raw)
  To: rosenp
  Cc: Simon Horman, netdev, claudiu.manoil, andrew+netdev, davem,
	edumazet, kuba, pabeni, afleming, linux-kernel
In-Reply-To: <20260624032122.986986-1-rosenp@gmail.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
net: gianfar: use of_irq_get()

This patch switches gfar_parse_group() from irq_of_parse_and_map() to
of_irq_get() and widens the irq field in struct gfar_irqinfo from
unsigned int to int. The stated goals are to fix a leak from a missing
irq_dispose_mapping() and to enable -EPROBE_DEFER propagation.

> diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
> index 3271de5844f8..17a0d0787ed2 100644
> --- a/drivers/net/ethernet/freescale/gianfar.c
> +++ b/drivers/net/ethernet/freescale/gianfar.c
> @@ -514,15 +514,15 @@ static int gfar_parse_group(struct device_node *np,
>  	if (!grp->regs)
>  		return -ENOMEM;
>
> -	gfar_irq(grp, TX)->irq = irq_of_parse_and_map(np, 0);
> +	gfar_irq(grp, TX)->irq = of_irq_get(np, 0);
>
>  	/* If we aren't the FEC we have multiple interrupts */
>  	if (model && strcasecmp(model, "FEC")) {
> -		gfar_irq(grp, RX)->irq = irq_of_parse_and_map(np, 1);
> -		gfar_irq(grp, ER)->irq = irq_of_parse_and_map(np, 2);
> -		if (!gfar_irq(grp, TX)->irq ||
> -		    !gfar_irq(grp, RX)->irq ||
> -		    !gfar_irq(grp, ER)->irq)
> +		gfar_irq(grp, RX)->irq = of_irq_get(np, 1);
> +		gfar_irq(grp, ER)->irq = of_irq_get(np, 2);
> +		if (gfar_irq(grp, TX)->irq < 0 ||
> +		    gfar_irq(grp, RX)->irq < 0 ||
> +		    gfar_irq(grp, ER)->irq < 0)
>  			return -EINVAL;
>  	}

[High]
Does this change miss the 0-return mapping failure case?

The kernel-doc on of_irq_get() in drivers/of/irq.c says it returns:

    Linux IRQ number on success, or 0 on the IRQ mapping failure, or
    -EPROBE_DEFER ..., or error code in case of any other failure.

The old test rejected an irq value of 0 with !gfar_irq(grp, *)->irq,
which caught both irq_of_parse_and_map()'s and of_irq_get()'s
zero-on-failure. The new test only rejects values strictly less than
zero, so a 0 returned from of_irq_get() is now stored as the IRQ and
later flows into request_irq()/free_irq() during probe and unbind.

Should the test be <= 0, or split so that negative errors are
propagated and 0 still triggers a failure?

[High]
Is -EPROBE_DEFER lost here?

The commit message states:

    Changed irq to an int as required by the of_irq_get API as it
    supports -EPROBE_DEFER.

However, the only error path is:

    if (gfar_irq(grp, TX)->irq < 0 ||
        gfar_irq(grp, RX)->irq < 0 ||
        gfar_irq(grp, ER)->irq < 0)
            return -EINVAL;

of_irq_get() returns -EPROBE_DEFER when irq_find_host() cannot yet
locate the parent interrupt controller. Collapsing every negative
return into -EINVAL prevents the driver core from retrying the probe
once the interrupt controller appears.

Should the actual error from of_irq_get() be propagated, for example:

    int ret = gfar_irq(grp, TX)->irq;
    if (ret < 0)
            return ret;

so that -EPROBE_DEFER reaches gfar_of_init() and then gfar_probe()?

[Medium]
Can the commit message description be revisited?

The message states:

    of_irq_get() differs from irq_of_parse_and_map() in that the latter
    requires calling irq_dispose_mapping() when done, which is missing
    in the driver. Meaning it leaks memory.

    No need to map it anyway. Just need the value stored in the irq
    field.

Looking at of_irq_get() in drivers/of/irq.c:

    int of_irq_get(struct device_node *dev, int index)
    {
            ...
            rc = irq_create_of_mapping(&oirq);
            ...
    }

of_irq_get() ends with the same irq_create_of_mapping() call that
irq_of_parse_and_map() makes, so the resulting virq mapping is created
identically by both APIs.

If the previous code leaked, does the new code not leak in the same
way? And does the Fixes: b31a1d8b4151 tag accurately describe this
change, given that it may cause stable-tree selection on the basis of a
leak that does not appear to exist?

^ permalink raw reply

* Re: [PATCH net] net: libwx: fix VMDQ mask for 1-queue mode
From: Larysa Zaremba @ 2026-06-25  9:39 UTC (permalink / raw)
  To: Jiawen Wu
  Cc: netdev, Mengyuan Lou, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Kees Cook
In-Reply-To: <60D88ADD3E295420+20260625090851.539640-1-jiawenwu@trustnetic.com>

On Thu, Jun 25, 2026 at 05:08:51PM +0800, Jiawen Wu wrote:
> In wx_set_vmdq_queues(), the VMDQ mask was not set for the devices not
> support WX_FLAG_MULTI_64_FUNC, i.e., NGBE devices. A mask of 0 causes
> __ALIGN_MASK(1, ~vmdq->mask) to return 0, which incorrectly sets
> q_per_pool to 0 in wx_write_qde().
> 
> Fix the VMDQ 1-queue mask to 0x7F then ensures that __ALIGN_MASK(1,
> 0x7F) correctly evaluates to 1.

__ALIGN_MASK(1, 0x7F) evaulates to 0x80 (128), not to 1. __ALIGN_MASK(1, 0x7E) 
evaluates to 1. Maybe you need 0x7D for 2 queues and 0x7E for 1 queue?

> 
> Fixes: c52d4b898901 ("net: libwx: Redesign flow when sriov is enabled")
> Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
> ---
>  drivers/net/ethernet/wangxun/libwx/wx_lib.c  | 1 +
>  drivers/net/ethernet/wangxun/libwx/wx_type.h | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> index d042567b8128..814d88d2aee4 100644
> --- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> +++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> @@ -1802,6 +1802,7 @@ static bool wx_set_vmdq_queues(struct wx *wx)
>  			rss_i = 4;
>  		}
>  	} else {
> +		vmdq_m = WX_VMDQ_1Q_MASK;
>  		/* double check we are limited to maximum pools */
>  		vmdq_i = min_t(u16, 8, vmdq_i);
>  
> diff --git a/drivers/net/ethernet/wangxun/libwx/wx_type.h b/drivers/net/ethernet/wangxun/libwx/wx_type.h
> index c7befe4cdfe9..65e3e55db1cf 100644
> --- a/drivers/net/ethernet/wangxun/libwx/wx_type.h
> +++ b/drivers/net/ethernet/wangxun/libwx/wx_type.h
> @@ -486,6 +486,7 @@ enum WX_MSCA_CMD_value {
>  
>  #define WX_VMDQ_4Q_MASK              0x7C
>  #define WX_VMDQ_2Q_MASK              0x7E
> +#define WX_VMDQ_1Q_MASK              0x7F
>  
>  /****************** Manageablility Host Interface defines ********************/
>  #define WX_HI_MAX_BLOCK_BYTE_LENGTH  256 /* Num of bytes in range */
> -- 
> 2.51.0
> 

^ permalink raw reply

* [PATCH net] net: airoha: dma map xmit frags with skb_frag_dma_map()
From: Lorenzo Bianconi @ 2026-06-25  9:42 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: linux-arm-kernel, linux-mediatek, netdev, Lorenzo Bianconi

Map xmit skb fragments using skb_frag_dma_map() instead of
dma_map_single(skb_frag_address()). skb_frag_address() relies on
page_address() to obtain a kernel virtual address, which is not
guaranteed to work for all page types (e.g. highmem pages or
user-pinned pages from MSG_ZEROCOPY).
skb_frag_dma_map() maps the fragment directly via its struct page and
offset through dma_map_page(), avoiding the need for a kernel virtual
address entirely.
Introduce an enum airoha_dma_map_type to track how each queue entry was
mapped (single vs page), so that the matching unmap function is called
on completion and in error paths.

Fixes: 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/airoha/airoha_eth.c | 61 ++++++++++++++++++++------------
 drivers/net/ethernet/airoha/airoha_eth.h |  7 ++++
 2 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 932b3a3df2e5..1caf6766f2c0 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -944,6 +944,25 @@ static void airoha_qdma_wake_netdev_txqs(struct airoha_queue *q)
 	q->txq_stopped = false;
 }
 
+static void airoha_unmap_xmit_buf(struct airoha_eth *eth,
+				  struct airoha_queue_entry *e)
+{
+	switch (e->dma_type) {
+	case AIROHA_DMA_MAP_PAGE:
+		dma_unmap_page(eth->dev, e->dma_addr, e->dma_len,
+			       DMA_TO_DEVICE);
+		break;
+	case AIROHA_DMA_MAP_SINGLE:
+		dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
+				 DMA_TO_DEVICE);
+		break;
+	case AIROHA_DMA_UNMAPPED:
+	default:
+		break;
+	}
+	e->dma_type = AIROHA_DMA_UNMAPPED;
+}
+
 static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
 {
 	struct airoha_tx_irq_queue *irq_q;
@@ -1006,9 +1025,7 @@ static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
 		skb = e->skb;
 		e->skb = NULL;
 
-		dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
-				 DMA_TO_DEVICE);
-		e->dma_addr = 0;
+		airoha_unmap_xmit_buf(eth, e);
 		list_add_tail(&e->list, &q->tx_list);
 
 		WRITE_ONCE(desc->msg0, 0);
@@ -1177,12 +1194,10 @@ static void airoha_qdma_tx_cleanup(struct airoha_qdma *qdma)
 			struct airoha_qdma_desc *desc = &q->desc[j];
 			struct sk_buff *skb = e->skb;
 
-			if (!e->dma_addr)
+			if (e->dma_type == AIROHA_DMA_UNMAPPED)
 				continue;
 
-			dma_unmap_single(qdma->eth->dev, e->dma_addr,
-					 e->dma_len, DMA_TO_DEVICE);
-			e->dma_addr = 0;
+			airoha_unmap_xmit_buf(qdma->eth, e);
 			list_add_tail(&e->list, &q->tx_list);
 
 			WRITE_ONCE(desc->ctrl, 0);
@@ -2193,8 +2208,8 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
 	struct netdev_queue *txq;
 	struct airoha_queue *q;
 	LIST_HEAD(tx_list);
+	dma_addr_t addr;
 	int i = 0, qid;
-	void *data;
 	u16 index;
 	u8 fport;
 
@@ -2250,24 +2265,22 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
 		return NETDEV_TX_BUSY;
 	}
 
-	len = skb_headlen(skb);
-	data = skb->data;
-
 	e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
 			     list);
+	len = skb_headlen(skb);
+	addr = dma_map_single(netdev->dev.parent, skb->data, len,
+			      DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
+		goto error_unlock;
+
+	e->dma_type = AIROHA_DMA_MAP_SINGLE;
 	index = e - q->entry;
 
 	while (true) {
 		struct airoha_qdma_desc *desc = &q->desc[index];
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-		dma_addr_t addr;
 		u32 val;
 
-		addr = dma_map_single(netdev->dev.parent, data, len,
-				      DMA_TO_DEVICE);
-		if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
-			goto error_unmap;
-
 		list_move_tail(&e->list, &tx_list);
 		e->skb = i == nr_frags - 1 ? skb : NULL;
 		e->dma_addr = addr;
@@ -2291,8 +2304,13 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
 		if (++i == nr_frags)
 			break;
 
-		data = skb_frag_address(frag);
 		len = skb_frag_size(frag);
+		addr = skb_frag_dma_map(netdev->dev.parent, frag, 0, len,
+					DMA_TO_DEVICE);
+		if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
+			goto error_unmap;
+
+		e->dma_type = AIROHA_DMA_MAP_PAGE;
 	}
 	q->queued += i;
 
@@ -2313,11 +2331,8 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
 	return NETDEV_TX_OK;
 
 error_unmap:
-	list_for_each_entry(e, &tx_list, list) {
-		dma_unmap_single(netdev->dev.parent, e->dma_addr, e->dma_len,
-				 DMA_TO_DEVICE);
-		e->dma_addr = 0;
-	}
+	list_for_each_entry(e, &tx_list, list)
+		airoha_unmap_xmit_buf(dev->eth, e);
 	list_splice(&tx_list, &q->tx_list);
 error_unlock:
 	spin_unlock_bh(&q->lock);
diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h
index d7ff8c5200e2..2765244d937c 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.h
+++ b/drivers/net/ethernet/airoha/airoha_eth.h
@@ -170,12 +170,19 @@ enum trtcm_param {
 #define TRTCM_TOKEN_RATE_MASK			GENMASK(23, 6)
 #define TRTCM_TOKEN_RATE_FRACTION_MASK		GENMASK(5, 0)
 
+enum airoha_dma_map_type {
+	AIROHA_DMA_UNMAPPED,
+	AIROHA_DMA_MAP_SINGLE,
+	AIROHA_DMA_MAP_PAGE,
+};
+
 struct airoha_queue_entry {
 	union {
 		void *buf;
 		struct {
 			struct list_head list;
 			struct sk_buff *skb;
+			enum airoha_dma_map_type dma_type;
 		};
 	};
 	dma_addr_t dma_addr;

---
base-commit: 232c4ca2343d1181cbfc061f9856d9591e397579
change-id: 20260625-airoha-eth-skb_frag_dma_map-bcccd5d6e4b1

Best regards,
-- 
Lorenzo Bianconi <lorenzo@kernel.org>


^ permalink raw reply related

* RE: [PATCH net] net: libwx: fix VMDQ mask for 1-queue mode
From: Jiawen Wu @ 2026-06-25  9:44 UTC (permalink / raw)
  To: 'Larysa Zaremba'
  Cc: netdev, 'Mengyuan Lou', 'Andrew Lunn',
	'David S. Miller', 'Eric Dumazet',
	'Jakub Kicinski', 'Paolo Abeni',
	'Simon Horman', 'Kees Cook'
In-Reply-To: <ajz3QK96wKoLD4n4@soc-5CG4396X81.clients.intel.com>

On Thu, Jun 25, 2026 5:39 PM, Larysa Zaremba wrote:
> On Thu, Jun 25, 2026 at 05:08:51PM +0800, Jiawen Wu wrote:
> > In wx_set_vmdq_queues(), the VMDQ mask was not set for the devices not
> > support WX_FLAG_MULTI_64_FUNC, i.e., NGBE devices. A mask of 0 causes
> > __ALIGN_MASK(1, ~vmdq->mask) to return 0, which incorrectly sets
> > q_per_pool to 0 in wx_write_qde().
> >
> > Fix the VMDQ 1-queue mask to 0x7F then ensures that __ALIGN_MASK(1,
> > 0x7F) correctly evaluates to 1.
> 
> __ALIGN_MASK(1, 0x7F) evaulates to 0x80 (128), not to 1. __ALIGN_MASK(1, 0x7E)
> evaluates to 1. Maybe you need 0x7D for 2 queues and 0x7E for 1 queue?

Sorry, the commit log is so wrong for that '~' is missing...
I want to describe that __ALIGN_MASK(1, ~0x7F) evaluates to 1.

> 
> >
> > Fixes: c52d4b898901 ("net: libwx: Redesign flow when sriov is enabled")
> > Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
> > ---
> >  drivers/net/ethernet/wangxun/libwx/wx_lib.c  | 1 +
> >  drivers/net/ethernet/wangxun/libwx/wx_type.h | 1 +
> >  2 files changed, 2 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > index d042567b8128..814d88d2aee4 100644
> > --- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > +++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > @@ -1802,6 +1802,7 @@ static bool wx_set_vmdq_queues(struct wx *wx)
> >  			rss_i = 4;
> >  		}
> >  	} else {
> > +		vmdq_m = WX_VMDQ_1Q_MASK;
> >  		/* double check we are limited to maximum pools */
> >  		vmdq_i = min_t(u16, 8, vmdq_i);
> >
> > diff --git a/drivers/net/ethernet/wangxun/libwx/wx_type.h b/drivers/net/ethernet/wangxun/libwx/wx_type.h
> > index c7befe4cdfe9..65e3e55db1cf 100644
> > --- a/drivers/net/ethernet/wangxun/libwx/wx_type.h
> > +++ b/drivers/net/ethernet/wangxun/libwx/wx_type.h
> > @@ -486,6 +486,7 @@ enum WX_MSCA_CMD_value {
> >
> >  #define WX_VMDQ_4Q_MASK              0x7C
> >  #define WX_VMDQ_2Q_MASK              0x7E
> > +#define WX_VMDQ_1Q_MASK              0x7F
> >
> >  /****************** Manageablility Host Interface defines ********************/
> >  #define WX_HI_MAX_BLOCK_BYTE_LENGTH  256 /* Num of bytes in range */
> > --
> > 2.51.0
> >
> 


^ permalink raw reply

* [PATCH 2/2] net/sched: sch_multiq: Replace direct dequeue call with peek and qdisc_dequeue_peeked
From: Bryam Vargas via B4 Relay @ 2026-06-25  9:51 UTC (permalink / raw)
  To: Vinicius Costa Gomes, Paolo Abeni, Jamal Hadi Salim, Jiri Pirko,
	Jakub Kicinski, David S. Miller, Eric Dumazet
  Cc: Simon Horman, netdev, Jarek Poplawski, Vladimir Oltean,
	linux-kernel
In-Reply-To: <20260625-b4-disp-31bcb279-v1-0-85c40b83c529@proton.me>

From: Bryam Vargas <hexlabsecurity@proton.me>

multiq_dequeue() takes a packet from a band's child with a direct
->dequeue() call after multiq_peek() peeked it. When the child is
non-work-conserving the peek stashes the skb in the child's gso_skb, so
the direct dequeue returns a different skb and orphans the stash,
desyncing the child's qlen/backlog. With a qfq child reached through a
peeking parent (e.g. tbf) this re-enters the child on an emptied list and
dereferences NULL, panicking the kernel from softirq on ordinary egress.

Take the packet through qdisc_dequeue_peeked(), as sch_prio already does
and as sch_red and sch_sfb were just fixed to do. The helper is a no-op
when the child has no stash, so a work-conserving child is unaffected.

Fixes: 77be155cba4e ("pkt_sched: Add peek emulation for non-work-conserving qdiscs.")
Cc: stable@vger.kernel.org
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
---
 net/sched/sch_multiq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_multiq.c b/net/sched/sch_multiq.c
index 4e465d11e3d7..a467dd122369 100644
--- a/net/sched/sch_multiq.c
+++ b/net/sched/sch_multiq.c
@@ -103,7 +103,7 @@ static struct sk_buff *multiq_dequeue(struct Qdisc *sch)
 		if (!netif_xmit_stopped(
 		    netdev_get_tx_queue(qdisc_dev(sch), q->curband))) {
 			qdisc = q->queues[q->curband];
-			skb = qdisc->dequeue(qdisc);
+			skb = qdisc_dequeue_peeked(qdisc);
 			if (skb) {
 				qdisc_bstats_update(sch, skb);
 				qdisc_qlen_dec(sch);

-- 
2.43.0



^ permalink raw reply related

* [PATCH 1/2] net/sched: sch_taprio: Replace direct dequeue call with peek and qdisc_dequeue_peeked
From: Bryam Vargas via B4 Relay @ 2026-06-25  9:51 UTC (permalink / raw)
  To: Vinicius Costa Gomes, Paolo Abeni, Jamal Hadi Salim, Jiri Pirko,
	Jakub Kicinski, David S. Miller, Eric Dumazet
  Cc: Simon Horman, netdev, Jarek Poplawski, Vladimir Oltean,
	linux-kernel
In-Reply-To: <20260625-b4-disp-31bcb279-v1-0-85c40b83c529@proton.me>

From: Bryam Vargas <hexlabsecurity@proton.me>

When taprio's software path peeks a non-work-conserving child qdisc, the
child stashes the peeked skb in its gso_skb; taprio_dequeue_from_txq()
then takes the packet with a direct child ->dequeue() call, which ignores
that stash, orphans the peeked skb and desyncs the child's qlen/backlog.
With a qfq child this re-enters the child on an emptied list and
dereferences NULL, panicking the kernel from softirq on ordinary egress.

Take the packet through qdisc_dequeue_peeked(), as sch_red and sch_sfb
now do. The helper returns the child's stashed skb first and is a no-op
when there is none, so a work-conserving child is unaffected and the
gated path now consumes the skb whose length was charged to the budget.

Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Cc: stable@vger.kernel.org
Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
---
 net/sched/sch_taprio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 558987d9b977..299234a5f0fe 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -749,7 +749,7 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
 		return NULL;

 skip_peek_checks:
-	skb = child->ops->dequeue(child);
+	skb = qdisc_dequeue_peeked(child);
 	if (unlikely(!skb))
 		return NULL;

-- 
2.43.0

^ permalink raw reply related

* [PATCH 0/2] net/sched: finish the qdisc_dequeue_peeked conversion (taprio, multiq)
From: Bryam Vargas via B4 Relay @ 2026-06-25  9:51 UTC (permalink / raw)
  To: Vinicius Costa Gomes, Paolo Abeni, Jamal Hadi Salim, Jiri Pirko,
	Jakub Kicinski, David S. Miller, Eric Dumazet
  Cc: Simon Horman, netdev, Jarek Poplawski, Vladimir Oltean,
	linux-kernel

Commit 77be155cba4e added peek emulation: a non-work-conserving qdisc's
->peek dequeues one skb and stashes it in the child's gso_skb. A parent
that peeks such a child must then take the packet with
qdisc_dequeue_peeked(), not a direct ->dequeue(), or the stashed skb is
bypassed and the child's qlen/backlog desync. sch_red and sch_sfb were
just fixed for this; taprio and multiq still take the direct path.

With a qfq child the desync re-enters qfq_dequeue on an emptied aggregate
list and dereferences NULL, panicking from softirq on ordinary egress.
taprio reaches it on its own (root-only software path, all gates open);
multiq reaches it when a peeking parent such as tbf wraps it over a
non-work-conserving grandchild. Both need only CAP_NET_ADMIN.

Confirmed under KASAN: the unpatched arm panics, the patched arm is
clean, and a work-conserving-child control is clean. The reproducers and
splats for both are below; the per-patch changes are one line each.

taprio reproducer (self-triggering, no parent qdisc needed):

  ip link add dummy0 numtxqueues 4 type dummy; ip link set dummy0 up
  ip addr add 10.10.11.10/24 dev dummy0
  tc qdisc add dev dummy0 root handle 1: taprio num_tc 2 \
     map 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@0 1@1 \
     base-time 9000000000000000000 sched-entry S 03 200000 flags 0x0 clockid CLOCK_TAI
  tc qdisc replace dev dummy0 parent 1:1 handle 3: qfq
  tc class  add dev dummy0 classid 3:1 parent 3: qfq maxpkt 512 weight 1
  tc filter add dev dummy0 parent 3: protocol ip prio 1 matchall classid 3:1
  ping -c1 10.10.11.99 -I dummy0

[  903.769174] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] SMP KASAN NOPTI
[  903.769953] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[  903.770456] CPU: 7 UID: 0 PID: 16162 Comm: ping Not tainted 7.1.0-rc5 #1 PREEMPT(lazy)
[  903.771725] RIP: 0010:qfq_dequeue+0x362/0x1580 [sch_qfq]
[  903.777452] Call Trace:
[  903.778311]  taprio_dequeue_from_txq+0x383/0x680 [sch_taprio]
[  903.778685]  taprio_dequeue_tc_priority+0x19a/0x330 [sch_taprio]
[  903.779645]  taprio_dequeue+0xa6/0x330 [sch_taprio]
[  903.780299]  __qdisc_run+0x16c/0x1890
[  903.780854]  __dev_queue_xmit+0x1ece/0x3390
[  903.784109]  ip_finish_output2+0x571/0x1da0
[  903.785996]  ip_output+0x26c/0x4d0
[  903.789572]  ping_v4_sendmsg+0xd22/0x12b0
[  903.796118]  __x64_sys_sendto+0xe0/0x1c0
[  903.796612]  do_syscall_64+0xee/0x590
[  903.818669] Kernel panic - not syncing: Fatal exception in interrupt

multiq reproducer (needs a peeking parent over a stashing child; tbf
values chosen to force it to throttle):

  ip link add dummy0 numtxqueues 2 type dummy; ip link set dummy0 up
  ip addr add 10.10.11.10/24 dev dummy0
  tc qdisc add dev dummy0 root handle 1: tbf rate 88bit burst 1661b \
     peakrate 2257333 minburst 1024 limit 7b
  tc qdisc add dev dummy0 parent 1: handle 2: multiq
  for b in 1 2; do                          # qfq on every band
    tc qdisc  add dev dummy0 parent 2:$b handle 3$b: qfq
    tc class  add dev dummy0 classid 3$b:1 parent 3$b: qfq maxpkt 512 weight 1
    tc filter add dev dummy0 parent 3$b: protocol ip prio 1 matchall classid 3$b:1
  done
  ping -c12 10.10.11.99 -I dummy0

[ 1066.385097] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] SMP KASAN NOPTI
[ 1066.386385] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[ 1066.387227] CPU: 1 UID: 0 PID: 5357 Comm: ping Not tainted 7.1.0-rc5 #1 PREEMPT(lazy)
[ 1066.389183] RIP: 0010:qfq_dequeue+0x362/0x1580 [sch_qfq]
[ 1066.396316] Call Trace:
[ 1066.396768]  multiq_dequeue+0x163/0x360 [sch_multiq]
[ 1066.397885]  tbf_dequeue+0x6b9/0xf17 [sch_tbf]
[ 1066.398269]  __qdisc_run+0x16c/0x1890
[ 1066.399315]  __dev_queue_xmit+0x1ece/0x3390
[ 1066.403276]  ip_finish_output2+0x571/0x1da0
[ 1066.404818]  ip_output+0x26c/0x4d0
[ 1066.408620]  ping_v4_sendmsg+0xd22/0x12b0
[ 1066.415264]  __x64_sys_sendto+0xe0/0x1c0
[ 1066.416251]  do_syscall_64+0xee/0x590
[ 1066.441210] Kernel panic - not syncing: Fatal exception in interrupt

---
Bryam Vargas (2):
      net/sched: sch_taprio: Replace direct dequeue call with peek and qdisc_dequeue_peeked
      net/sched: sch_multiq: Replace direct dequeue call with peek and qdisc_dequeue_peeked

 net/sched/sch_multiq.c | 2 +-
 net/sched/sch_taprio.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
---
base-commit: 02f144fbb4c86c360495d33debe307cb46a57f95
change-id: 20260625-b4-disp-31bcb279-082e59a3aa36

Best regards,
-- 
Bryam Vargas <hexlabsecurity@proton.me>



^ permalink raw reply

* Re: [PATCH net v2] tipc: fix out-of-bounds read in broadcast Gap ACK blocks
From: Sam P @ 2026-06-25 10:00 UTC (permalink / raw)
  To: Tung Quang Nguyen
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev@vger.kernel.org,
	tipc-discussion@lists.sourceforge.net,
	linux-kernel@vger.kernel.org, Jon Maloy
In-Reply-To: <GV1P189MB19881AD711511094760906DCC6EC2@GV1P189MB1988.EURP189.PROD.OUTLOOK.COM>

On 25/06/2026 10:23, Tung Quang Nguyen wrote:
>> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index
>> 76a1585d3f6b..08637c3c9db0 100644
>> --- a/net/tipc/bcast.c
>> +++ b/net/tipc/bcast.c
>> @@ -497,11 +497,12 @@ void tipc_bcast_ack_rcv(struct net *net, struct
>> tipc_link *l,
>>   */
>> int tipc_bcast_sync_rcv(struct net *net, struct tipc_link *l,
>> 			struct tipc_msg *hdr,
>> -			struct sk_buff_head *retrq)
>> +			struct sk_buff_head *retrq, bool *valid)
>> {
>> 	struct sk_buff_head *inputq = &tipc_bc_base(net)->inputq;
>> 	struct tipc_gap_ack_blks *ga;
>> 	struct sk_buff_head xmitq;
>> +	u16 glen;
> 
> Move this variable declaration to the bottom to follow reverse xmas tree style.
> 
>> 	int rc = 0;
>>
>> 	__skb_queue_head_init(&xmitq);
>> @@ -510,13 +511,18 @@ int tipc_bcast_sync_rcv(struct net *net, struct
>> tipc_link *l,
>> 	if (msg_type(hdr) != STATE_MSG) {
>> 		tipc_link_bc_init_rcv(l, hdr);
>> 	} else if (!msg_bc_ack_invalid(hdr)) {
>> -		tipc_get_gap_ack_blks(&ga, l, hdr, false);
>> -		if (!sysctl_tipc_bc_retruni)
>> -			retrq = &xmitq;
>> -		rc = tipc_link_bc_ack_rcv(l, msg_bcast_ack(hdr),
>> -					  msg_bc_gap(hdr), ga, &xmitq,
>> -					  retrq);
>> -		rc |= tipc_link_bc_sync_rcv(l, hdr, &xmitq);
>> +		glen = tipc_get_gap_ack_blks(&ga, l, hdr, false);
>> +		if (glen > msg_data_sz(hdr)) {
>> +			/* Malformed Gap ACK blocks; caller drops the msg */
>> +			*valid = false;
>> +		} else {
>> +			if (!sysctl_tipc_bc_retruni)
>> +				retrq = &xmitq;
>> +			rc = tipc_link_bc_ack_rcv(l, msg_bcast_ack(hdr),
>> +						  msg_bc_gap(hdr), ga, &xmitq,
>> +						  retrq);
>> +			rc |= tipc_link_bc_sync_rcv(l, hdr, &xmitq);
>> +		}
>> 	}
>> 	tipc_bcast_unlock(net);
>>
>> diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index
>> 2d9352dc7b0e..55d17b5413e1 100644
>> --- a/net/tipc/bcast.h
>> +++ b/net/tipc/bcast.h
>> @@ -97,7 +97,7 @@ void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l,
>> 			struct tipc_msg *hdr);
>> int tipc_bcast_sync_rcv(struct net *net, struct tipc_link *l,
>> 			struct tipc_msg *hdr,
>> -			struct sk_buff_head *retrq);
>> +			struct sk_buff_head *retrq, bool *valid);
>> int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg,
>> 			struct tipc_link *bcl);
>> int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]); diff --git
>> a/net/tipc/node.c b/net/tipc/node.c index 97aa970a0d83..2887f94ee28f
>> 100644
>> --- a/net/tipc/node.c
>> +++ b/net/tipc/node.c
>> @@ -1831,12 +1831,13 @@ static void tipc_node_mcast_rcv(struct tipc_node
>> *n)  }
>>
>> static void tipc_node_bc_sync_rcv(struct tipc_node *n, struct tipc_msg *hdr,
>> -				  int bearer_id, struct sk_buff_head *xmitq)
>> +				  int bearer_id, struct sk_buff_head *xmitq,
>> +				  bool *valid)
>> {
>> 	struct tipc_link *ucl;
>> 	int rc;
>>
>> -	rc = tipc_bcast_sync_rcv(n->net, n->bc_entry.link, hdr, xmitq);
>> +	rc = tipc_bcast_sync_rcv(n->net, n->bc_entry.link, hdr, xmitq, valid);
> 
> 'valid' needs to be checked after this call. Then, return immediately if it is false.
> 
>>
>> 	if (rc & TIPC_LINK_DOWN_EVT) {
>> 		tipc_node_reset_links(n);
>> @@ -2140,12 +2141,18 @@ void tipc_rcv(struct net *net, struct sk_buff *skb,
>> struct tipc_bearer *b)
>>
>> 	/* Ensure broadcast reception is in synch with peer's send state */
>> 	if (unlikely(usr == LINK_PROTOCOL)) {
>> +		bool valid = true;
>> +
>> 		if (unlikely(skb_linearize(skb))) {
>> 			tipc_node_put(n);
>> 			goto discard;
>> 		}
>> 		hdr = buf_msg(skb);
>> -		tipc_node_bc_sync_rcv(n, hdr, bearer_id, &xmitq);
>> +		tipc_node_bc_sync_rcv(n, hdr, bearer_id, &xmitq, &valid);
>> +		if (!valid) {
>> +			tipc_node_put(n);
>> +			goto discard;
>> +		}
>> 	} else if (unlikely(tipc_link_acked(n->bc_entry.link) != bc_ack)) {
>> 		tipc_bcast_ack_rcv(net, n->bc_entry.link, hdr);
>> 	}
>>
>> base-commit: a986fde914d88af47eb78fd29c5d1af7952c3500
>> --
>> 2.54.0
> 

Thanks for the review, I'll address this in a v3.

^ permalink raw reply

* Re: [PATCH net] nfc: nci: fix uninit-value in the RF discover/activated NTF handlers
From: Sam P @ 2026-06-25 10:04 UTC (permalink / raw)
  To: Simon Horman
  Cc: david, davem, edumazet, kuba, pabeni, oe-linux-nfc, netdev,
	linux-kernel, stable
In-Reply-To: <20260625091828.1170615-2-horms@kernel.org>

On 25/06/2026 10:18, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://netdev-ai.bots.linux.dev/sashiko/
> ---
> nfc: nci: fix uninit-value in the RF discover/activated NTF handlers
> 
> This zero-initialises the on-stack nci_rf_discover_ntf and
> nci_rf_intf_activated_ntf structs in the RF discover and interface
> activated NTF handlers so that the technology- and activation-specific
> unions read back as zero when the corresponding length fields are zero.
> 
> Link: https://lore.kernel.org/netdev/20260623172109.1105965-2-horms@kernel.org/
> 
>>   - activated: with rf_tech_specific_params_len == 0 the same union is read
>>     via nci_target_auto_activated(); with activation_params_len == 0 the
>>     activation_params union is read by nci_store_ats_nfc_iso_dep() into
>>     ndev->target_ats.
> 
> [Low]
> Does the third scenario in this paragraph match the code?
> 
> Looking at nci_store_ats_nfc_iso_dep():
> 
>      static int nci_store_ats_nfc_iso_dep(struct nci_dev *ndev,
>                           const struct nci_rf_intf_activated_ntf *ntf)
>      {
>          ndev->target_ats_len = 0;
> 
>          if (ntf->activation_params_len <= 0)
>              return NCI_STATUS_OK;
> 
>          if (ntf->activation_params.nfca_poll_iso_dep.rats_res_len > NFC_ATS_MAXSIZE) {
>          ...
> 
> The function returns NCI_STATUS_OK before any access to the
> activation_params union when activation_params_len is 0, so the union
> is not read in that case.
> 
> The sibling nci_store_general_bytes_nfc_dep() has the same early return
> on activation_params_len <= 0.
> 
> The two rf_tech_specific_params_len == 0 cases listed earlier in the
> commit message do match the code (nci_add_new_protocol() and
> nci_target_auto_activated() unconditionally pass
> &ntf.rf_tech_specific_params even when rf_tech_specific_params_len is
> 0), and the KMSAN trace quoted in the commit message only mentions the
> nci_add_new_protocol() path.
> 
> The ntf = {} change still looks correct and necessary for the
> rf_tech_specific_params union. Could the third bullet be dropped or
> adjusted so the commit message reflects only the paths that actually
> read uninitialised bytes?

Thanks for sharing the review.

This is valid feedback, I'll fix up the commit message in a v2.

^ permalink raw reply

* Re: [PATCH v2] net: mdio: airoha: fix reset control leak in error path
From: Larysa Zaremba @ 2026-06-25 10:10 UTC (permalink / raw)
  To: Wentao Liang
  Cc: Andrew Lunn, Heiner Kallweit, Russell King, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20260622115403.39772-1-vulab@iscas.ac.cn>


Please, specify the target tree (probably, net) via a subject prefix.
Also, looks like you are missing to many CCs [0]

[0] 
https://netdev-ctrl.bots.linux.dev/logs/build/1114710/14639274/cc_maintainers/summary

On Mon, Jun 22, 2026 at 07:54:03PM +0800, Wentao Liang wrote:
> In airoha_mdio_probe(), after calling reset_control_deassert(),
> if clk_set_rate() fails, the function returns immediately without
> calling reset_control_assert(). This leaves the reset line
> deasserted and causes a reference count leak on shared reset
> controllers.
>

Sashiko correctly points out that since the reset controller is exclusive, 
there is no refcount leak. [1]

So the problem is missing rstc->rcdev->ops->assert(rstc->rcdev, rstc->id) call. 
It would be great if you could describe, what problem this can cause.

> Fix this by reorganizing the error handling to use a goto label,
> ensuring reset_control_assert() is called on all error paths
> before returning.
> 
> Also add error checking for reset_control_deassert().

Sashiko correctly points out you do not actually do this, which is fine, just 
update the commit message. [1]

[1] https://sashiko.dev/#/patchset/20260622115403.39772-1-vulab%40iscas.ac.cn

> Fixes: 67e3ba978361 ("net: mdio: Add MDIO bus controller for Airoha AN7583")
> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
> ---
>  drivers/net/mdio/mdio-airoha.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/mdio/mdio-airoha.c b/drivers/net/mdio/mdio-airoha.c
> index 52e7475121ea..4c1b2415687c 100644
> --- a/drivers/net/mdio/mdio-airoha.c
> +++ b/drivers/net/mdio/mdio-airoha.c
> @@ -246,15 +246,17 @@ static int airoha_mdio_probe(struct platform_device *pdev)
>  
>  	ret = clk_set_rate(priv->clk, freq);
>  	if (ret)
> -		return ret;
> +		goto err_reset_assert;
>  
>  	ret = devm_of_mdiobus_register(dev, bus, dev->of_node);
> -	if (ret) {
> -		reset_control_assert(priv->reset);
> -		return ret;
> -	}
> +	if (ret)
> +		goto err_reset_assert;
>  
>  	return 0;
> +
> +err_reset_assert:
> +	reset_control_assert(priv->reset);
> +	return ret;
>  }
>  
>  static const struct of_device_id airoha_mdio_dt_ids[] = {
> -- 
> 2.39.5 (Apple Git-154)
> 
> 

^ permalink raw reply

* Re: [PATCH v2 0/7] vmsplice: fix some problems in my previous vmsplice patchset
From: Askar Safin @ 2026-06-25 10:11 UTC (permalink / raw)
  To: david
  Cc: akpm, avagin, axboe, brauner, collin.funk1, david.laight.linux,
	dhowells, fuse-devel, hch, jack, joannelkoong, kernel, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, luto, metze, miklos,
	netdev, patches, pfalcato, safinaskar, torvalds, val, viro, w,
	willy
In-Reply-To: <89ea76b3-e956-4232-8180-ee3929adf905@kernel.org>

"David Hildenbrand (Arm)" <david@kernel.org>:
> I think we concluded that we cannot rip out vmsplice that way at this point, and
> I suspect that Christian will drop that topic branch from -next after -rc1.

I think my patches still have a chance.

On fuse regression: I return EINVAL for particular combination of
flags used by fuse. This causes fuse to fail-back to non-vmsplice
code path. I did Debian code search, and I found none significant
packages, which use same combination of options.

So I think I was able to deal with fuse regression.

On CRIU named fifo "Not supported" regression: it is handled.

On CRIU major performance regression: it is NOT handled. But I still
think my approach is right. (See cover letter for details.)

(I wrote about all these in cover letter for this v2 patchset.)

So all regressions found so far (except for CRIU major performance
regression) are handled.

Other option is to introduce some deprecation period (as
suggested by Andrei Vagin). I can do this, if needed.

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH net] net: libwx: fix VMDQ mask for 1-queue mode
From: Larysa Zaremba @ 2026-06-25 10:14 UTC (permalink / raw)
  To: Jiawen Wu
  Cc: netdev, 'Mengyuan Lou', 'Andrew Lunn',
	'David S. Miller', 'Eric Dumazet',
	'Jakub Kicinski', 'Paolo Abeni',
	'Simon Horman', 'Kees Cook'
In-Reply-To: <062401dd0487$3a3152f0$ae93f8d0$@trustnetic.com>

On Thu, Jun 25, 2026 at 05:44:33PM +0800, Jiawen Wu wrote:
> On Thu, Jun 25, 2026 5:39 PM, Larysa Zaremba wrote:
> > On Thu, Jun 25, 2026 at 05:08:51PM +0800, Jiawen Wu wrote:
> > > In wx_set_vmdq_queues(), the VMDQ mask was not set for the devices not
> > > support WX_FLAG_MULTI_64_FUNC, i.e., NGBE devices. A mask of 0 causes
> > > __ALIGN_MASK(1, ~vmdq->mask) to return 0, which incorrectly sets
> > > q_per_pool to 0 in wx_write_qde().
> > >
> > > Fix the VMDQ 1-queue mask to 0x7F then ensures that __ALIGN_MASK(1,
> > > 0x7F) correctly evaluates to 1.
> > 
> > __ALIGN_MASK(1, 0x7F) evaulates to 0x80 (128), not to 1. __ALIGN_MASK(1, 0x7E)
> > evaluates to 1. Maybe you need 0x7D for 2 queues and 0x7E for 1 queue?
> 
> Sorry, the commit log is so wrong for that '~' is missing...
> I want to describe that __ALIGN_MASK(1, ~0x7F) evaluates to 1.
>

Then I do not have any further concerns. Given you fix the lack of "~" in the 
commit message and change "not support" to "not supporting" above, I approve 
this patch.

Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
 
> > 
> > >
> > > Fixes: c52d4b898901 ("net: libwx: Redesign flow when sriov is enabled")
> > > Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
> > > ---
> > >  drivers/net/ethernet/wangxun/libwx/wx_lib.c  | 1 +
> > >  drivers/net/ethernet/wangxun/libwx/wx_type.h | 1 +
> > >  2 files changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > > index d042567b8128..814d88d2aee4 100644
> > > --- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > > +++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > > @@ -1802,6 +1802,7 @@ static bool wx_set_vmdq_queues(struct wx *wx)
> > >  			rss_i = 4;
> > >  		}
> > >  	} else {
> > > +		vmdq_m = WX_VMDQ_1Q_MASK;
> > >  		/* double check we are limited to maximum pools */
> > >  		vmdq_i = min_t(u16, 8, vmdq_i);
> > >
> > > diff --git a/drivers/net/ethernet/wangxun/libwx/wx_type.h b/drivers/net/ethernet/wangxun/libwx/wx_type.h
> > > index c7befe4cdfe9..65e3e55db1cf 100644
> > > --- a/drivers/net/ethernet/wangxun/libwx/wx_type.h
> > > +++ b/drivers/net/ethernet/wangxun/libwx/wx_type.h
> > > @@ -486,6 +486,7 @@ enum WX_MSCA_CMD_value {
> > >
> > >  #define WX_VMDQ_4Q_MASK              0x7C
> > >  #define WX_VMDQ_2Q_MASK              0x7E
> > > +#define WX_VMDQ_1Q_MASK              0x7F
> > >
> > >  /****************** Manageablility Host Interface defines ********************/
> > >  #define WX_HI_MAX_BLOCK_BYTE_LENGTH  256 /* Num of bytes in range */
> > > --
> > > 2.51.0
> > >
> > 
> 

^ permalink raw reply

* Re: [PATCH iwl v3] ice: retry reading NVM if admin queue returns EBUSY
From: Robert Malz @ 2026-06-25 10:14 UTC (permalink / raw)
  To: Przemek Kitszel
  Cc: Simon Horman, Grzegorz Nitka, anthony.l.nguyen, intel-wired-lan,
	netdev
In-Reply-To: <5658849b-0425-4132-ba32-5801e2907c60@intel.com>

Hey Przemek,
Thanks a lot for the feedback.
I was sure that we use ICE_NVM_TIMEOUT (180s) as a timeout every time
(ice_acquire_nvm) but your proposal made me rethink it a little.
First of all, the datasheet for E810 specifies the timeout as: "As an
input, the software might specify timeout longer than the default
taken for this resource, and up to one minute."
180s is greater than one minute so I took a look into AQC logs:
[  110.698471] ice 0000:05:00.0: CQ CMD: opcode 0x0008, flags 0x2000,
datalen 0x0000, retval 0x0000
[  110.698474] ice 0000:05:00.0:        cookie (h,l) 0x00000000 0x00000000
[  110.698477] ice 0000:05:00.0:        param (0,1)  0x00010001 0x0002BF20
[  110.698480] ice 0000:05:00.0:        addr (h,l)   0x00000000 0x00000000
[  110.698645] ice 0000:05:00.0: ATQ: desc and buffer writeback:
[  110.698648] ice 0000:05:00.0: CQ CMD: opcode 0x0008, flags 0x2003,
datalen 0x0000, retval 0x0000
[  110.698651] ice 0000:05:00.0:        cookie (h,l) 0x00000000 0x00000000
[  110.698654] ice 0000:05:00.0:        param (0,1)  0x00010001 0x00000BB8
[  110.698657] ice 0000:05:00.0:        addr (h,l)   0x00000000 0x00000000
Based on the above, the driver requested a 0x0002BF20 timeout (180 000
ms) but the FW returned only 0x00000BB8 (3s).
I'm assuming this is expected behavior since the maximum timeout for
NVM read should be 60,000 ms.

If changing the timeout requested by the driver to 60s for read ops is
handled correctly by the FW and the FW respects that lock, the retry
patch submitted in this email thread might not be required at all.
Let me quickly prepare a new patch and test it. I'll update this
thread once I have a results.

Regards,
Robert


On Thu, Jun 25, 2026 at 9:51 AM Przemek Kitszel
<przemyslaw.kitszel@intel.com> wrote:
>
> >> This is an AI-generated review of your patch. The human sending this
>
>
> >> [Severity: Low]
> >> Does this check allow an extra retry execution? Since retry_cnt starts at 0,
> >> using a strictly greater-than check allows the loop to retry 4 times
> >> (5 total executions) instead of the 3 executions intended by
> >> ICE_SQ_SEND_MAX_EXECUTE.
> >>
> >>> +                             break;
> >>> +                     ice_debug(hw, ICE_DBG_NVM,
> >>> +                               "NVM read EBUSY error, retry %d\n",
> >>> +                               retry_cnt + 1);
> >>> +                     last_cmd = false;
> >>> +                     ice_release_nvm(hw);
> >>> +                     msleep(ICE_SQ_SEND_DELAY_TIME_MS);
> >>
> > The review is correct that this allows one extra execution. However, I
> > intentionally kept this verbatim from the OOT driver (1.15.4) to
>
> This OOT change was made entirely by Grzegorz, for E825C. Would be great
> to acknowledge his authorship (+ SoB ofc, below).
> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
>
> What was the platform that you have encountered the bug on?
>
> > retain the significant test exposure it has there. The absolute worst
> > case is one harmless extra retry before giving up. If you strongly
> > prefer strict adherence to ICE_SQ_SEND_MAX_EXECUTE, I can update the
> > condition in a v4.
>
> Please update (to have one attempt less).
>
> >
> >> [Severity: High]
> >> Does dropping the NVM lock here violate the caller's assumption of mutual
> >> exclusion?
> >>
> >> Callers of ice_read_flat_nvm() typically acquire the NVM lock beforehand to
> >> ensure an atomic read of multiple sectors without interleaved writes. By
> >> explicitly dropping the lock during the msleep() window, another thread, VF,
> >> or PF could acquire the NVM lock and modify the NVM before the read resumes,
> >> potentially resulting in reading torn or inconsistent NVM data.
> >>
> >>> +                     status = ice_acquire_nvm(hw, ICE_RES_READ);
> >>> +                     if (status)
> >>> +                             break;
> >>
> >
> > I don't think this is an issue. The NVM resource (req_res 0x0008) is
> > an ownership lease, not a content snapshot. The FW grants it with a
> > hold timeout (see the timeout field handling in ice_aq_req_res). It
> > never guaranteed a point-in-time view of the flash.
> > Very similar behavior already exists in ice_devlink_nvm_snapshot(),
> > which reads the NVM in chunks via ice_read_flat_nvm() and deliberately
> > acquires/releases the lock between chunks (with a comment explaining
> > the FW reclaim).
> > Additionally, the lock is only dropped on the failure path, where
> > neither bytes_read nor offset have been advanced. The driver simply
> > re-reads the failed chunk after re-acquiring. So no data that is
> > returned to the caller is committed across the drop window.
> > Overall the lock drop does not weaken any guarantee callers actually
> > have. If we wanted to introduce a true snapshot guarantee, that would
> > probably require different approach and separate patch.
> >
> >
> >> [Severity: High]
> >> Can this lead to an unbalanced NVM resource lock release and potential data
> >> corruption?
> >>
> >> If ice_acquire_nvm() fails here, the loop breaks and returns an error with the
> >> lock currently unheld. However, callers like ice_read_flash_module() are
> >> unaware of this internal lock state change and will unconditionally call
> >> ice_release_nvm() upon failure:
> >>
> >> drivers/net/ethernet/intel/ice/ice_nvm.c:ice_read_flash_module() {
> >>          status = ice_acquire_nvm(hw, ICE_RES_READ);
> >>          if (status)
> >>                  return status;
> >>
> >>          status = ice_read_flat_nvm(hw, start + offset, &length, data, false);
> >>
> >>          ice_release_nvm(hw);
> >>      ...
> >>
> >> Because firmware tracks resource locks per-PF rather than per-OS-thread, this
> >> errant second release could forcefully strip the lock from another thread on
> >> the same PF that successfully acquired it, exposing the NVM to concurrent
> >> modification.
> >>
> >
> > Agreed, this might be a real bug, and the one of the three I think is
> > worth investigating.
> > If ice_acquire_nvm() fails after the drop, ice_read_flat_nvm() returns
> > with the lock unheld while callers unconditionally call
> > ice_release_nvm(), so a stray release is issued.
> >
> > On probability, though, the window is very small. Reaching it requires
> > sustained EBUSY across the retry budget plus a failed re-acquire
> > (which itself polls up to ICE_NVM_TIMEOUT), and concurrently another
> > requester taking the lock. Most reads happen during init (ice_probe,
> > and reset/rebuild via ice_init_nvm), and NVM writes only happen on an
> > already initialized driver. The devlink/ethtool nvm_read paths are
> > also exposed, but hitting this race would require precise timing
> > against a concurrent NVM owner on the device.
> >
> > I'd prefer to keep the scope of this patch limited to the EBUSY retry
> > path and not take on the unbalanced-release fix here. A proper fix
> > should change the lock-ownership contract of ice_read_flat_nvm() (on
> > error, the lock must be released by ice_read_flat_nvm(), callers
> > release only on success) and update all callers. Code change sould be
> > simple for all callers but ice_discover_flash_size(), it intentionally
> > holds one lease across a read loop and would need to re-acquire after
> > each expected boundary failure.
> >
> > Given how small the original window is, I'd rather not trade tested
> > OOT behavior for the risk of a complex unbalanced NVM lock fix. I
> > actually have a patch mostly ready that fixes the lock-ownership
> > contract, but I really don't like it. It changes the design of
> > ice_read_flat_nvm(), making it less intuitive for callers. More
> > importantly, I just don't have the resources or test coverage right
> > now to properly verify such change.
> >
> > However, I can modify the failure path for ice_acquire_nvm inside
> > ice_read_flat_nvm. Instead of bailing out immediately, we can just
> > retry it within the existing retry budget. In this case, the
> > probability of leaving ice_read_flat_nvm without holding the lock is
> > reduced even further without needing a refactor.
> >
> > Please let me know what you think about my thought process on this.
>
> I think that both AI-reported issues against the lock are valid
> concerns.
>
> I think that sleep was the actual fix, and re-locking were merely
> a necessity due to their expiration (as you said).
>
> A proper fix would be to just increase lock-timeout to accommodate all
> attempts (and still do the retries&sleep, but without unlocking).
>
>
> >
> >
> >
> >>> +                     retry_cnt++;
> >>> +             } else {
> >>> +                     bytes_read += read_size;
> >>> +                     offset += read_size;
> >>> +                     retry_cnt = 0;
> >>> +             }
> >>>        } while (!last_cmd);
> >>>
> >>>        *length = bytes_read;
> >
> > Thanks,
> > Robert
>

^ permalink raw reply

* [PATCH net-next] net: neigh: avoid calling neigh_forced_gc on every alloc when table is full
From: Vimal Agrawal @ 2026-06-25 10:20 UTC (permalink / raw)
  To: netdev; +Cc: kuniyu, edumazet, vimal.agrawal, avimalin
In-Reply-To: <CALkUMdSCpx_ywYCx_ePLdm6yioO1nQWx7sSM=AEgsq0kywHxTw@mail.gmail.com>

Once the neighbour table exceeds gc_thresh3, neigh_forced_gc() is called
on every allocation attempt with no rate limiting. In workloads with mostly
active/reachable entries, the GC walk traverses a large portion of the
neighbour table without reclaiming entries, holding tbl->lock for an
extended period. This causes severe lock contention and allocation
latencies exceeding 16ms under sustained neighbour creation.

Add a pre-lock check in neigh_forced_gc() to skip the GC run if one was
performed within the last second, avoiding repeated full table scans and
lock acquisitions on the hot allocation path.

Profiling of neigh_create() shows ~3 orders of magnitude latency
improvement with this change.

Link:https://lore.kernel.org/netdev/CALkUMdSCpx_ywYCx_ePLdm6yioO1nQWx7sSM=AEgsq0kywHxTw@mail.gmail.com/
Signed-off-by: Vimal Agrawal <vimal.agrawal@sophos.com>
---
 net/core/neighbour.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 1349c0eedb64..078842db3c5f 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -260,6 +260,9 @@ static int neigh_forced_gc(struct neigh_table *tbl)
 	int shrunk = 0;
 	int loop = 0;

+	if (!time_after(jiffies, READ_ONCE(tbl->last_flush) + HZ))
+		return 0;
+
 	NEIGH_CACHE_STAT_INC(tbl, forced_gc_runs);

 	spin_lock_bh(&tbl->lock);
-- 
2.17.1

^ permalink raw reply related

* Re: [Intel-wired-lan] [PATCH net] igc: Fix RX HW timestamp reporting when NET_RX_BUSY_POLL is disabled
From: Ding Meng @ 2026-06-25 10:33 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Florian Bezdeka, anthony.l.nguyen, przemyslaw.kitszel,
	andrew+netdev, davem, edumazet, kuba, pabeni, jan.kiszka,
	intel-wired-lan, linux-kernel, netdev, wq.wang
In-Reply-To: <40abd0b5-7f3f-4cd4-9975-9db4498d15d3@molgen.mpg.de>

Dear Paul,

Thanks for your comments.

On Mon, Jun 22, 2026 at 05:59:53PM +0200, Paul Menzel wrote: 
> Am 22.06.26 um 06:13 schrieb Ding Meng via Intel-wired-lan:
> > When CONFIG_NET_RX_BUSY_POLL is deactivated, fetching RX HW timestamps
> > from the NIC no longer works as expected.
> 
> Maybe paste some logs/errors, so it can be easier found by people with the
> same issue.

Will do.
 
> > This occurs because disabling CONFIG_NET_RX_BUSY_POLL disables the
> > SKB NAPI mapping in __skb_mark_napi_id(). Consequently, get_timestamp()
> > fails to perform its driver lookup, and the igc driver's struct
> > net_device_ops::ndo_get_tstamp is never invoked.
> > 
> > Instead, get_timestamp() falls back to use shhwtstamps(skb)->hwtstamp,
> > a field that the driver has not populated.
> > 
> > Fix this by populating the hwtstamp field with the correct timestamp
> > in the default timer when CONFIG_NET_RX_BUSY_POLL is disabled.
> 
> Maybe detail, why the adapter needs to be passed now.
> 
> Also, please describe a test case to check the change.

Will do.

> > diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> > index 8ac16808023..1da8d7aa76d 100644
> > --- a/drivers/net/ethernet/intel/igc/igc_main.c
> > +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> > @@ -1992,7 +1992,26 @@ static struct sk_buff *igc_build_skb(struct igc_ring *rx_ring,
> >   	return skb;
> >   }
> > -static struct sk_buff *igc_construct_skb(struct igc_ring *rx_ring,
> > +static void igc_construct_skb_timestamps(struct igc_adapter *adapter,
> > +					 struct sk_buff *skb,
> > +					 struct igc_xdp_buff *ctx)
> > +{
> > +	if (!ctx->rx_ts)
> > +		return;
> > +#ifdef CONFIG_NET_RX_BUSY_POLL
> 
> Is there a way to do this in C instead of the pre-processor. That way all
> the code gets build tested. (Is there a config with disabled
> NET_RX_BUSY_POLL?)
> 
 
How about defining a function to replace the pre-processor:

static inline bool is_net_rx_busy_poll()
{
#ifdef CONFIG_NET_RX_BUSY_POLL
        return true;
#else
        return false;
#endif
}

CONFIG_PREEMPT_RT=y && CONFIG_NETCONSOLE=y will cause NET_RX_BUSY_POLL
disabled.


Kind regards,
 
Ding Meng

^ permalink raw reply

* Re: [PATCH v2 0/7] vmsplice: fix some problems in my previous vmsplice patchset
From: David Hildenbrand (Arm) @ 2026-06-25 10:35 UTC (permalink / raw)
  To: Askar Safin
  Cc: akpm, avagin, axboe, brauner, collin.funk1, david.laight.linux,
	dhowells, fuse-devel, hch, jack, joannelkoong, kernel, linux-api,
	linux-fsdevel, linux-kernel, linux-mm, luto, metze, miklos,
	netdev, patches, pfalcato, torvalds, val, viro, w, willy
In-Reply-To: <20260625101132.3859505-1-safinaskar@gmail.com>

On 6/25/26 12:11, Askar Safin wrote:
> "David Hildenbrand (Arm)" <david@kernel.org>:
>> I think we concluded that we cannot rip out vmsplice that way at this point, and
>> I suspect that Christian will drop that topic branch from -next after -rc1.
> 
> I think my patches still have a chance.

I talked to Christian and it doesn't sound like it.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH net-next] selftests/xsk: preserve UMEM view in bidi test
From: Maciej Fijalkowski @ 2026-06-25 10:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, bpf, magnus.karlsson, stfomichev, pabeni, horms,
	tushar.vyavahare, kerneljasonxing
In-Reply-To: <20260624193326.295e3711@kernel.org>

On Wed, Jun 24, 2026 at 07:33:26PM -0700, Jakub Kicinski wrote:
> On Tue, 23 Jun 2026 11:10:08 +0200 Maciej Fijalkowski wrote:
> > Subject: [PATCH net-next] selftests/xsk: preserve UMEM view in bidi test
> 
> Do you want it in net? Either way - we'll need a rebase

I have not checked if this has been -net propagated already, but the rule
of thumb on bpf side was that all selftests related effort goes to -next.
Is it different on netdev side?

> 
> > Signed-off-by: Maciej Fijalkowski maciej.fijalkowski@intel.com
> 
> missing <> around the email 

oof.

> 

^ permalink raw reply

* Re: [PATCH 5.10] netfilter: nf_log: validate MAC header was set before dumping it
From: Sasha Levin @ 2026-06-25 10:41 UTC (permalink / raw)
  To: stable, Greg Kroah-Hartman
  Cc: Sasha Levin, Alexander Martyniuk, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, Jakub Kicinski,
	Patrick McHardy, netfilter-devel, coreteam, netdev, linux-kernel,
	Weiming Shi, Xiang Mei
In-Reply-To: <20260624140117.19799-1-alexevgmart@gmail.com>

> [PATCH 5.10] netfilter: nf_log: validate MAC header was set before
> dumping it
>
> --- a/net/ipv4/netfilter/nf_log_ipv4.c
> +++ b/net/ipv4/netfilter/nf_log_ipv4.c

Thanks for the backport - the retarget to nf_log_ipv4.c is right for 5.10.

One gap though: upstream fixed both loggers via the consolidated
nf_log_syslog.c, but in 5.10 the IPv6 logger (net/ipv6/netfilter/
nf_log_ipv6.c) still has the identical unguarded fallback and is left
vulnerable here - which is also Pablo's "why only 5.10?" point.

--
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH 5.15/6.1/6.6] af_unix: Reject SIOCATMARK on non-stream sockets
From: Sasha Levin @ 2026-06-25 10:41 UTC (permalink / raw)
  To: stable, Greg Kroah-Hartman
  Cc: Sasha Levin, Alexander Martyniuk, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Kuniyuki Iwashima, Jann Horn, Lee Jones, Rao Shoaib,
	netdev, linux-kernel, stable, Yuan Tan, Yifan Wu, Juefei Pu,
	Xin Liu, Jiexun Wang, Ren Wei
In-Reply-To: <20260624151651.38894-1-alexevgmart@gmail.com>

> [PATCH 5.15/6.1/6.6] af_unix: Reject SIOCATMARK on non-stream sockets
>
> Backport fix for CVE-2026-52928. Reject SIOCATMARK in unix_ioctl()
> for non-stream sockets.

Queued for 6.6, 6.1 and 5.15, thanks!

--
Thanks,
Sasha

^ permalink raw reply

* Re: Please backport bridge multicast exponential field encoding fix series to stable kernels
From: Sasha Levin @ 2026-06-25 10:42 UTC (permalink / raw)
  To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Nikolay Aleksandrov, Ido Schimmel, David Ahern,
	Shuah Khan, Andy Roulin, Yong Wang, Petr Machata, stable, Greg KH,
	Greg Kroah-Hartman
  Cc: Sasha Levin, Ujjal Roy, bridge, Kernel, Kernel, linux-kselftest,
	Ujjal Roy
In-Reply-To: <CAE2MWknz4X_gcNo6jkR87Lg8F0zfubkOc4Ujr57CS3aBMWrjEA@mail.gmail.com>

> Please backport the 5-patch bridge multicast exponential field
> encoding series (726fa7da2d8c, 12cfb4ecc471, 95bfd196f0dc,
> e51560f4220a, 529dbe762de0) to the stable kernels.

I tried, but it doesn't apply to 7.1. Could you provide a backport please?

--
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH net v3 1/2] iov_iter: export iov_iter_restore
From: Christian Brauner @ 2026-06-25 10:43 UTC (permalink / raw)
  To: Octavian Purdila
  Cc: netdev, Alexander Viro, Andrew Morton, Arseniy Krasnov,
	David S. Miller, Eric Dumazet, Eugenio Pérez, Jakub Kicinski,
	Jason Wang, kvm, linux-block, linux-fsdevel, linux-kernel,
	Michael S. Tsirkin, Paolo Abeni, Simon Horman, Stefan Hajnoczi,
	Stefano Garzarella, virtualization, Xuan Zhuo, Jens Axboe
In-Reply-To: <20260622222757.2130402-2-tavip@google.com>

> Export iov_iter_restore so that it can be used by modules.
> 
> This is needed by the virtio vsock transport (which can be built as a
> module) to restore the msg_iter state when transmission fails.
> 
> Acked-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Octavian Purdila <tavip@google.com>
>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 273919b16161..f5df63961fb2 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1491,6 +1491,7 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
>  		i->__iov -= state->nr_segs - i->nr_segs;
>  	i->nr_segs = state->nr_segs;
>  }
> +EXPORT_SYMBOL_GPL(iov_iter_restore);

At least only export it for the module that really needs it. For
example, see:

EXPORT_SYMBOL_FOR_MODULES(__kernel_write, "autofs4");

-- 
Christian Brauner <brauner@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-25 10:46 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <b7de216a-fd1a-42a0-8711-d822a1ad9319@lunn.ch>

Hi Andrew,

On 5/29/26 14:59, Andrew Lunn wrote:

(This discussion was a while ago, but this bit of context should be enough)

> But we also need to consider that for some APIs, we have decided that
> a configuration can be set now, which does not actually apply in our
> current conditions, but it will be stored away for when conditions
> change and it is applicable. The half duplex case could fit that. When
> the link is currently half duplex, you can configure pause, but you
> don't expect it to actually change the current behaviour. It only
> kicks in when the link renegotiates to full duplex sometime in the
> future. We have to also consider this the other way around. The link
> is full duplex and pause is configured by the user. Something happens
> with the LP and the link renegotiates to half duplex. The local end
> should not throw away the configuration, it simply cannot apply it
> given the current situation.

I'm writing the test description for HD with a better formatting, so the
HD test wouldn't be about "are we using pause stuff while in HD" as it
doesn't make sense, but rather "do we correctly store the pause settings
aside for later".

I'm realising that we don't really have an API to report the *true* in-use pause
settings. Taking HD as an example :

# ethtool -s eth2 duplex half

[588209.379363] mvpp2 f4000000.ethernet eth2: Link is Up - 100Mbps/Half - flow control off

# ethtool eth2
	[...]
	Supported pause frame use: Symmetric Receive-only
	Advertised pause frame use: Symmetric Receive-only
	Link partner advertised pause frame use: Symmetric Receive-only

# ethtool -a eth2
Autonegotiate:	on
RX:		off
TX:		off
RX negotiated: on
TX negotiated: on

Sure, pause and HD don't make sense, however what I find confusing to some
extent is that the only place we have information about the *actual* pause
settings is the "link is Up" log in dmesg.

Maybe the problem in the above situation is that whoever advertises
half-duplex only modes should also not advertise pause ?

Still, I'm wondering if we should even care about all that actually, HD and
Pause are incompatible, and that's it. If you have any thought on this, let
me know.

Maxime

^ permalink raw reply

* Re: [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: Jakub Sitnicki @ 2026-06-25 10:48 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Michal Luczaj, Willem de Bruijn, John Fastabend, Jiayuan Chen,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Alexei Starovoitov, Cong Wang, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, netdev, bpf, linux-kernel,
	linux-kselftest
In-Reply-To: <CAAVpQUAy=EVZcZRmXJPr=neJh7Q+UbYML5L4+nBynVrDPidUkw@mail.gmail.com>

On Wed, Jun 24, 2026 at 02:39 PM -07, Kuniyuki Iwashima wrote:
> On Wed, Jun 24, 2026 at 2:33 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>>
>> On Wed, Jun 24, 2026 at 2:26 PM Michal Luczaj <mhal@rbox.co> wrote:
>> >
>> > On 6/24/26 22:01, Willem de Bruijn wrote:
>> > > Jakub Sitnicki wrote:
>> > >> On Tue, Jun 23, 2026 at 08:03 PM +02, Michal Luczaj wrote:
>> > >>> UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
>> > >>> sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.
>> > >>>
>> > >>> Because sockmap accepts unbound UDP sockets, a BPF program can increment a
>> > >>> socket's refcount via lookup. If the socket is subsequently bound, the
>> > >>> transition from unbound to bound causes bpf_sk_release() to skip the
>> > >>> decrement of the refcount, causing a memory leak.
>> > >>>
>> > >>> unreferenced object 0xffff88810bc2eb40 (size 1984):
>> > >>>   comm "test_progs", pid 2451, jiffies 4295320596
>> > >>>   hex dump (first 32 bytes):
>> > >>>     7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00  ................
>> > >>>     02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
>> > >>>   backtrace (crc bdee079d):
>> > >>>     kmem_cache_alloc_noprof+0x557/0x660
>> > >>>     sk_prot_alloc+0x69/0x240
>> > >>>     sk_alloc+0x30/0x460
>> > >>>     inet_create+0x2ce/0xf80
>> > >>>     __sock_create+0x25b/0x5c0
>> > >>>     __sys_socket+0x119/0x1d0
>> > >>>     __x64_sys_socket+0x72/0xd0
>> > >>>     do_syscall_64+0xa1/0x5f0
>> > >>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> > >>>
>> > >>> Maintain balanced refcounts across sk lookup/release: (re-)set
>> > >>> SOCK_RCU_FREE on proto update to treat the socket (whether bound or
>> > >>> unbound) as not requiring a refcount increment on (a RCU protected) lookup.
>> > >>>
>> > >>> Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
>> > >>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>> > >>> ---
>> > >>> Note: this issue is related to commit 67312adc96b5 ("bpf: reject unhashed
>> > >>> sockets in bpf_sk_assign").
>> > >>> ---
>> > >>>  net/ipv4/udp_bpf.c | 3 +++
>> > >>>  1 file changed, 3 insertions(+)
>> > >>>
>> > >>> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
>> > >>> index ad57c4c9eaab..970327b59582 100644
>> > >>> --- a/net/ipv4/udp_bpf.c
>> > >>> +++ b/net/ipv4/udp_bpf.c
>> > >>> @@ -173,6 +173,9 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>> > >>>     if (sk->sk_family == AF_INET6)
>> > >>>             udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
>> > >>>
>> > >>> +   /* Treat all sockets as non-refcounted, regardless of binding state. */
>> > >>> +   sock_set_flag(sk, SOCK_RCU_FREE);
>> > >>> +
>> > >>>     sock_replace_proto(sk, &udp_bpf_prots[family]);
>> > >>>     return 0;
>> > >>>  }
>> > >>
>> > >> There is a side effect that an unhashed (unbound) UDP socket can now be
>> > >> selected in sk_lookup with bpf_sk_assign.
>> > >
>> > > The commit does mention a related fix, beneath the ---, commit
>> > > 67312adc96b5 ("bpf: reject unhashed sockets in bpf_sk_assign").
>> > > That fixes a similar issue by exactly disallowing this:
>> > >
>> > >     Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
>> > >     This matches the behaviour of __inet_lookup_skb which is ultimately
>> > >     the goal of bpf_sk_assign().
>> > >
>> > > So ..
>> > >
>> > >> Though perhaps that's for the
>> > >> better because TC bpf_sk_assign doesn't reject non-refcounted UDP
>> > >> sockets either, so we would have both socket dispatch sites behave the
>> > >> same way.
>> > >
>> > > .. there are two conflicting types of consistency here? Consistent with
>> > > __inet_lookup_skb or the TC bpf hook. Of those the first is the more
>> > > canonical.
>> > >
>> > >> Also, with this patch, if we insert & remove an unhashed UDP socket
>> > >> into/from a sockmap, we end up with an unhashed non-refcounted UDP
>> > >> socket. Not entirely sure if that is actually a problem or not.
>> > >>
>> > >> Willem, what is your take on having unhashed non-refcoted UDP sockets?
>> > >
>> > > I don't immediately see a problem, but I'm not an expert on SOCK_RCU_FREE.
>> >
>> > Perhaps it's worth mentioning that unhashed non-refcounted UDP socket is
>> > already possible: first auto-bind via connect(AF_INET) (which also sets
>> > SOCK_RCU_FREE), then unhash via connect(AF_UNSPEC).
>>
>> Setting SOCK_RCU_FREE itself should not cause a problem, but I think
>> we should take a step back.
>>
>> AFAIU, 0c48eefae712 was to allow putting AF_UNIX SOCK_DGRAM sockets
>> into sockmap, not to allow using unconnected UDP sockets in sk_lookup etc.
>>
>> Actually, v4 of the patch was implemented as such but did not get any feedback,
>> https://lore.kernel.org/bpf/20210508220835.53801-9-xiyou.wangcong@gmail.com/#t
>>
>> ... and v5 (the final commit) somehow removed the restriction for unconnected
>> UDP socket as well.
>> https://lore.kernel.org/bpf/20210704190252.11866-3-xiyou.wangcong@gmail.com/
>>
>> Given the initial use case, sockmap redirect, is still blocked by
>> TCP_ESTABLISHED
>> check in sock_map_redirect_allowed(), I feel there is no point in supporting
>> unconnected UDP sockets in sockmap.  It cannot get any skb from anywhere
>> (without buggy sk_lookup).
>
> s/unconnected/unhashed/g :)

Rejecting unhashed UDP sockets on insert to sockmap SGTM.
It is also in line with disable-problematic-cases strategy.

^ permalink raw reply

* Re: [PATCH net] netpoll: fix a use-after-free on shutdown path
From: Breno Leitao @ 2026-06-25 10:55 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Amerigo Wang, netdev, linux-kernel, vlad.wing, asantostc,
	kernel-team, stable
In-Reply-To: <20260624192513.33023e54@kernel.org>

On Wed, Jun 24, 2026 at 07:25:13PM -0700, Jakub Kicinski wrote:
> On Mon, 22 Jun 2026 08:01:23 -0700 Breno Leitao wrote:
> > +		 * synchronize_net() does not protect the worker
> > +		 * (queue_process() is not an RCU reader). It fences the
> > +		 * senders -- the real RCU readers -- so they cannot re-arm
> > +		 * tx_work after the np->dev->npinfo was set to NULL.
> > +		 */
> > +		synchronize_net();
> > +		cancel_delayed_work_sync(&npinfo->tx_work);
> 
> Maybe we can avoid the sync_net and the comment by using
> disable_delayed_work_sync() ?

I've been thinking about it, and I think you have a good point.
queue_process() is the only place that take npinfo without RCU
protection.

This is what it happening right now:

CPU0 {
	run tx_work (queue_process())
	npinfo = container_of()...
	while {
A:		deqeue skb from the txq
		try to send
	}
}

CPU 1 {
	call_rcu() -> rcu_cleanup_netpoll_info()
	np->dev->npinfo, NULL
B:	kfree(npinfo);
}

Then, if B happens before A, we have the UAF. That said, if we make sure
that tx_work() is done, then we are OK with rcu_cleanup_netpoll_info

I am not totally sure if the order of pointer zero'ing and disabling
tx work is important, but, it doesn't seem so, any order would be OK
for:

	RCU_INIT_POINTER(np->dev->npinfo, NULL);
	disable_delayed_work_sync(&npinfo->tx_work);

Given that npinfo is not read inside queue_process(), then, order doesn't
matter.

Thanks for the point, I will update.
--breno

---
pw-bot: cr


^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Jani Nikula @ 2026-06-25 11:00 UTC (permalink / raw)
  To: Kaitao Cheng, David Laight, Christian König,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <0ed6b5c3-e955-46e2-9fc6-075a0dfd1c4f@linux.dev>

On Thu, 25 Jun 2026, Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> 在 2026/6/24 22:23, David Laight 写道:
>> On Wed, 24 Jun 2026 15:23:47 +0200
>> Christian König <christian.koenig@amd.com> wrote:
>>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>>> 在 2026/6/22 16:42, David Laight 写道:  
>>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>>  
>>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>>
>>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>>> every call site, even though most users only need it for the iterator
>>>>>> implementation and never reference it in the loop body.
>>>>>>
>>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>>> a unique internal cursor.  
>>>>>
>>>>> I'm not really sure 'mutable' means anything either.
>>>>> It is possible to make it valid for the loop body (or even other threads)
>>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>>
>>>>> It might be worth doing something that doesn't need the extra variable,
>>>>> but there is little point doing all the churn just to rename things.
>>>>>  
>>>>>>
>>>>>> This makes call sites that only mutate the list through the current entry
>>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>>> compatibility.
>>>>>>
>>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>> ---
>>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>>> --- a/include/linux/list.h
>>>>>> +++ b/include/linux/list.h
>>>>>> @@ -7,6 +7,7 @@
>>>>>>  #include <linux/stddef.h>
>>>>>>  #include <linux/poison.h>
>>>>>>  #include <linux/const.h>
>>>>>> +#include <linux/args.h>
>>>>>>  
>>>>>>  #include <asm/barrier.h>
>>>>>>  
>>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>>  #define list_for_each_prev(pos, head) \
>>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>>  
>>>>>> -/**
>>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>>> - * @head:	the head for your list.
>>>>>> +/*
>>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>>   */
>>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>>  	     !list_is_head(pos, (head)); \
>>>>>>  	     pos = n, n = pos->next)
>>>>>>  
>>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>>
>>>>> Use auto
>>>>>  
>>>>>> +	     !list_is_head(pos, (head));				\
>>>>>> +	     pos = tmp, tmp = pos->next)
>>>>>> +
>>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>>> +
>>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>>> +	list_for_each_safe(pos, next, head)
>>>>>> +
>>>>>>  /**
>>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>>> - * @head:	the head for your list.
>>>>>> + * @...:	either (head) or (next, head)
>>>>>> + *
>>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>>> + *		the caller.
>>>>>> + * head:	the head for your list.
>>>>>> + */
>>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>>> +		(pos, __VA_ARGS__)  
>>>>>
>>>>> The variable argument count logic really just slows down compilation.
>>>>> Maybe there aren't enough copies of this code to make that significant.
>>>>> But just because you can do it doesn't mean it is a gooD idea.
>>>>> I'm also not sure it really adds anything to the readability.
>>>>>
>>>>> And, it you are going to make the middle argument optional there is
>>>>> no need to change the macro name.  
>>>>
>>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>>> implementation approach. If we abandon that method, it means we will
>>>> inevitably need to add some new macros. If mutable is not a good name,
>>>> suggestions for better alternatives would be welcome; coming up with a
>>>> suitable name is indeed rather tricky.  
>>>
>>> I don't think you need to add a new macro for the specific use case that people want to modify the next element of the iteration.
>>>
>>> If I remember your numbers correctly that is a really corner case and keeping using the existing *_safe() macros for that sounds perfectly fine to me.
>> 
>> IIRC currently you have a choice of either:
>> 	define               Item that can't be deleted
>> 	list_for_each()	     The current item.
>> 	list_for_each_safe() The next item.
>> There is also likely to be code that updates the variables to allow
>> for other scenarios.
>> 
>> Note that if increase a reference count and release a lock then list_for_each()
>> is likely safer than list_for_each_safe() :-)
>> 
>> list.h has 9 variants of the 'safe' loop.
>> The bloat of another 9 is getting excessive.
>> 
>> It has to be said that this is one of my least favourite type of list...
>
> Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
> Andy Shevchenko, Alexei Starovoitov
>
> For ease of discussion, I need to summarize the currently possible
> approaches and briefly describe their respective pros and cons,
> using the list_for_each_entry* interfaces as examples.
>
> 1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
> and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
> would be used specifically for safe deletion scenarios that do not
> need to expose the temporary cursor externally. The code can refer to
> the v1 version.
>
> Pros: Does not depend on immediate per-subsystem adaptation and can be
>       merged directly.
> Cons: Requires adding a whole set of mutable interfaces, which makes the
>       code somewhat redundant.

Seems fine, and the original _safe naming is ambiguous anyway.

> 2. Directly optimize away the temporary cursor in list_for_each_entry_safe
> and define it inside the loop instead, changing the interface from four
> arguments to three.
>
> Pros: Does not add redundant interfaces.
> Cons: (1) Users need to manually update special cases that use the
>       traversal variable of list_for_each_entry_safe, the new
>       list_for_each_entry_safe would no longer apply there and would
>       need to be open-coded.
>       (2) Because the macro arguments changes, all list_for_each_entry_safe
>       callers would need to be modified and merged together, making it
>       difficult to merge such a large amount of code at once.

This won't fly because there are literally thousands of
list_for_each_entry_safe() users.

> 3. Use a variadic macro approach to optimize list_for_each_entry_safe,
> so that it supports both three and four arguments.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) Does not depend on immediate per-subsystem adaptation and can
>       be merged directly.
> Cons: (1) Increases compile time.
>       (2) Makes the interface harder for users to use.

Basically I'm against any variadic macro tricks where the optional
argument is not the last argument. That's just way too surprising, and
goes against common practice in just about all other languages.

> 4. Optimize list_for_each_entry by defining the temporary cursor internally,
> making it compatible with the functionality of list_for_each_entry_safe.
> The code can refer to the v2 version.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) The number of externally visible arguments of list_for_each_entry
>       remains unchanged, still three.
> Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
>       into one, and list_for_each_entry_safe would gradually be deprecated.
>       (2) Users need to manually update special cases that use the traversal
>       variable of list_for_each_entry, the new list_for_each_entry would no
>       longer apply there and would need to be open-coded. There are 15 such
>       cases in total.

This sounds good to me, though I take it there's some code size increase
and/or performance penalty?

Maybe the 15 cases are questionable anyway?

> 5. Use a variadic macro approach to optimize list_for_each_entry, so that
> it supports both three and four arguments.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) Does not depend on immediate per-subsystem adaptation and can be
>       merged directly.
> Cons: (1) Increases compile time.
>       (2) list_for_each_entry and list_for_each_entry_safe would be merged
>       into one, and list_for_each_entry_safe would gradually be deprecated.

Please don't do the macro tricks.

> 6. Make no changes, keep the current logic unchanged, and close the current
> email discussion.

I like hiding the temporary stuff when possible.


BR,
Jani.

-- 
Jani Nikula, Intel

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox