Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Maxime Chevallier @ 2026-06-18 15:34 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Zhou, Yun
  Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-kernel
In-Reply-To: <20260618150440.cLDwgyDM@linutronix.de>

Hi,

On 6/18/26 17:04, Sebastian Andrzej Siewior wrote:
> On 2026-06-18 21:56:12 [+0800], Zhou, Yun wrote:
>>
>>> But if the thread is idle then you have one enable too many, don't you?
>>> Well you have the NAPI callback which does disable on the local CPU and
>>> this resume which enables it on every CPU. So this does not look right.
>>>
>>
>> The enable in resume is intentionally unconditional and idempotent
>> (writing MPIC_INT_CLEAR_MASK on an already unmasked IRQ is a no-op).
>>
>>> The interesting question is what happens to the enable_percpu_irq() from
>>> the mvneta_poll(). Is it lost? And if so, how/ why?
>>>
>>
>> The enable_percpu_irq() from mvneta_poll is not "lost" — it never
>> gets a chance to execute. The sequence is:
>>
>> 1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
>> 2. PM freezes kthreads (on PREEMPT_RT, softirq runs in kthread)
> 
> But napi_schedule() runs in ksoftird, doesn't it? The per-CPU IRQs are
> not threaded. That does not look optimal.
> 
>> 3. NAPI poll cannot run → enable_percpu_irq() is never called
>> 4. mvneta_stop_dev → napi_disable(): cancels the scheduled poll
>>     but does NOT execute the completion path (no enable_percpu_irq)
> 
> napi_schedule() sets NAPIF_STATE_SCHED.
> napi_disable() sets NAPI_STATE_DISABLE and waits until NAPIF_STATE_SCHED
> is cleared. So if NAPIF_STATE_SCHED was set then enable_percpu_irq()
> will be invoked unless it leaves somewhere early.
> However, if DISABLED was already set then it disables the IRQ source but
> does not schedule NAPI. This is probably what happens.
> 
>> 5. Resume → napi_enable(): resets NAPI state but MPIC stays masked
>>
>> The unconditional enable in resume covers this case. When NAPI was
>> idle at suspend time, the extra enable is harmless.
> 
> There is no desc::depth counting here, that got me confused. But that
> per-CPU irq is not optimal. Is this a SoC limitation or a design choice?

I _think_ that on mvneta, the mapping is done by assigning queues to CPUs
directly, and then you have per-cpu register banks to handle the interrupts :

https://elixir.bootlin.com/linux/v7.1/source/drivers/net/ethernet/marvell/mvneta.c#L138

This seems to be confirmed by some comments here [1] :

static void mvneta_percpu_mask_interrupt(void *arg)
{
	struct mvneta_port *pp = arg;

	/* All the queue are masked, but actually only the ones
	 * mapped to this CPU will be masked
	 */
	mvreg_write(pp, MVNETA_INTR_NEW_MASK, 0);
	mvreg_write(pp, MVNETA_INTR_OLD_MASK, 0);
	mvreg_write(pp, MVNETA_INTR_MISC_MASK, 0);
}

These registers will control different queue's interrupt behaviour depending
on which CPU executes that code...

[1] : https://elixir.bootlin.com/linux/v7.1/source/drivers/net/ethernet/marvell/mvneta.c#L1441

This may be why the design ended-up that way. I'm not saying this is ideal
though :)

The percpu interrupt mechanism isn't used on armada 3700 (hence all the special
conditions) because IIRC the interrupt routing is flawed for the network part, and
all interrupts end-up on CPU0 anyways...

Maxime


^ permalink raw reply

* Re: [PATCH net 4/5] net: hns3: differentiate autoneg default values between copper and fiber
From: Simon Horman @ 2026-06-18 15:34 UTC (permalink / raw)
  To: shaojijie
  Cc: Simon Horman, davem, edumazet, kuba, pabeni, andrew+netdev,
	shenjian15, liuyonglong, chenhao418, huangdonghua3, yangshuaisong,
	netdev, linux-kernel
In-Reply-To: <20260617112721.75186-5-shaojijie@huawei.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: hns3: differentiate autoneg default values between copper and fiber

This commit addresses a link loss issue during driver initialization on
optical ports connected to forced-mode remote switches. It modifies
hclge_init_ae_dev to set autonegotiation based on media type. Copper ports
retain the AUTONEG_ENABLE default, while optical ports inherit the firmware
pre-configured autoneg setting.

> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> @@ -11916,6 +11916,9 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
>  	if (ret)
>  		goto err_ptp_uninit;
>  
> +	if (hdev->hw.mac.media_type != HNAE3_MEDIA_TYPE_COPPER)
> +		hdev->hw.mac.req_autoneg = hdev->hw.mac.autoneg;
> +

[Severity: High]
Does this apply the wrong speed when autonegotiation is disabled on fiber
ports?

During driver initialization, hclge_update_port_info() dynamically probes
the SFP module and updates mac->speed, but does not update mac->req_speed,
which retains its statically configured default capability.

When hdev->hw.mac.autoneg is 0 (forced-mode), req_autoneg becomes 0, and
then hclge_set_autoneg_speed_dup() uses req_speed:

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:hclge_set_autoneg_speed_dup() {
...
    if (!hdev->hw.mac.req_autoneg) {
        ret = hclge_cfg_mac_speed_dup_hw(hdev, hdev->hw.mac.req_speed,
...
}

Could this override the dynamically probed SFP speed with the static default
speed?

[Severity: High]
Does this unconditionally disable autonegotiation for V1 optical hardware?

Looking at hclge_update_port_info(), it only queries SFP information via
hclge_get_sfp_info() for V2 hardware. For V1 hardware, it falls back to
hclge_get_sfp_speed(), which leaves mac->autoneg uninitialized (0):

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:hclge_update_port_info() {
...
    if (hdev->ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V2) {
        speed = mac->speed;
        ret = hclge_get_sfp_info(hdev, mac);
    } else {
        speed = HCLGE_MAC_SPEED_UNKNOWN;
        ret = hclge_get_sfp_speed(hdev, &speed);
    }
...
}

If mac->autoneg is left as 0, wouldn't req_autoneg unconditionally be set
to 0 here, overriding the speed and breaking connectivity for V1 devices?

>  	ret = hclge_set_autoneg_speed_dup(hdev);
>  	if (ret) {
>  		dev_err(&pdev->dev,

^ permalink raw reply

* Re: [PATCH net] net: airoha: fix BQL underflow and UAF in shared QDMA TX ring
From: Lorenzo Bianconi @ 2026-06-18 15:42 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Wayen Yan, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260618-airoha-bql-fixes-v1-1-ffd2c2089518@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 7943 bytes --]

> When multiple netdevs share a QDMA TX ring and one device is stopped,
> netdev_tx_reset_subqueue() zeroes that device's BQL counters while its
> pending skbs remain in the shared HW TX ring. When NAPI later completes
> those skbs via netdev_tx_completed_queue(), the already-zeroed
> dql->num_queued counter underflows.
> Moreover, in the airoha_remove() path, netdevs are unregistered
> sequentially while skbs from previously unregistered netdevs may still
> reference freed net_device memory via skb->dev, causing a use-after-free
> during BQL accounting.
> Fix both issues:
> - Remove netdev_tx_reset_subqueue() from airoha_dev_stop() so pending
>   skbs are completed naturally by NAPI with proper BQL accounting.
> - Add netdev_tx_completed_queue() in airoha_qdma_cleanup_tx_queue()
>   to properly account for skbs freed during queue teardown.
> - Introduce airoha_qdma_tx_disable() to stop TX on all registered
>   netdevs for a given QDMA instance under RTNL lock.
> - Move DMA engine start/stop into probe/remove and
>   airoha_qdma_cleanup(), ensuring TX queues are cleaned up while all
>   netdevs are still registered and skb->dev is valid.
> 
> Fixes: 6df0488dc9dd ("net: airoha: fix BQL accounting in airoha_qdma_cleanup_tx_queue()")

This Fixes tag is wrong, the proper one is the one below:
Fixes: a9c2ca61fec7 ("net: airoha: Support multiple net_devices for a single FE GDM port")

Regards,
Lorenzo

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  drivers/net/ethernet/airoha/airoha_eth.c | 95 ++++++++++++++++++++++++--------
>  1 file changed, 72 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 64dde6464f3f..4d6a061cd779 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -1004,6 +1004,7 @@ static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
>  
>  		e = &q->entry[index];
>  		skb = e->skb;
> +		e->skb = NULL;
>  
>  		dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
>  				 DMA_TO_DEVICE);
> @@ -1147,6 +1148,42 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
>  	return 0;
>  }
>  
> +static void airoha_qdma_tx_disable(struct airoha_qdma *qdma)
> +{
> +	struct airoha_eth *eth = qdma->eth;
> +	int i;
> +
> +	/* Protect netdev->reg_state and netif_tx_disable() calls. */
> +	rtnl_lock();
> +
> +	for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
> +		struct airoha_gdm_port *port = eth->ports[i];
> +		int j;
> +
> +		if (!port)
> +			continue;
> +
> +		for (j = 0; j < ARRAY_SIZE(port->devs); j++) {
> +			struct airoha_gdm_dev *dev = port->devs[j];
> +			struct net_device *netdev;
> +
> +			if (!dev)
> +				continue;
> +
> +			if (dev->qdma != qdma)
> +				continue;
> +
> +			netdev = netdev_from_priv(dev);
> +			if (netdev->reg_state != NETREG_REGISTERED)
> +				continue;
> +
> +			netif_tx_disable(netdev);
> +		}
> +	}
> +
> +	rtnl_unlock();
> +}
> +
>  static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
>  {
>  	struct airoha_qdma *qdma = q->qdma;
> @@ -1158,13 +1195,20 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
>  	for (i = 0; i < q->ndesc; i++) {
>  		struct airoha_queue_entry *e = &q->entry[i];
>  		struct airoha_qdma_desc *desc = &q->desc[i];
> +		struct sk_buff *skb = e->skb;
>  
>  		if (!e->dma_addr)
>  			continue;
>  
>  		dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
>  				 DMA_TO_DEVICE);
> -		dev_kfree_skb_any(e->skb);
> +		if (skb) {
> +			struct netdev_queue *txq;
> +
> +			txq = skb_get_tx_queue(skb->dev, skb);
> +			netdev_tx_completed_queue(txq, 1, skb->len);
> +			dev_kfree_skb_any(skb);
> +		}
>  		e->dma_addr = 0;
>  		e->skb = NULL;
>  		list_add_tail(&e->list, &q->tx_list);
> @@ -1527,6 +1571,23 @@ static void airoha_qdma_cleanup(struct airoha_qdma *qdma)
>  {
>  	int i;
>  
> +	if (test_bit(DEV_STATE_INITIALIZED, &qdma->eth->state)) {
> +		u32 status;
> +
> +		airoha_qdma_tx_disable(qdma);
> +
> +		airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
> +				  GLOBAL_CFG_TX_DMA_EN_MASK |
> +				  GLOBAL_CFG_RX_DMA_EN_MASK);
> +		if (read_poll_timeout(airoha_qdma_rr, status,
> +				      !(status & (GLOBAL_CFG_TX_DMA_BUSY_MASK |
> +						  GLOBAL_CFG_RX_DMA_BUSY_MASK)),
> +				      USEC_PER_MSEC, 50 * USEC_PER_MSEC, true,
> +				      qdma, REG_QDMA_GLOBAL_CFG))
> +			dev_warn(qdma->eth->dev,
> +				 "QDMA DMA engine busy timeout\n");
> +	}
> +
>  	for (i = 0; i < ARRAY_SIZE(qdma->q_rx); i++) {
>  		if (!qdma->q_rx[i].ndesc)
>  			continue;
> @@ -1837,9 +1898,6 @@ static int airoha_dev_open(struct net_device *netdev)
>  	}
>  	port->users++;
>  
> -	airoha_qdma_set(qdma, REG_QDMA_GLOBAL_CFG,
> -			GLOBAL_CFG_TX_DMA_EN_MASK |
> -			GLOBAL_CFG_RX_DMA_EN_MASK);
>  	qdma->users++;
>  
>  	if (!airoha_is_lan_gdm_dev(dev) &&
> @@ -1880,12 +1938,9 @@ static int airoha_dev_stop(struct net_device *netdev)
>  	struct airoha_gdm_dev *dev = netdev_priv(netdev);
>  	struct airoha_gdm_port *port = dev->port;
>  	struct airoha_qdma *qdma = dev->qdma;
> -	int i;
>  
>  	netif_tx_disable(netdev);
>  	airoha_set_vip_for_gdm_port(dev, false);
> -	for (i = 0; i < netdev->num_tx_queues; i++)
> -		netdev_tx_reset_subqueue(netdev, i);
>  
>  	if (--port->users)
>  		airoha_set_port_mtu(dev->eth, port);
> @@ -1893,19 +1948,7 @@ static int airoha_dev_stop(struct net_device *netdev)
>  		airoha_set_gdm_port_fwd_cfg(qdma->eth,
>  					    REG_GDM_FWD_CFG(port->id),
>  					    FE_PSE_PORT_DROP);
> -
> -	if (!--qdma->users) {
> -		airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
> -				  GLOBAL_CFG_TX_DMA_EN_MASK |
> -				  GLOBAL_CFG_RX_DMA_EN_MASK);
> -
> -		for (i = 0; i < ARRAY_SIZE(qdma->q_tx); i++) {
> -			if (!qdma->q_tx[i].ndesc)
> -				continue;
> -
> -			airoha_qdma_cleanup_tx_queue(&qdma->q_tx[i]);
> -		}
> -	}
> +	qdma->users--;
>  
>  	return 0;
>  }
> @@ -3413,8 +3456,12 @@ static int airoha_probe(struct platform_device *pdev)
>  	if (err)
>  		goto error_netdev_free;
>  
> -	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
> +	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++) {
>  		airoha_qdma_start_napi(&eth->qdma[i]);
> +		airoha_qdma_set(&eth->qdma[i], REG_QDMA_GLOBAL_CFG,
> +				GLOBAL_CFG_TX_DMA_EN_MASK |
> +				GLOBAL_CFG_RX_DMA_EN_MASK);
> +	}
>  
>  	for_each_child_of_node(pdev->dev.of_node, np) {
>  		if (!of_device_is_compatible(np, "airoha,eth-mac"))
> @@ -3440,6 +3487,8 @@ static int airoha_probe(struct platform_device *pdev)
>  	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
>  		airoha_qdma_stop_napi(&eth->qdma[i]);
>  
> +	airoha_hw_cleanup(eth);
> +
>  	for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
>  		struct airoha_gdm_port *port = eth->ports[i];
>  		int j;
> @@ -3461,7 +3510,6 @@ static int airoha_probe(struct platform_device *pdev)
>  		}
>  		airoha_metadata_dst_free(port);
>  	}
> -	airoha_hw_cleanup(eth);
>  error_netdev_free:
>  	free_netdev(eth->napi_dev);
>  	platform_set_drvdata(pdev, NULL);
> @@ -3477,6 +3525,8 @@ static void airoha_remove(struct platform_device *pdev)
>  	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
>  		airoha_qdma_stop_napi(&eth->qdma[i]);
>  
> +	airoha_hw_cleanup(eth);
> +
>  	for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
>  		struct airoha_gdm_port *port = eth->ports[i];
>  		int j;
> @@ -3497,7 +3547,6 @@ static void airoha_remove(struct platform_device *pdev)
>  		}
>  		airoha_metadata_dst_free(port);
>  	}
> -	airoha_hw_cleanup(eth);
>  
>  	free_netdev(eth->napi_dev);
>  	platform_set_drvdata(pdev, NULL);
> 
> ---
> base-commit: 7d8297e26b4e20b5d1c3c3fe51fe81a1c7fbc823
> change-id: 20260618-airoha-bql-fixes-f57b2d108573
> 
> Best regards,
> -- 
> Lorenzo Bianconi <lorenzo@kernel.org>
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Zhou, Yun @ 2026-06-18 15:45 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, netdev, linux-kernel
In-Reply-To: <20260618150440.cLDwgyDM@linutronix.de>



On 6/18/2026 11:04 PM, Sebastian Andrzej Siewior wrote:
> CAUTION: This email comes from a non Wind River email account!
> Do not click links or open attachments unless you recognize the sender and know the content is safe.
> 
> On 2026-06-18 21:56:12 [+0800], Zhou, Yun wrote:
>>
>> The enable_percpu_irq() from mvneta_poll is not "lost" — it never
>> gets a chance to execute. The sequence is:
>>
>> 1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
>> 2. PM freezes kthreads (on PREEMPT_RT, softirq runs in kthread)
> 
> But napi_schedule() runs in ksoftird, doesn't it? The per-CPU IRQs are
> not threaded. That does not look optimal.
> 
Correct, percpu IRQs are not threaded. net_rx_action (which calls 
mvneta_poll and ultimately enable_percpu_irq) runs in softirq context 
via ksoftirqd.

>> 3. NAPI poll cannot run → enable_percpu_irq() is never called
>> 4. mvneta_stop_dev → napi_disable(): cancels the scheduled poll
>>      but does NOT execute the completion path (no enable_percpu_irq)
> 
> napi_schedule() sets NAPIF_STATE_SCHED.
> napi_disable() sets NAPI_STATE_DISABLE and waits until NAPIF_STATE_SCHED
> is cleared. So if NAPIF_STATE_SCHED was set then enable_percpu_irq()
> will be invoked unless it leaves somewhere early.
> However, if DISABLED was already set then it disables the IRQ source but
> does not schedule NAPI. This is probably what happens.

Your analysis makes perfect sense, and that scenario is indeed much more
likely. It looks like I'll need to update the commit message accordingly.

>> 5. Resume → napi_enable(): resets NAPI state but MPIC stays masked
>>
>> The unconditional enable in resume covers this case. When NAPI was
>> idle at suspend time, the extra enable is harmless.
> 
> There is no desc::depth counting here, that got me confused. But that
> per-CPU irq is not optimal. Is this a SoC limitation or a design choice?
> Having a NAPI instance with IRQ per queue and those configured and
> spread among CPUs should cause less trouble and is what others do.
> In fact is the only net driver using per-CPU interrupts.
> 
It is a SoC limitation. Armada XP's MPIC provides a single shared
interrupt for the ethernet controller with per-CPU masking for
interrupt steering — there are no per-queue MSI vectors. The percpu
IRQ model was the only way to distribute interrupt handling across
CPUs given this hardware constraint.

Newer Marvell SoCs (armada3700+) moved away from this model and use
standard IRQs, which is why the armada3700 path does not have this
problem.

^ permalink raw reply

* Fw: [Bug 221665] New: tcp_ecn: doc says default is 2, but should be 5
From: Stephen Hemminger @ 2026-06-18 15:52 UTC (permalink / raw)
  To: netdev



Begin forwarded message:

Date: Thu, 18 Jun 2026 13:05:29 +0000
From: bugzilla-daemon@kernel.org
To: stephen@networkplumber.org
Subject: [Bug 221665] New: tcp_ecn: doc says default is 2, but should be 5


https://bugzilla.kernel.org/show_bug.cgi?id=221665

            Bug ID: 221665
           Summary: tcp_ecn: doc says default is 2, but should be 5
           Product: Networking
           Version: 2.5
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: IPV4
          Assignee: stephen@networkplumber.org
          Reporter: pedretti.fabio@gmail.com
                CC: corbet@lwn.net, pabeni@redhat.com
        Regression: No

Here: https://docs.kernel.org/networking/ip-sysctl.html#:~:text=tcp_ecn

tcp_ecn - INTEGER
...
Default: 2


However it looks like the default was recently changed:

"In previous kernels, the value of tcp_ecn is, by default, two, [...] The new
default value is five, [...]"

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ae3e8e6ceedfb3cf74ca18169c942e073586a39

Please address this.
Thanks.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Sebastian Andrzej Siewior @ 2026-06-18 15:53 UTC (permalink / raw)
  To: Zhou, Yun
  Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, netdev, linux-kernel
In-Reply-To: <3a4ba3da-47ee-4da7-b0da-af500ff1a369@windriver.com>

On 2026-06-18 23:45:51 [+0800], Zhou, Yun wrote:
> > Having a NAPI instance with IRQ per queue and those configured and
> > spread among CPUs should cause less trouble and is what others do.
> > In fact is the only net driver using per-CPU interrupts.
> > 
> It is a SoC limitation. Armada XP's MPIC provides a single shared
> interrupt for the ethernet controller with per-CPU masking for
> interrupt steering — there are no per-queue MSI vectors. The percpu
> IRQ model was the only way to distribute interrupt handling across
> CPUs given this hardware constraint.

Is this a hardware constraint or more a software design choice? From the
other comment it read like it could be changed.
There is nothing wrong to provide 4 interrupts for a device from the
device-tree and then allocate and request all four. This requires that
SMP affinity is supported properly in order to spread it across CPU. You
would also be able to reduce the amount of queues/ interrupts via
ethtool if you would like to isolate a CPU for NOHZ reasons.

Sebastian

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH iwl-net 1/2] ice: skip per-VLAN promisc rules when default VSI Rx rule is set
From: Loktionov, Aleksandr @ 2026-06-18 16:01 UTC (permalink / raw)
  To: Oros, Petr, netdev@vger.kernel.org
  Cc: Vecera, Ivan, Alice Michael, Kitszel, Przemyslaw, Eric Dumazet,
	linux-kernel@vger.kernel.org, Andrew Lunn, Nguyen, Anthony L,
	Michal Swiatkowski, Keller, Jacob E, Jakub Kicinski, Paolo Abeni,
	David S. Miller, intel-wired-lan@lists.osuosl.org
In-Reply-To: <89efbea9831175e6f57e9fe8557f7a0e48e050b7.1781786935.git.poros@redhat.com>



> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of Petr Oros
> Sent: Thursday, June 18, 2026 5:09 PM
> To: netdev@vger.kernel.org
> Cc: Vecera, Ivan <ivecera@redhat.com>; Alice Michael
> <alice.michael@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Eric Dumazet <edumazet@google.com>;
> linux-kernel@vger.kernel.org; Andrew Lunn <andrew+netdev@lunn.ch>;
> Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Michal Swiatkowski
> <michal.swiatkowski@linux.intel.com>; Keller, Jacob E
> <jacob.e.keller@intel.com>; Jakub Kicinski <kuba@kernel.org>; Paolo
> Abeni <pabeni@redhat.com>; David S. Miller <davem@davemloft.net>;
> intel-wired-lan@lists.osuosl.org
> Subject: [Intel-wired-lan] [PATCH iwl-net 1/2] ice: skip per-VLAN
> promisc rules when default VSI Rx rule is set
> 
> When an ice port is part of a vlan-filtering bridge with a wide VLAN
> trunk and the netdev is in IFF_PROMISC (typical for bond slaves
> attached to a bridge), the driver installs per-VLAN
> ICE_SW_LKUP_PROMISC_VLAN entries (recipe 9) in addition to the broad
> ICE_SW_LKUP_DFLT VSI Rx rule (recipe 5). Each per-VLAN rule consumes
> one Flow Lookup Unit (FLU) entry from a fixed hardware pool of "up to
> 32K FLU entries" per device, documented in the E810 datasheet
> (613875-009 section 7.8.10, Table 7-18, page 1015).
> 
> With three active PFs sharing one switch context and a bridge trunk of
> vid 2-4094, the configuration would require roughly
> 
>   3 PFs * 4093 VLANs * 3 rules per VLAN per PF ~= 36,800 rules
> 
> which exceeds the 32K FLU budget. Firmware then responds to further
> Add Switch Rules requests with AQ retval 0x10 (LIBIE_AQ_RC_ENOSPC) and
> the user-visible failure surfaces as
> 
>   ice 0000:5c:00.1: Failed to set VSI 14 as the default forwarding
>                     VSI, error -5
>   ice 0000:5c:00.1 ens1f1: Error -5 setting default VSI 14 Rx rule
> 
> After a switch context has been driven into overrun, subsequent
> retries can come back as AQ retval 0x2 (LIBIE_AQ_RC_ENOENT), which has
> misled triage attempts toward a perceived recipe binding defect rather
> than a capacity issue.
> 
> When the DFLT VSI Rx rule is in place it catches every packet on the
> lport regardless of VLAN tag, so the per-VLAN PROMISC_VLAN expansion
> is redundant. The recipe 4 VLAN prune entries are still installed per
> VLAN and continue to track the allowed VID set, but the IFF_PROMISC
> sync path disables their enforcement on the VSI via
> vlan_ops->dis_rx_filtering() before ice_set_promisc() runs.
> ena_rx_filtering() is restored when IFF_PROMISC is cleared.
> 
> Skip the per-VLAN expansion at the two call sites that drive it:
> ice_set_promisc() falls through to ice_fltr_set_vsi_promisc() and
> ice_vlan_rx_add_vid() omits the per-VLAN ICE_MCAST_VLAN_PROMISC_BITS
> add. Plain IFF_ALLMULTI without an installed DFLT VSI rule is
> unchanged and still installs per-VLAN multicast promisc rules.
> 
> Both checks use ice_is_vsi_dflt_vsi() which inspects the recipe filter
> list for an installed DFLT rule on this VSI, not
> netdev->flags & IFF_PROMISC. The HW-state predicate avoids two
> regression vectors that a user-intent predicate would introduce:
> 
> 1. ice_lag_is_switchdev_running() short-circuits ice_set_dflt_vsi()
>    to return 0 without installing the DFLT rule for a PF in
>    switchdev LAG mode. An IFF_PROMISC-only check would also
>    suppress the per-VLAN fallback, leaving the PF with no rule.
> 
> 2. When ice_set_dflt_vsi() returns a non-EEXIST error (FLU
>    exhausted, switch context divergence), the driver clears
>    IFF_PROMISC from vsi->current_netdev_flags but the netdev's own
>    flags retain IFF_PROMISC. The user-intent predicate would still
>    suppress the per-VLAN fallback even though DFLT failed to
>    install.
> 
> The predicate is install-time only. The IFF_PROMISC off path closes
> the lifecycle gap in ice_vsi_exit_dflt_promisc(): for an IFF_ALLMULTI
> VSI with VLANs it reinstates the per-VID rules before clearing the
> default rule, so multicast coverage never lapses. If that AQ call
> fails the default rule is left in place, ice_vsi_exit_dflt_promisc()
> returns the error, and the sync_fltr pass bails with
> vsi->current_netdev_flags |= IFF_PROMISC; the current/netdev flag
> mismatch re-fires the IFF_PROMISC off path on the next sync. Clearing
> the default rule first would instead expose a window where neither the
> default rule nor the per-VID rules carry multicast.
> 
> If ice_clear_dflt_vsi() fails after the per-VID rules were reinstated
> they are deliberately not rolled back. Clearing the default rule is a
> removal that frees an FLU entry rather than allocating one, so it
> cannot fail for lack of space; a failure is a transient AdminQ error.
> The per-VID rules are the steady state for an IFF_ALLMULTI VLAN VSI,
> so the only redundant entry left behind is the single un-removed
> default rule, not the per-VID set. The retry re-enters this path,
> ice_fltr_set_vlan_vsi_promisc() returns -EEXIST for the rules that
> already exist so nothing is reallocated, and the default rule is
> removed on the next attempt. Rolling the per-VID rules back here would
> instead churn thousands of removes and re-adds on every retry.
> 
> After the default rule is gone the vid=0 PROMISC rule that paired with
> it is redundant and is dropped, but only to reclaim a filter entry, so
> a failure there is logged and does not abort the transition.
> 
> ice_set_vsi_promisc() and ice_clear_vsi_promisc() dispatch the recipe
> based on whether ICE_PROMISC_VLAN_RX/TX bits are present in the mask:
> with the bits set, recipe ICE_SW_LKUP_PROMISC_VLAN is used; otherwise
> ICE_SW_LKUP_PROMISC. The else branch in
> ice_set_promisc() installs the vid=0 rule in ICE_SW_LKUP_PROMISC.
> Because ice_clear_promisc() with VLANs present adds the VLAN bits and
> would search ICE_SW_LKUP_PROMISC_VLAN, the recipe mismatch would leave
> the vid=0 ICE_SW_LKUP_PROMISC rule orphaned when VLANs are configured.
> This is a single stale rule, not a per-cycle leak:
> re-adding it on the next promisc on returns -EEXIST rather than
> allocating a new entry. The set-time recipe is not recorded, so
> ice_clear_promisc() clears both recipes; clearing a rule that is not
> present succeeds, both clears run unconditionally, and the first error
> is returned.
> 
> The two VLAN-0 recipe transition blocks in ice_vlan_rx_add_vid() and
> ice_vlan_rx_kill_vid() that promote / demote the vid=0 rule between
> ICE_SW_LKUP_PROMISC and ICE_SW_LKUP_PROMISC_VLAN are likewise guarded
> by !ice_is_vsi_dflt_vsi(). With DFLT in place the
> vid=0 rule already covers every VID and a recipe swap would only
> install a redundant rule.
> 
> Lab reproduction on an E810-C with the same firmware family (4.80, NVM
> 1.3805.0, DDP 1.3.43.0) using four PFs in vlan-filtering bridges with
> vid 2-4094 and the slaves brought to IFF_PROMISC before the bridge
> VLAN bulk add:
> 
>   before fix:  ~12,279 AQ Add Switch Rules per PF, ENOSPC and ENOENT
>                responses in dmesg, DFLT VSI Rx rule install fails on
>                the affected PF
>   after fix:   ~4,093 AQ Add Switch Rules per PF, no AQ errors, DFLT
>                VSI Rx rule installs on every PF
> 
> The 66.7% reduction in installed switch rules per PF matches the
> expected per-VLAN saving: a single DFLT rule replaces the per-VID
> PROMISC_VLAN expansion.
> 
> Functional regression test with vid 2-100 trunk between two ice ports
> through the lab switch (40/40 PASS, 0 AQ errors, 0 ENOSPC at 4093-VID
> customer scale):
> 
>   vid 50 unicast, vid 100 unicast, vid 50 broadcast ARP,
>     vid 100 multicast IPv6 ND
>   vid 200/500/1500/4000 isolation (out-of-trunk) and untagged not
>     leaked: 0 packets reach any bridge endpoint
>   IGMP/MLD snooping, Jumbo MTU 9000, reserved-multicast STP BPDU
>   IFF_PROMISC + IFF_ALLMULTI transition (off while allmulti stays)
>   Regression reproducer for commit 1273f89578f2 ("ice: Fix broken
>     IFF_ALLMULTI handling"): allmulti on -> add vid -> allmulti off
>     -> allmulti on plus the orphan-rule Scenario 2; both converge
>     with no stale rules
>   100-VID, 1000-VID, 4093-VID stress cycles (5/3/2 iterations each)
>   switchdev mode toggle preserves IFF_PROMISC pruning state across
>     the session (vid 999 multicast received before and after the
>     legacy -> switchdev -> legacy cycle)
>   SR-IOV: VFs unaffected because ice_set_promisc() early-returns
>     for non-PF VSI and VF representors do not register
>     ndo_vlan_rx_add_vid
> 
> Fixes: 1273f89578f2 ("ice: Fix broken IFF_ALLMULTI handling")
> Signed-off-by: Petr Oros <poros@redhat.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_main.c | 90 ++++++++++++++++++----
> -
>  1 file changed, 70 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c
> b/drivers/net/ethernet/intel/ice/ice_main.c
> index 6d24056c247cf4..af8df81fc45623 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -274,7 +274,8 @@ static int ice_set_promisc(struct ice_vsi *vsi, u8
> promisc_m)
>  	if (vsi->type != ICE_VSI_PF)
>  		return 0;
> 
> -	if (ice_vsi_has_non_zero_vlans(vsi)) {
> +	/* skip per-VID expansion; the DFLT Rx rule already covers
> every VID */
> +	if (ice_vsi_has_non_zero_vlans(vsi) &&
> !ice_is_vsi_dflt_vsi(vsi)) {
>  		promisc_m |= (ICE_PROMISC_VLAN_RX |
> ICE_PROMISC_VLAN_TX);
>  		status = ice_fltr_set_vlan_vsi_promisc(&vsi->back->hw,
> vsi,
>  						       promisc_m);
> @@ -304,9 +305,19 @@ static int ice_clear_promisc(struct ice_vsi *vsi,
> u8 promisc_m)
>  		return 0;
> 
>  	if (ice_vsi_has_non_zero_vlans(vsi)) {

...

>  			ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi-
> >idx,
> 
> ICE_MCAST_VLAN_PROMISC_BITS,
>  						   0);
> --
> 2.53.0

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH iwl-net 2/2] ice: preserve uplink DFLT Rx rule on switchdev release
From: Loktionov, Aleksandr @ 2026-06-18 16:02 UTC (permalink / raw)
  To: Oros, Petr, netdev@vger.kernel.org
  Cc: Vecera, Ivan, Alice Michael, Kitszel, Przemyslaw, Eric Dumazet,
	linux-kernel@vger.kernel.org, Andrew Lunn, Nguyen, Anthony L,
	Michal Swiatkowski, Keller, Jacob E, Jakub Kicinski, Paolo Abeni,
	David S. Miller, intel-wired-lan@lists.osuosl.org
In-Reply-To: <deef5756e534ef06c12d910c5305d3fd205d30a0.1781786935.git.poros@redhat.com>



> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of Petr Oros
> Sent: Thursday, June 18, 2026 5:09 PM
> To: netdev@vger.kernel.org
> Cc: Vecera, Ivan <ivecera@redhat.com>; Alice Michael
> <alice.michael@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Eric Dumazet <edumazet@google.com>;
> linux-kernel@vger.kernel.org; Andrew Lunn <andrew+netdev@lunn.ch>;
> Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Michal Swiatkowski
> <michal.swiatkowski@linux.intel.com>; Keller, Jacob E
> <jacob.e.keller@intel.com>; Jakub Kicinski <kuba@kernel.org>; Paolo
> Abeni <pabeni@redhat.com>; David S. Miller <davem@davemloft.net>;
> intel-wired-lan@lists.osuosl.org
> Subject: [Intel-wired-lan] [PATCH iwl-net 2/2] ice: preserve uplink
> DFLT Rx rule on switchdev release
> 
> ice_eswitch_setup_env() calls ice_set_dflt_vsi() to install the
> ICE_SW_LKUP_DFLT Rx rule on the uplink VSI. The helper returns 0 even
> when the rule is already in place, so the call is a no-op if
> ice_vsi_sync_fltr() had previously installed the DFLT rule in response
> to IFF_PROMISC on the uplink netdev. ice_remove_vsi_fltr() called
> earlier in ice_eswitch_setup_env() does not affect this rule because
> ice_remove_vsi_lkup_fltr() lacks a case for ICE_SW_LKUP_DFLT and falls
> into its default branch which only logs. Switchdev mode then adds an
> ICE_FLTR_TX leg via ice_cfg_dflt_vsi() on the same VSI handle.
> 
> ice_eswitch_release_env() unconditionally removed both the Rx and Tx
> DFLT rules. When the Rx DFLT was installed by ice_vsi_sync_fltr()
> before the switchdev session started, this clobbered promisc state the
> operator had asked for: the DFLT Rx rule disappeared while IFF_PROMISC
> was still set on the netdev, and the IFF_PROMISC sync path was not
> retriggered, so the uplink ended the session without the catch-all
> rule the netdev flags requested.
> 
> Skip the Rx DFLT removal when the uplink is still promiscuous, both in
> ice_eswitch_release_env() and in the err_def_tx unwind of
> ice_eswitch_setup_env(). The Tx leg installed by switchdev is always
> removed since switchdev owns it.
> 
> The ena_rx_filtering() call earlier in ice_eswitch_release_env() is
> left unconditional: it calls ice_cfg_vlan_pruning(), which returns
> without enabling pruning while the netdev is in IFF_PROMISC, so it
> cannot re-enable VLAN pruning under the preserved DFLT rule and drop
> tagged traffic. Pruning is re-enabled later, when the IFF_PROMISC sync
> path runs after promisc is actually cleared.
> 
> Use vsi->current_netdev_flags rather than the live netdev->flags for
> this test. netdev->flags is written under RTNL by dev_change_flags(),
> while ice_eswitch_release_env() runs under devl_lock, so reading it
> here would be a TOCTOU against a concurrent promisc change. The
> IFF_PROMISC bit of current_netdev_flags is written only under
> ICE_CFG_BUSY by ice_vsi_sync_fltr(), and ice_set_rx_mode() gates that
> sync off for the uplink while ice_is_switchdev_running() is true. The
> bit is therefore frozen for the whole session and stable when
> release_env reads it.
> 
> Because the sync is gated off during the session, a promisc change the
> operator makes while switchdev runs never reaches ice_vsi_sync_fltr():
> current_netdev_flags keeps the value captured before the session while
> netdev->flags carries the new one. Once switchdev is torn down and
> pf->eswitch.is_running is cleared, schedule a filter sync from
> ice_eswitch_disable_switchdev() so the suppressed change is replayed
> and the DFLT Rx rule is reconciled with the current netdev flags. This
> also closes the window where release_env kept the rule based on the
> frozen flag but the operator had since cleared IFF_PROMISC.
> 
> Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment")
> Signed-off-by: Petr Oros <poros@redhat.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_eswitch.c | 32 +++++++++++++++++--
> -
>  1 file changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> index 2e4f0969035f77..b6073fc2375019 100644
> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> @@ -66,8 +66,10 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
>  	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
>  			 ICE_FLTR_TX);
>  err_def_tx:
> -	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
> -			 ICE_FLTR_RX);
> +	/* keep the Rx DFLT rule if still promiscuous (see release_env)
> */
> +	if (!(uplink_vsi->current_netdev_flags & IFF_PROMISC))
> +		ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
> +				 false, ICE_FLTR_RX);
>  err_def_rx:
>  	ice_vsi_del_vlan_zero(uplink_vsi);
>  err_vlan_zero:
> @@ -275,11 +277,23 @@ static void ice_eswitch_release_env(struct
> ice_pf *pf)
>  	vlan_ops = ice_get_compat_vsi_vlan_ops(uplink_vsi);
> 
>  	ice_vsi_update_local_lb(uplink_vsi, false);
> +	/* No-op while IFF_PROMISC is set: ice_cfg_vlan_pruning() self-
> gates on
> +	 * it, so this cannot re-enable VLAN pruning under a preserved
> DFLT rule.
> +	 */
>  	vlan_ops->ena_rx_filtering(uplink_vsi);
>  	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
>  			 ICE_FLTR_TX);
> -	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
> -			 ICE_FLTR_RX);
> +
> +	/* Keep the Rx DFLT rule if the uplink is still promiscuous; it
> must
> +	 * outlive the session. current_netdev_flags is used because
> its
> +	 * IFF_PROMISC bit only changes under ice_vsi_sync_fltr(),
> gated off
> +	 * during switchdev, so the read cannot race the RTNL netdev-
> >flags.
> +	 * Any change made during the session is replayed on teardown.
> +	 */
> +	if (!(uplink_vsi->current_netdev_flags & IFF_PROMISC))
> +		ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
> +				 false, ICE_FLTR_RX);
> +
>  	ice_fltr_add_mac_and_broadcast(uplink_vsi,
>  				       uplink_vsi->port_info-
> >mac.perm_addr,
>  				       ICE_FWD_TO_VSI);
> @@ -327,10 +341,20 @@ static int ice_eswitch_enable_switchdev(struct
> ice_pf *pf)
>   */
>  static void ice_eswitch_disable_switchdev(struct ice_pf *pf)  {
> +	struct ice_vsi *uplink_vsi = pf->eswitch.uplink_vsi;
> +
>  	ice_eswitch_br_offloads_deinit(pf);
>  	ice_eswitch_release_env(pf);
> 
>  	pf->eswitch.is_running = false;
> +
> +	/* ice_set_rx_mode() was gated off during the session; replay a
> filter
> +	 * sync so any suppressed promisc change reconciles the DFLT Rx
> rule.
> +	 */
> +	set_bit(ICE_VSI_UMAC_FLTR_CHANGED, uplink_vsi->state);
> +	set_bit(ICE_VSI_MMAC_FLTR_CHANGED, uplink_vsi->state);
> +	set_bit(ICE_FLAG_FLTR_SYNC, pf->flags);
> +	ice_service_task_schedule(pf);
>  }
> 
>  /**
> --
> 2.53.0

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH net v2] ice: eswitch: fix use-after-free of metadata_dst in repr release
From: Loktionov, Aleksandr @ 2026-06-18 16:02 UTC (permalink / raw)
  To: Doruk Tan Ozturk, Nguyen, Anthony L, Kitszel, Przemyslaw,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com
  Cc: michal.swiatkowski@linux.intel.com, Drewek, Wojciech,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	horms@kernel.org
In-Reply-To: <20260618145003.47471-1-doruk@0sec.ai>



> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of Doruk Tan Ozturk
> Sent: Thursday, June 18, 2026 4:50 PM
> To: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel,
> Przemyslaw <przemyslaw.kitszel@intel.com>; andrew+netdev@lunn.ch;
> davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com
> Cc: michal.swiatkowski@linux.intel.com; Drewek, Wojciech
> <wojciech.drewek@intel.com>; intel-wired-lan@lists.osuosl.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> stable@vger.kernel.org; horms@kernel.org
> Subject: [Intel-wired-lan] [PATCH net v2] ice: eswitch: fix use-after-
> free of metadata_dst in repr release
> 
> ice_eswitch_release_repr() frees the port representor metadata_dst via
> metadata_dst_free(), which directly kfree()s the object and ignores
> the dst_entry refcount. The eswitch slow-path TX routine
> ice_eswitch_port_start_xmit() takes a reference on this dst with
> dst_hold() and attaches it to the skb via skb_dst_set(). If such an
> skb is still in flight (e.g. queued in a qdisc) when the representor
> is torn down, the metadata_dst is freed while the skb still points at
> it. When the skb is later freed, dst_release() operates on already-
> freed memory.
> 
> Replace metadata_dst_free() with dst_release() so the metadata_dst is
> freed only after the last reference is dropped. The dst subsystem
> frees metadata_dst objects from dst_destroy() once the refcount
> reaches zero (DST_METADATA is set by metadata_dst_alloc()).
> 
> Same class of bug and fix as commit c32b26aaa2f9 ("netfilter:
> nft_tunnel: fix use-after-free on object destroy").
> 
> Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment")
> Cc: stable@vger.kernel.org
> Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
> Reviewed-by: Simon Horman <horms@kernel.org>
> ---
> v2:
>  - Correct the Fixes: tag to 1a1c40df2e80 ("ice: set and release
>    switchdev environment"); the previously cited fff292b47ac1 only
> moved
>    the affected code rather than introducing the unbalanced free, and
> the
>    bug dates back to when switchdev support was added (Simon Horman).
>  - Add Simon Horman's Reviewed-by. No functional change.
> 
>  drivers/net/ethernet/intel/ice/ice_eswitch.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> index 2e4f0969035f..41b30a7ca4a9 100644
> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> @@ -95,7 +95,7 @@ ice_eswitch_release_repr(struct ice_pf *pf, struct
> ice_repr *repr)
>  		return;
> 
>  	ice_vsi_update_security(vsi, ice_vsi_ctx_set_antispoof);
> -	metadata_dst_free(repr->dst);
> +	dst_release(&repr->dst->dst);
>  	repr->dst = NULL;
>  	ice_fltr_add_mac_and_broadcast(vsi, repr->parent_mac,
>  				       ICE_FWD_TO_VSI);
> --
> 2.43.0

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

^ permalink raw reply

* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Dust Li @ 2026-06-18 16:03 UTC (permalink / raw)
  To: hexlabsecurity, Wenjia Zhang, D. Wythe, Sidraya Jayagond
  Cc: Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
	Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
	Paolo Abeni, linux-kernel, linux-rdma, Jakub Kicinski, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-2-551fa514257e@proton.me>

On 2026-06-14 03:23:31, Bryam Vargas via B4 Relay wrote:
>From: Bryam Vargas <hexlabsecurity@proton.me>
>
>conn->bytes_to_rcv is accumulated in the receive tasklet from the peer's
>producer cursor:
>
>	diff_prod = smc_curs_diff(rmb_desc->len, &prod_old, &prod_new);
>	atomic_add(diff_prod, &conn->bytes_to_rcv);
>
>smc_curs_diff()'s differing-wrap branch returns (size - old.count) +
>new.count, which exceeds rmb_desc->len for a forged producer cursor and
>accumulates across CDC messages, so bytes_to_rcv can grow past the RMB
>(and across many messages can overflow the signed counter negative).
>smc_rx_recvmsg() reads it as the number of readable bytes and performs a
>wrap-around copy whose second chunk (copylen - first_chunk, read from
>ring offset 0) is never re-bounded to rmb_desc->len, reading past the
>RMB into adjacent kernel memory and disclosing it to the peer.

Hi Bryam,

Once we validate the CDC message at the input boundary (as in the
previous patch), bytes_to_rcv can never exceed rmb_desc->len, so
this check becomes unreachable. So I don't think this patch is needed.

Best regards,
Dust


>
>Bound the readable count to rmb_desc->len where it is consumed, treating
>a negative (sign-overflowed) value as out of range too, so the copy
>length can never exceed the ring.  This enforces the documented
>0 <= bytes_to_rcv <= rmb_desc->len invariant at the consumer, where it
>is race-free against the producer update that runs in the receive
>tasklet.
>
>Fixes: 952310ccf2d8 ("smc: receive data from RMBE")
>Cc: stable@vger.kernel.org
>Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>---
> net/smc/smc_rx.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
>diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
>index c1d9b923938d..f461cf10b085 100644
>--- a/net/smc/smc_rx.c
>+++ b/net/smc/smc_rx.c
>@@ -442,6 +442,18 @@ int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg,
> 		/* initialize variables for 1st iteration of subsequent loop */
> 		/* could be just 1 byte, even after waiting on data above */
> 		readable = smc_rx_data_available(conn, peeked_bytes);
>+		/* bytes_to_rcv is accumulated from the peer's wire-controlled
>+		 * producer cursor; a forged cursor can drive it past the RMB,
>+		 * or overflow the signed accumulator to a negative value across
>+		 * many CDC messages (which a plain "> len" check would miss
>+		 * before the size_t cast below turns it huge).  Bound it to the
>+		 * RMB in either case so the wrap-around copy cannot run past
>+		 * rmb_desc->len.  This enforces the documented
>+		 * 0 <= bytes_to_rcv <= rmb_desc->len invariant at the consumer,
>+		 * race-free against the producer update in the receive tasklet.
>+		 */
>+		if (readable < 0 || readable > conn->rmb_desc->len)
>+			readable = conn->rmb_desc->len;
> 		splbytes = atomic_read(&conn->splice_pending);
> 		if (!readable || (msg && splbytes)) {
> 			if (splbytes)
>
>-- 
>2.43.0
>

^ permalink raw reply

* Re: [PATCH v3 3/3] net/smc: bound the send length to the send buffer in smc_tx_sendmsg()
From: Dust Li @ 2026-06-18 16:08 UTC (permalink / raw)
  To: hexlabsecurity, Wenjia Zhang, D. Wythe, Sidraya Jayagond
  Cc: Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
	Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
	Paolo Abeni, linux-kernel, linux-rdma, Jakub Kicinski, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-3-551fa514257e@proton.me>

On 2026-06-14 03:23:32, Bryam Vargas via B4 Relay wrote:
>From: Bryam Vargas <hexlabsecurity@proton.me>
>
>On the SMC-D DMB-merge (nocopy) path, smc_cdc_msg_recv_action() advances
>conn->sndbuf_space from the peer's consumer cursor:
>
>	diff_tx = smc_curs_diff(sndbuf_desc->len, &tx_curs_fin,
>				&local_rx_ctrl.cons);
>	atomic_add(diff_tx, &conn->sndbuf_space);
>
>The consumer cursor is wire-controlled and unvalidated, and
>smc_curs_diff()'s differing-wrap branch can return more than
>sndbuf_desc->len, so a forged cursor drives sndbuf_space past the send
>buffer (and across many CDC messages can overflow the signed counter
>negative).  smc_tx_sendmsg() reads it as the available write space and
>performs a wrap-around copy whose second chunk (copylen - first_chunk,
>written at ring offset 0) is never re-bounded to sndbuf_desc->len,
>writing user data past the send buffer -- a heap out-of-bounds write of
>attacker-influenced length and content.

Hi Bryam,

I think this is the same as patch #2.

Best regards,
Dust


>
>Bound the write space to sndbuf_desc->len where it is consumed, treating
>a negative (sign-overflowed) value as out of range too, so the copy
>length can never exceed the ring.  This enforces the documented
>0 <= sndbuf_space <= sndbuf_desc->len invariant at the producer,
>race-free against the CDC tasklet that advances sndbuf_space.
>
>Fixes: cc0ab806fc52 ("net/smc: adapt cursor update when sndbuf and peer DMB are merged")
>Cc: stable@vger.kernel.org
>Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>---
> net/smc/smc_tx.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
>diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
>index 3144b4b1fe29..5916f02060fb 100644
>--- a/net/smc/smc_tx.c
>+++ b/net/smc/smc_tx.c
>@@ -233,6 +233,19 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
> 		/* initialize variables for 1st iteration of subsequent loop */
> 		/* could be just 1 byte, even after smc_tx_wait above */
> 		writespace = atomic_read(&conn->sndbuf_space);
>+		/* sndbuf_space is advanced from the peer's wire-controlled
>+		 * consumer cursor on the SMC-D DMB-merge path; a forged cursor
>+		 * can inflate it past the send buffer, or overflow the signed
>+		 * accumulator to a negative value across many CDC messages
>+		 * (which a plain "> len" check would miss before the size_t
>+		 * cast below turns it huge).  Bound it to the send buffer in
>+		 * either case so the wrap-around write cannot run past
>+		 * sndbuf_desc->len.  This enforces the documented
>+		 * 0 <= sndbuf_space <= sndbuf_desc->len invariant at the
>+		 * producer, race-free against the CDC tasklet.
>+		 */
>+		if (writespace < 0 || writespace > conn->sndbuf_desc->len)
>+			writespace = conn->sndbuf_desc->len;
> 		/* not more than what user space asked for */
> 		copylen = min_t(size_t, send_remaining, writespace);
> 		/* determine start of sndbuf */
>
>-- 
>2.43.0
>

^ permalink raw reply

* [PATCH net 0/6] ipv6: fix sysctl error handling and missing notifications
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera

While working on a different IPv6 patch series I have spotted multiple
minor bugs around sysctl error handling and notifications. In general,
they are not serious issues.

In addition, there is one more issue in forwarding sysctl as it does not
check for CAP_NET_ADMIN for the namespace. I am keeping that patch out
of this series and I am aiming it at the net-next tree once it re-opens.

Fernando Fernandez Mancera (6):
  ipv6: fix error handling in disable_ipv6 sysctl
  ipv6: fix error handling in ignore_routes_with_linkdown sysctl
  ipv6: fix error handling in forwarding sysctl
  ipv6: fix error handling in disable_policy sysctl
  ipv6: reset value and position for proxy_ndp sysctl restart
  ipv6: fix missing notification for ignore_routes_with_linkdown

 net/ipv6/addrconf.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

-- 
2.54.0

^ permalink raw reply

* [PATCH net 1/6] ipv6: fix error handling in disable_ipv6 sysctl
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

When writing to the disable_ipv6 sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is overwriting that error for write operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: 56d417b12e57 ("IPv6: Add 'autoconf' and 'disable_ipv6' module parameters")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1f21ccb55caa..c901b444a995 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6467,6 +6467,8 @@ static int addrconf_sysctl_disable(const struct ctl_table *ctl, int write,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_disable_ipv6(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

* [PATCH net 4/6] ipv6: fix error handling in disable_policy sysctl
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

When writing to the disable_policy sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is resetting the position argument even if an error
occurred during proc_dointvec() and not only during sysctl restart.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 0b128b6cff60..8ff015975e27 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6769,6 +6769,8 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write
 	lctl = *ctl;
 	lctl.data = &val;
 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;
 
 	if (write && (*valp != val))
 		ret = addrconf_disable_policy(ctl, valp, val);
-- 
2.54.0


^ permalink raw reply related

* [PATCH net 2/6] ipv6: fix error handling in ignore_routes_with_linkdown sysctl
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

When writing to the ignore_routes_with_linkdown sysctl, if
proc_dointvec() fails to parse the input, it returns a negative error
code. The current implementation is overwriting that error for write
operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c901b444a995..70058d971205 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6671,6 +6671,8 @@ int addrconf_sysctl_ignore_routes_with_linkdown(const struct ctl_table *ctl,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_fixup_linkdown(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

* [PATCH net 3/6] ipv6: fix error handling in forwarding sysctl
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

When writing to the forwarding sysctl, if proc_dointvec() fails to parse
the input, it returns a negative error code. The current implementation
is overwriting that error for write operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: b325fddb7f86 ("ipv6: Fix sysctl unregistration deadlock")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 70058d971205..0b128b6cff60 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6370,6 +6370,8 @@ static int addrconf_sysctl_forward(const struct ctl_table *ctl, int write,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_fixup_forwarding(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

* [PATCH net 5/6] ipv6: reset value and position for proxy_ndp sysctl restart
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

When handling proxy_ndp, if rtnl_net_trylock() fails, the operation is
retried but as the value was already modified by the initial
proc_dointvec() call, the restarted syscall will read the newly modified
value as the 'old' state.

Fix this by restoring the original value and position pointer before
restarting the syscall.

Fixes: c92d5491a6d9 ("netconf: add support for IPv6 proxy_ndp")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 8ff015975e27..1cfb223476bd 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6483,8 +6483,9 @@ static int addrconf_sysctl_proxy_ndp(const struct ctl_table *ctl, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
 {
 	int *valp = ctl->data;
-	int ret;
+	loff_t pos = *ppos;
 	int old, new;
+	int ret;
 
 	old = *valp;
 	ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
@@ -6493,8 +6494,12 @@ static int addrconf_sysctl_proxy_ndp(const struct ctl_table *ctl, int write,
 	if (write && old != new) {
 		struct net *net = ctl->extra2;
 
-		if (!rtnl_net_trylock(net))
+		if (!rtnl_net_trylock(net)) {
+			/* Restore the original values before restarting */
+			*valp = old;
+			*ppos = pos;
 			return restart_syscall();
+		}
 
 		if (valp == &net->ipv6.devconf_dflt->proxy_ndp) {
 			inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
-- 
2.54.0


^ permalink raw reply related

* [PATCH net 6/6] ipv6: fix missing notification for ignore_routes_with_linkdown
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

When changing the ignore_routes_with_linkdown sysctl for a specific
interface, the RTM_NEWNETCONF netlink notification was not being emitted
to userspace. Fix this by emitting the notification when needed.

In addition, fix bogus return value for successful "all" and specific
interface write operation leading to a wrong reset of the position
pointer.

Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1cfb223476bd..2d81569b692b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -955,11 +955,7 @@ static int addrconf_fixup_linkdown(const struct ctl_table *table, int *p, int ne
 						     NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
 						     NETCONFA_IFINDEX_DEFAULT,
 						     net->ipv6.devconf_dflt);
-		rtnl_net_unlock(net);
-		return 0;
-	}
-
-	if (p == &net->ipv6.devconf_all->ignore_routes_with_linkdown) {
+	} else if (p == &net->ipv6.devconf_all->ignore_routes_with_linkdown) {
 		WRITE_ONCE(net->ipv6.devconf_dflt->ignore_routes_with_linkdown, newf);
 		addrconf_linkdown_change(net, newf);
 		if ((!newf) ^ (!old))
@@ -968,11 +964,21 @@ static int addrconf_fixup_linkdown(const struct ctl_table *table, int *p, int ne
 						     NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
 						     NETCONFA_IFINDEX_ALL,
 						     net->ipv6.devconf_all);
+	} else {
+		if (!newf ^ !old) {
+			struct inet6_dev *idev = table->extra1;
+
+			inet6_netconf_notify_devconf(net,
+						     RTM_NEWNETCONF,
+						     NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
+						     idev->dev->ifindex,
+						     &idev->cnf);
+		}
 	}
 
 	rtnl_net_unlock(net);
 
-	return 1;
+	return 0;
 }
 
 #endif
-- 
2.54.0


^ permalink raw reply related

* [PATCH 1/1] xfrm: nat_keepalive: avoid double free on send error
From: Ren Wei @ 2026-06-18 16:36 UTC (permalink / raw)
  To: netdev
  Cc: steffen.klassert, herbert, davem, eyal.birger, yuantan098, bird,
	qianyuluo3, n05ec
In-Reply-To: <cover.1781788385.git.qianyuluo3@gmail.com>

From: Qianyu Luo <qianyuluo3@gmail.com>

nat_keepalive_send() frees the keepalive skb whenever the IPv4 or IPv6
send helper reports an error.

That cleanup is only correct before the skb is handed to the output
path. Once ip_build_and_send_pkt() or ip6_xmit() takes ownership, the
networking stack may already have consumed the skb before returning an
error, so freeing it again is unsafe.

Handle the pre-handoff failure cases inside nat_keepalive_send_ipv4()
and nat_keepalive_send_ipv6(), where the caller still owns the skb, and
keep nat_keepalive_send() responsible only for family dispatch and the
unsupported-family cleanup path.

Fixes: f531d13bdfe3 ("xfrm: support sending NAT keepalives in ESP in UDP states")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Qianyu Luo <qianyuluo3@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
 net/xfrm/xfrm_nat_keepalive.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/xfrm/xfrm_nat_keepalive.c b/net/xfrm/xfrm_nat_keepalive.c
index 458931062a04..f71328096f11 100644
--- a/net/xfrm/xfrm_nat_keepalive.c
+++ b/net/xfrm/xfrm_nat_keepalive.c
@@ -55,8 +55,10 @@ static int nat_keepalive_send_ipv4(struct sk_buff *skb,
 			   ka->encap_sport, sock_net_uid(net, NULL));
 
 	rt = ip_route_output_key(net, &fl4);
-	if (IS_ERR(rt))
+	if (IS_ERR(rt)) {
+		kfree_skb(skb);
 		return PTR_ERR(rt);
+	}
 
 	skb_dst_set(skb, &rt->dst);
 
@@ -100,6 +102,7 @@ static int nat_keepalive_send_ipv6(struct sk_buff *skb,
 	sock_net_set(sk, net);
 	dst = ip6_dst_lookup_flow(net, sk, &fl6, NULL);
 	if (IS_ERR(dst)) {
+		kfree_skb(skb);
 		local_unlock_nested_bh(&nat_keepalive_sk_ipv6.bh_lock);
 		return PTR_ERR(dst);
 	}
@@ -118,7 +121,6 @@ static void nat_keepalive_send(struct nat_keepalive *ka)
 					sizeof(struct ipv6hdr)) +
 				    sizeof(struct udphdr);
 	const u8 nat_ka_payload = 0xFF;
-	int err = -EAFNOSUPPORT;
 	struct sk_buff *skb;
 	struct udphdr *uh;
 
@@ -140,16 +142,17 @@ static void nat_keepalive_send(struct nat_keepalive *ka)
 
 	switch (ka->family) {
 	case AF_INET:
-		err = nat_keepalive_send_ipv4(skb, ka);
+		nat_keepalive_send_ipv4(skb, ka);
 		break;
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		err = nat_keepalive_send_ipv6(skb, ka, uh);
+		nat_keepalive_send_ipv6(skb, ka, uh);
 		break;
 #endif
-	}
-	if (err)
+	default:
 		kfree_skb(skb);
+		break;
+	}
 }
 
 struct nat_keepalive_work_ctx {
-- 
2.43.7


^ permalink raw reply related

* [PATCH 00/10] rust: driver: use pointers instead of indices for ID info
From: Gary Guo @ 2026-06-18 17:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
	Miguel Ojeda, Boqun Feng, Björn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, Daniel Almeida,
	Tamir Duberstein, Alexandre Courbot, Onur Özkan,
	FUJITA Tomonori, David Airlie, Simona Vetter, Bjorn Helgaas,
	Krzysztof Wilczyński, Abdiel Janulgue, Robin Murphy,
	Dave Ertman, Ira Weiny, Leon Romanovsky, Len Brown, Igor Korotin,
	Rob Herring, Saravana Kannan, Viresh Kumar, Michal Wilczynski,
	Drew Fustini, Guo Ren, Fu Wei, Uwe Kleine-König
  Cc: driver-core, rust-for-linux, linux-kernel, netdev, nova-gpu,
	dri-devel, linux-pci, linux-acpi, devicetree, linux-pm,
	linux-riscv, linux-pwm, Gary Guo

Most C drivers use a pointer (and cast to kernel_ulong_t) for driver_data
fields in device_id. Rust code currently does not do this, but rather use
indices. These indices then needs to be translated to `&IdInfo` separately
and this is by a side table.

This leads to open-coded ACPI/OF handling in driver.rs, which is not
desirable. Convert the code to use pointers (or rather, static references)
instead.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
Gary Guo (10):
      rust: driver: remove `IdTable::id`
      rust: driver: simplify `IdArray::new_without_index`
      rust: pci: use `Option<&IdInfo>` for device ID info
      rust: net/phy: remove expansion from doc
      rust: driver: centralize device ID handling
      rust: driver: remove `$module_table_name` from `module_device_table`
      rust: driver: store pointers in `DeviceId`
      rust: driver: remove open-coded matching logic
      rust: driver: remove duplicate ID table
      RFC: rust: driver: support map-like syntax for ID table

 drivers/cpufreq/rcpufreq_dt.rs        |   1 -
 drivers/gpu/drm/nova/driver.rs        |   1 -
 drivers/gpu/drm/tyr/driver.rs         |   1 -
 drivers/gpu/nova-core/driver.rs       |   3 +-
 drivers/pwm/pwm_th1520.rs             |   1 -
 rust/kernel/acpi.rs                   |  14 +--
 rust/kernel/auxiliary.rs              |  18 +--
 rust/kernel/device_id.rs              | 205 +++++++++++++++++++---------------
 rust/kernel/driver.rs                 | 114 ++-----------------
 rust/kernel/i2c.rs                    |  26 ++---
 rust/kernel/net/phy.rs                |  66 +----------
 rust/kernel/of.rs                     |  14 +--
 rust/kernel/pci.rs                    |  24 ++--
 rust/kernel/platform.rs               |   5 +-
 rust/kernel/usb.rs                    |  18 +--
 samples/rust/rust_debugfs.rs          |   1 -
 samples/rust/rust_dma.rs              |   3 +-
 samples/rust/rust_driver_auxiliary.rs |   4 +-
 samples/rust/rust_driver_i2c.rs       |   3 -
 samples/rust/rust_driver_pci.rs       |  11 +-
 samples/rust/rust_driver_platform.rs  |   2 -
 samples/rust/rust_driver_usb.rs       |   1 -
 samples/rust/rust_i2c_client.rs       |   2 -
 samples/rust/rust_soc.rs              |   2 -
 24 files changed, 166 insertions(+), 374 deletions(-)
---
base-commit: 4fa3f5fabb30bf00d7475d5a33459ea83d639bf9
change-id: 20260612-id_info-23eca472ccd8

Best regards,
--  
Gary Guo <gary@garyguo.net>


^ permalink raw reply

* [PATCH 02/10] rust: driver: simplify `IdArray::new_without_index`
From: Gary Guo @ 2026-06-18 17:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
	Miguel Ojeda, Boqun Feng, Björn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, Daniel Almeida,
	Tamir Duberstein, Alexandre Courbot, Onur Özkan,
	FUJITA Tomonori, David Airlie, Simona Vetter, Bjorn Helgaas,
	Krzysztof Wilczyński, Abdiel Janulgue, Robin Murphy,
	Dave Ertman, Ira Weiny, Leon Romanovsky, Len Brown, Igor Korotin,
	Rob Herring, Saravana Kannan, Viresh Kumar, Michal Wilczynski,
	Drew Fustini, Guo Ren, Fu Wei, Uwe Kleine-König
  Cc: driver-core, rust-for-linux, linux-kernel, netdev, nova-gpu,
	dri-devel, linux-pci, linux-acpi, devicetree, linux-pm,
	linux-riscv, linux-pwm, Gary Guo
In-Reply-To: <20260618-id_info-v1-0-96af1e559ef9@garyguo.net>

This method can very easily construct the `IdArray` on its own without
delegating to `Self::build`. Doing so also simplifies the phy device table
macro because it does not need to construct tuples anymore.

This also allows simplification of `new` and `build` which removes the
`unsafe`.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
 rust/kernel/device_id.rs | 64 ++++++++++++++++++++----------------------------
 rust/kernel/net/phy.rs   |  2 +-
 2 files changed, 28 insertions(+), 38 deletions(-)

diff --git a/rust/kernel/device_id.rs b/rust/kernel/device_id.rs
index fbf6d8e6afb9..eeef3f5e7b63 100644
--- a/rust/kernel/device_id.rs
+++ b/rust/kernel/device_id.rs
@@ -73,19 +73,11 @@ pub struct IdArray<T: RawDeviceId, U, const N: usize> {
     id_infos: [U; N],
 }
 
-impl<T: RawDeviceId, U, const N: usize> IdArray<T, U, N> {
+impl<T: RawDeviceId + RawDeviceIdIndex, U, const N: usize> IdArray<T, U, N> {
     /// Creates a new instance of the array.
     ///
     /// The contents are derived from the given identifiers and context information.
-    ///
-    /// # Safety
-    ///
-    /// `data_offset` as `None` is always safe.
-    /// If `data_offset` is `Some(data_offset)`, then:
-    /// - `data_offset` must be the correct offset (in bytes) to the context/data field
-    ///   (e.g., the `driver_data` field) within the raw device ID structure.
-    /// - The field at `data_offset` must be correctly sized to hold a `usize`.
-    const unsafe fn build(ids: [(T, U); N], data_offset: Option<usize>) -> Self {
+    pub const fn new(ids: [(T, U); N]) -> Self {
         let mut raw_ids = [const { MaybeUninit::<T::RawType>::uninit() }; N];
         let mut infos = [const { MaybeUninit::uninit() }; N];
 
@@ -94,16 +86,14 @@ impl<T: RawDeviceId, U, const N: usize> IdArray<T, U, N> {
             // SAFETY: by the safety requirement of `RawDeviceId`, we're guaranteed that `T` is
             // layout-wise compatible with `RawType`.
             raw_ids[i] = unsafe { core::mem::transmute_copy(&ids[i].0) };
-            if let Some(data_offset) = data_offset {
-                // SAFETY: by the safety requirement of this function, this would be effectively
-                // `raw_ids[i].driver_data = i;`.
-                unsafe {
-                    raw_ids[i]
-                        .as_mut_ptr()
-                        .byte_add(data_offset)
-                        .cast::<usize>()
-                        .write(i);
-                }
+            // SAFETY: by the safety requirement of `RawDeviceIdIndex`, this would be effectively
+            // `raw_ids[i].driver_data = i;`.
+            unsafe {
+                raw_ids[i]
+                    .as_mut_ptr()
+                    .byte_add(T::DRIVER_DATA_OFFSET)
+                    .cast::<usize>()
+                    .write(i);
             }
 
             // SAFETY: this is effectively a move: `infos[i] = ids[i].1`. We make a copy here but
@@ -127,32 +117,32 @@ impl<T: RawDeviceId, U, const N: usize> IdArray<T, U, N> {
             id_infos: unsafe { core::mem::transmute_copy(&infos) },
         }
     }
+}
 
-    /// Creates a new instance of the array without writing index values.
-    ///
-    /// The contents are derived from the given identifiers and context information.
-    /// If the device implements [`RawDeviceIdIndex`], consider using [`IdArray::new`] instead.
-    pub const fn new_without_index(ids: [(T, U); N]) -> Self {
-        // SAFETY: Calling `Self::build` with `offset = None` is always safe,
-        // because no raw memory writes are performed in this case.
-        unsafe { Self::build(ids, None) }
-    }
-
+impl<T: RawDeviceId, U, const N: usize> IdArray<T, U, N> {
     /// Reference to the contained [`RawIdArray`].
     pub const fn raw_ids(&self) -> &RawIdArray<T, N> {
         &self.raw_ids
     }
 }
 
-impl<T: RawDeviceId + RawDeviceIdIndex, U, const N: usize> IdArray<T, U, N> {
-    /// Creates a new instance of the array.
+impl<T: RawDeviceId, const N: usize> IdArray<T, (), N> {
+    /// Creates a new instance of the array without writing index values.
     ///
     /// The contents are derived from the given identifiers and context information.
-    pub const fn new(ids: [(T, U); N]) -> Self {
-        // SAFETY: by the safety requirement of `RawDeviceIdIndex`,
-        // `T::DRIVER_DATA_OFFSET` is guaranteed to be the correct offset (in bytes) to
-        // a field within `T::RawType`.
-        unsafe { Self::build(ids, Some(T::DRIVER_DATA_OFFSET)) }
+    /// If the device implements [`RawDeviceIdIndex`], consider using [`IdArray::new`] instead.
+    pub const fn new_without_index(ids: [T; N]) -> Self {
+        // SAFETY: `T` is layout-wise compatible with `T::RawType`, so is the array of them.
+        let raw_ids: [T::RawType; N] = unsafe { core::mem::transmute_copy(&ids) };
+        core::mem::forget(ids);
+
+        Self {
+            raw_ids: RawIdArray {
+                ids: raw_ids,
+                sentinel: MaybeUninit::zeroed(),
+            },
+            id_infos: [(); N],
+        }
     }
 }
 
diff --git a/rust/kernel/net/phy.rs b/rust/kernel/net/phy.rs
index 3ca99db5cccf..2868e3a9e02c 100644
--- a/rust/kernel/net/phy.rs
+++ b/rust/kernel/net/phy.rs
@@ -868,7 +868,7 @@ macro_rules! module_phy_driver {
         const N: usize = $crate::module_phy_driver!(@count_devices $($dev),+);
 
         const TABLE: $crate::device_id::IdArray<$crate::net::phy::DeviceId, (), N> =
-            $crate::device_id::IdArray::new_without_index([ $(($dev,())),+, ]);
+            $crate::device_id::IdArray::new_without_index([ $($dev),+, ]);
 
         $crate::module_device_table!("mdio", phydev, TABLE);
     };

-- 
2.54.0


^ permalink raw reply related

* [PATCH 01/10] rust: driver: remove `IdTable::id`
From: Gary Guo @ 2026-06-18 17:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
	Miguel Ojeda, Boqun Feng, Björn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, Daniel Almeida,
	Tamir Duberstein, Alexandre Courbot, Onur Özkan,
	FUJITA Tomonori, David Airlie, Simona Vetter, Bjorn Helgaas,
	Krzysztof Wilczyński, Abdiel Janulgue, Robin Murphy,
	Dave Ertman, Ira Weiny, Leon Romanovsky, Len Brown, Igor Korotin,
	Rob Herring, Saravana Kannan, Viresh Kumar, Michal Wilczynski,
	Drew Fustini, Guo Ren, Fu Wei, Uwe Kleine-König
  Cc: driver-core, rust-for-linux, linux-kernel, netdev, nova-gpu,
	dri-devel, linux-pci, linux-acpi, devicetree, linux-pm,
	linux-riscv, linux-pwm, Gary Guo
In-Reply-To: <20260618-id_info-v1-0-96af1e559ef9@garyguo.net>

This is unused.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
 rust/kernel/device_id.rs | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/rust/kernel/device_id.rs b/rust/kernel/device_id.rs
index 8e9721446014..fbf6d8e6afb9 100644
--- a/rust/kernel/device_id.rs
+++ b/rust/kernel/device_id.rs
@@ -166,9 +166,6 @@ pub trait IdTable<T: RawDeviceId, U> {
     /// Obtain the pointer to the ID table.
     fn as_ptr(&self) -> *const T::RawType;
 
-    /// Obtain the pointer to the bus specific device ID from an index.
-    fn id(&self, index: usize) -> &T::RawType;
-
     /// Obtain the pointer to the driver-specific information from an index.
     fn info(&self, index: usize) -> &U;
 }
@@ -180,10 +177,6 @@ fn as_ptr(&self) -> *const T::RawType {
         core::ptr::from_ref(self).cast()
     }
 
-    fn id(&self, index: usize) -> &T::RawType {
-        &self.raw_ids.ids[index]
-    }
-
     fn info(&self, index: usize) -> &U {
         &self.id_infos[index]
     }

-- 
2.54.0


^ permalink raw reply related

* [PATCH 03/10] rust: pci: use `Option<&IdInfo>` for device ID info
From: Gary Guo @ 2026-06-18 17:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
	Miguel Ojeda, Boqun Feng, Björn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, Daniel Almeida,
	Tamir Duberstein, Alexandre Courbot, Onur Özkan,
	FUJITA Tomonori, David Airlie, Simona Vetter, Bjorn Helgaas,
	Krzysztof Wilczyński, Abdiel Janulgue, Robin Murphy,
	Dave Ertman, Ira Weiny, Leon Romanovsky, Len Brown, Igor Korotin,
	Rob Herring, Saravana Kannan, Viresh Kumar, Michal Wilczynski,
	Drew Fustini, Guo Ren, Fu Wei, Uwe Kleine-König
  Cc: driver-core, rust-for-linux, linux-kernel, netdev, nova-gpu,
	dri-devel, linux-pci, linux-acpi, devicetree, linux-pm,
	linux-riscv, linux-pwm, Gary Guo
In-Reply-To: <20260618-id_info-v1-0-96af1e559ef9@garyguo.net>

It is possible that `pci_device_id_any` will be passed to the driver, e.g.
`driver_override` is used on the device. Therefore, the driver must be able
to handle the case where `driver_data` is 0. Thus, update the `probe`
functions to get `Option`.

The current code cannot tell if the info does not exist or is the first
entry; however this will be achievable once the code is updated to use a
`&'static IdInfo` pointer instead of indices.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
 drivers/gpu/nova-core/driver.rs       | 2 +-
 rust/kernel/pci.rs                    | 6 +++---
 samples/rust/rust_dma.rs              | 2 +-
 samples/rust/rust_driver_auxiliary.rs | 2 +-
 samples/rust/rust_driver_pci.rs       | 3 ++-
 5 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/nova-core/driver.rs b/drivers/gpu/nova-core/driver.rs
index 5738d4ac521b..5a5f0b63e0f3 100644
--- a/drivers/gpu/nova-core/driver.rs
+++ b/drivers/gpu/nova-core/driver.rs
@@ -70,7 +70,7 @@ impl pci::Driver for NovaCoreDriver {
 
     fn probe<'bound>(
         pdev: &'bound pci::Device<Core<'_>>,
-        _info: &'bound Self::IdInfo,
+        _info: Option<&'bound Self::IdInfo>,
     ) -> impl PinInit<Self::Data<'bound>, Error> + 'bound {
         pin_init::pin_init_scope(move || {
             dev_dbg!(pdev, "Probe Nova Core GPU driver.\n");
diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index 5071cae6543f..0e055e4df99e 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -113,7 +113,7 @@ extern "C" fn probe_callback(
         let info = T::ID_TABLE.info(id.index());
 
         from_result(|| {
-            let data = T::probe(pdev, info);
+            let data = T::probe(pdev, Some(info));
 
             pdev.as_ref().set_drvdata(data)?;
             Ok(0)
@@ -284,7 +284,7 @@ macro_rules! pci_device_table {
 ///
 ///     fn probe<'bound>(
 ///         _pdev: &'bound pci::Device<Core<'_>>,
-///         _id_info: &'bound Self::IdInfo,
+///         _id_info: Option<&'bound Self::IdInfo>,
 ///     ) -> impl PinInit<Self::Data<'bound>, Error> + 'bound {
 ///         Err(ENODEV)
 ///     }
@@ -313,7 +313,7 @@ pub trait Driver {
     /// attempt to initialize the device here.
     fn probe<'bound>(
         dev: &'bound Device<device::Core<'_>>,
-        id_info: &'bound Self::IdInfo,
+        id_info: Option<&'bound Self::IdInfo>,
     ) -> impl PinInit<Self::Data<'bound>, Error> + 'bound;
 
     /// PCI driver unbind.
diff --git a/samples/rust/rust_dma.rs b/samples/rust/rust_dma.rs
index 5046b4628d0e..9beb37275e0d 100644
--- a/samples/rust/rust_dma.rs
+++ b/samples/rust/rust_dma.rs
@@ -63,7 +63,7 @@ impl pci::Driver for DmaSampleDriver {
 
     fn probe<'bound>(
         pdev: &'bound pci::Device<Core<'_>>,
-        _info: &'bound Self::IdInfo,
+        _info: Option<&'bound Self::IdInfo>,
     ) -> impl PinInit<Self, Error> + 'bound {
         pin_init::pin_init_scope(move || {
             dev_info!(pdev, "Probe DMA test driver.\n");
diff --git a/samples/rust/rust_driver_auxiliary.rs b/samples/rust/rust_driver_auxiliary.rs
index 2c1351040e45..73c63afc046a 100644
--- a/samples/rust/rust_driver_auxiliary.rs
+++ b/samples/rust/rust_driver_auxiliary.rs
@@ -79,7 +79,7 @@ impl pci::Driver for ParentDriver {
 
     fn probe<'bound>(
         pdev: &'bound pci::Device<Core<'_>>,
-        _info: &'bound Self::IdInfo,
+        _info: Option<&'bound Self::IdInfo>,
     ) -> impl PinInit<Self::Data<'bound>, Error> + 'bound {
         Ok(ParentData {
             // SAFETY: `ParentData` is the driver's private data, which is dropped when the
diff --git a/samples/rust/rust_driver_pci.rs b/samples/rust/rust_driver_pci.rs
index 1aa8197d8698..5547dd704a1b 100644
--- a/samples/rust/rust_driver_pci.rs
+++ b/samples/rust/rust_driver_pci.rs
@@ -144,7 +144,7 @@ impl pci::Driver for SampleDriver {
 
     fn probe<'bound>(
         pdev: &'bound pci::Device<Core<'_>>,
-        info: &'bound Self::IdInfo,
+        info: Option<&'bound Self::IdInfo>,
     ) -> impl PinInit<Self::Data<'bound>, Error> + 'bound {
         let vendor = pdev.vendor_id();
         dev_dbg!(
@@ -153,6 +153,7 @@ fn probe<'bound>(
             vendor,
             pdev.device_id()
         );
+        let info = info.ok_or(ENODEV)?;
 
         pdev.enable_device_mem()?;
         pdev.set_master();

-- 
2.54.0


^ permalink raw reply related

* [PATCH 04/10] rust: net/phy: remove expansion from doc
From: Gary Guo @ 2026-06-18 17:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
	Miguel Ojeda, Boqun Feng, Björn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, Daniel Almeida,
	Tamir Duberstein, Alexandre Courbot, Onur Özkan,
	FUJITA Tomonori, David Airlie, Simona Vetter, Bjorn Helgaas,
	Krzysztof Wilczyński, Abdiel Janulgue, Robin Murphy,
	Dave Ertman, Ira Weiny, Leon Romanovsky, Len Brown, Igor Korotin,
	Rob Herring, Saravana Kannan, Viresh Kumar, Michal Wilczynski,
	Drew Fustini, Guo Ren, Fu Wei, Uwe Kleine-König
  Cc: driver-core, rust-for-linux, linux-kernel, netdev, nova-gpu,
	dri-devel, linux-pci, linux-acpi, devicetree, linux-pm,
	linux-riscv, linux-pwm, Gary Guo
In-Reply-To: <20260618-id_info-v1-0-96af1e559ef9@garyguo.net>

The expansion serves little purpose and it can easily diverge.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
 rust/kernel/net/phy.rs | 56 --------------------------------------------------
 1 file changed, 56 deletions(-)

diff --git a/rust/kernel/net/phy.rs b/rust/kernel/net/phy.rs
index 2868e3a9e02c..965ecca7d55f 100644
--- a/rust/kernel/net/phy.rs
+++ b/rust/kernel/net/phy.rs
@@ -800,62 +800,6 @@ const fn as_int(&self) -> u32 {
 /// }
 /// # }
 /// ```
-///
-/// This expands to the following code:
-///
-/// ```ignore
-/// use kernel::net::phy::{self, DeviceId};
-/// use kernel::prelude::*;
-///
-/// struct Module {
-///     _reg: ::kernel::net::phy::Registration,
-/// }
-///
-/// module! {
-///     type: Module,
-///     name: "rust_sample_phy",
-///     authors: ["Rust for Linux Contributors"],
-///     description: "Rust sample PHYs driver",
-///     license: "GPL",
-/// }
-///
-/// struct PhySample;
-///
-/// #[vtable]
-/// impl phy::Driver for PhySample {
-///     const NAME: &'static CStr = c"PhySample";
-///     const PHY_DEVICE_ID: phy::DeviceId = phy::DeviceId::new_with_exact_mask(0x00000001);
-/// }
-///
-/// const _: () = {
-///     static mut DRIVERS: [::kernel::net::phy::DriverVTable; 1] =
-///         [::kernel::net::phy::create_phy_driver::<PhySample>()];
-///
-///     impl ::kernel::Module for Module {
-///         fn init(module: &'static ::kernel::ThisModule) -> Result<Self> {
-///             let drivers = unsafe { &mut DRIVERS };
-///             let mut reg = ::kernel::net::phy::Registration::register(
-///                 module,
-///                 ::core::pin::Pin::static_mut(drivers),
-///             )?;
-///             Ok(Module { _reg: reg })
-///         }
-///     }
-/// };
-///
-/// const N: usize = 1;
-///
-/// const TABLE: ::kernel::device_id::IdArray<::kernel::net::phy::DeviceId, (), N> =
-///     ::kernel::device_id::IdArray::new_without_index([
-///         ::kernel::net::phy::DeviceId(
-///             ::kernel::bindings::mdio_device_id {
-///                 phy_id: 0x00000001,
-///                 phy_id_mask: 0xffffffff,
-///             }),
-///     ]);
-///
-/// ::kernel::module_device_table!("mdio", phydev, TABLE);
-/// ```
 #[macro_export]
 macro_rules! module_phy_driver {
     (@replace_expr $_t:tt $sub:expr) => {$sub};

-- 
2.54.0


^ permalink raw reply related

* [PATCH 05/10] rust: driver: centralize device ID handling
From: Gary Guo @ 2026-06-18 17:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
	Miguel Ojeda, Boqun Feng, Björn Roy Baron, Benno Lossin,
	Andreas Hindborg, Alice Ryhl, Trevor Gross, Daniel Almeida,
	Tamir Duberstein, Alexandre Courbot, Onur Özkan,
	FUJITA Tomonori, David Airlie, Simona Vetter, Bjorn Helgaas,
	Krzysztof Wilczyński, Abdiel Janulgue, Robin Murphy,
	Dave Ertman, Ira Weiny, Leon Romanovsky, Len Brown, Igor Korotin,
	Rob Herring, Saravana Kannan, Viresh Kumar, Michal Wilczynski,
	Drew Fustini, Guo Ren, Fu Wei, Uwe Kleine-König
  Cc: driver-core, rust-for-linux, linux-kernel, netdev, nova-gpu,
	dri-devel, linux-pci, linux-acpi, devicetree, linux-pm,
	linux-riscv, linux-pwm, Gary Guo
In-Reply-To: <20260618-id_info-v1-0-96af1e559ef9@garyguo.net>

Move the `IdArray` creation from individual buses to be handled by shared
code in `device_id.rs`.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
 rust/kernel/acpi.rs      | 10 ++--------
 rust/kernel/auxiliary.rs | 10 ++--------
 rust/kernel/device_id.rs | 31 ++++++++++++++++++++++++++++++-
 rust/kernel/i2c.rs       | 10 ++--------
 rust/kernel/net/phy.rs   | 10 ++++------
 rust/kernel/of.rs        | 10 ++--------
 rust/kernel/pci.rs       | 10 ++--------
 rust/kernel/usb.rs       | 10 ++--------
 8 files changed, 46 insertions(+), 55 deletions(-)

diff --git a/rust/kernel/acpi.rs b/rust/kernel/acpi.rs
index 9b8efa623130..315f2f2af446 100644
--- a/rust/kernel/acpi.rs
+++ b/rust/kernel/acpi.rs
@@ -53,13 +53,7 @@ pub const fn new(id: &'static CStr) -> Self {
 /// Create an ACPI `IdTable` with an "alias" for modpost.
 #[macro_export]
 macro_rules! acpi_device_table {
-    ($table_name:ident, $module_table_name:ident, $id_info_type: ty, $table_data: expr) => {
-        const $table_name: $crate::device_id::IdArray<
-            $crate::acpi::DeviceId,
-            $id_info_type,
-            { $table_data.len() },
-        > = $crate::device_id::IdArray::new($table_data);
-
-        $crate::module_device_table!("acpi", $module_table_name, $table_name);
+    ($($tt:tt)*) => {
+        $crate::module_device_table!("acpi", $crate::acpi::DeviceId, $($tt)*);
     };
 }
diff --git a/rust/kernel/auxiliary.rs b/rust/kernel/auxiliary.rs
index c42928d5a239..59787c9bff26 100644
--- a/rust/kernel/auxiliary.rs
+++ b/rust/kernel/auxiliary.rs
@@ -181,14 +181,8 @@ fn index(&self) -> usize {
 /// Create a auxiliary `IdTable` with its alias for modpost.
 #[macro_export]
 macro_rules! auxiliary_device_table {
-    ($table_name:ident, $module_table_name:ident, $id_info_type: ty, $table_data: expr) => {
-        const $table_name: $crate::device_id::IdArray<
-            $crate::auxiliary::DeviceId,
-            $id_info_type,
-            { $table_data.len() },
-        > = $crate::device_id::IdArray::new($table_data);
-
-        $crate::module_device_table!("auxiliary", $module_table_name, $table_name);
+    ($($tt:tt)*) => {
+        $crate::module_device_table!("auxiliary", $crate::auxiliary::DeviceId, $($tt)*);
     };
 }
 
diff --git a/rust/kernel/device_id.rs b/rust/kernel/device_id.rs
index eeef3f5e7b63..0239f89d5f69 100644
--- a/rust/kernel/device_id.rs
+++ b/rust/kernel/device_id.rs
@@ -175,7 +175,36 @@ fn info(&self, index: usize) -> &U {
 /// Create device table alias for modpost.
 #[macro_export]
 macro_rules! module_device_table {
-    ($table_type: literal, $module_table_name:ident, $table_name:ident) => {
+    (
+        $table_type: literal, $device_id_ty: ty,
+        $table_name: ident, $module_table_name: ident, $id_info_type: ty,
+        [$(($id: expr, $info:expr $(,)?)),* $(,)?]
+    ) => {
+        const $table_name: $crate::device_id::IdArray<
+            $device_id_ty,
+            $id_info_type,
+            { <[$device_id_ty]>::len(&[$($id,)*]) },
+        > = $crate::device_id::IdArray::new([$(($id, $info),)*]);
+
+        $crate::module_device_table!($table_type, $module_table_name, $table_name);
+    };
+
+    // Case for no ID info.
+    (
+        $table_type: literal, $device_id_ty: ty,
+        $table_name: ident, $module_table_name: ident, @none,
+        [$($id: expr),* $(,)?]
+    ) => {
+        const $table_name: $crate::device_id::IdArray<
+            $device_id_ty,
+            (),
+            { <[$device_id_ty]>::len(&[$($id,)*]) },
+        > = $crate::device_id::IdArray::new_without_index([$($id),*]);
+
+        $crate::module_device_table!($table_type, $module_table_name, $table_name);
+    };
+
+    ($table_type: literal, $module_table_name: ident, $table_name:ident) => {
         #[rustfmt::skip]
         #[export_name =
             concat!("__mod_device_table__", line!(),
diff --git a/rust/kernel/i2c.rs b/rust/kernel/i2c.rs
index 7655d61daac8..a7d9b88ae616 100644
--- a/rust/kernel/i2c.rs
+++ b/rust/kernel/i2c.rs
@@ -77,14 +77,8 @@ fn index(&self) -> usize {
 /// Create a I2C `IdTable` with its alias for modpost.
 #[macro_export]
 macro_rules! i2c_device_table {
-    ($table_name:ident, $module_table_name:ident, $id_info_type: ty, $table_data: expr) => {
-        const $table_name: $crate::device_id::IdArray<
-            $crate::i2c::DeviceId,
-            $id_info_type,
-            { $table_data.len() },
-        > = $crate::device_id::IdArray::new($table_data);
-
-        $crate::module_device_table!("i2c", $module_table_name, $table_name);
+    ($($tt:tt)*) => {
+        $crate::module_device_table!("i2c", $crate::i2c::DeviceId, $($tt)*);
     };
 }
 
diff --git a/rust/kernel/net/phy.rs b/rust/kernel/net/phy.rs
index 965ecca7d55f..166572861e61 100644
--- a/rust/kernel/net/phy.rs
+++ b/rust/kernel/net/phy.rs
@@ -809,12 +809,10 @@ macro_rules! module_phy_driver {
     };
 
     (@device_table [$($dev:expr),+]) => {
-        const N: usize = $crate::module_phy_driver!(@count_devices $($dev),+);
-
-        const TABLE: $crate::device_id::IdArray<$crate::net::phy::DeviceId, (), N> =
-            $crate::device_id::IdArray::new_without_index([ $($dev),+, ]);
-
-        $crate::module_device_table!("mdio", phydev, TABLE);
+        $crate::module_device_table!(
+            "mdio", $crate::net::phy::DeviceId,
+            phydev, TABLE, @none, [$($dev),+]
+        );
     };
 
     (drivers: [$($driver:ident),+ $(,)?], device_table: [$($dev:expr),+ $(,)?], $($f:tt)*) => {
diff --git a/rust/kernel/of.rs b/rust/kernel/of.rs
index 58b20c367f99..35aa6d36d309 100644
--- a/rust/kernel/of.rs
+++ b/rust/kernel/of.rs
@@ -53,13 +53,7 @@ pub const fn new(compatible: &'static CStr) -> Self {
 /// Create an OF `IdTable` with an "alias" for modpost.
 #[macro_export]
 macro_rules! of_device_table {
-    ($table_name:ident, $module_table_name:ident, $id_info_type: ty, $table_data: expr) => {
-        const $table_name: $crate::device_id::IdArray<
-            $crate::of::DeviceId,
-            $id_info_type,
-            { $table_data.len() },
-        > = $crate::device_id::IdArray::new($table_data);
-
-        $crate::module_device_table!("of", $module_table_name, $table_name);
+    ($($tt:tt)*) => {
+        $crate::module_device_table!("of", $crate::of::DeviceId, $($tt)*);
     };
 }
diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index 0e055e4df99e..34e07a53244d 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -245,14 +245,8 @@ fn index(&self) -> usize {
 /// Create a PCI `IdTable` with its alias for modpost.
 #[macro_export]
 macro_rules! pci_device_table {
-    ($table_name:ident, $module_table_name:ident, $id_info_type: ty, $table_data: expr) => {
-        const $table_name: $crate::device_id::IdArray<
-            $crate::pci::DeviceId,
-            $id_info_type,
-            { $table_data.len() },
-        > = $crate::device_id::IdArray::new($table_data);
-
-        $crate::module_device_table!("pci", $module_table_name, $table_name);
+    ($($tt:tt)*) => {
+        $crate::module_device_table!("pci", $crate::pci::DeviceId, $($tt)*);
     };
 }
 
diff --git a/rust/kernel/usb.rs b/rust/kernel/usb.rs
index 7aff0c82d0af..154919ee1e19 100644
--- a/rust/kernel/usb.rs
+++ b/rust/kernel/usb.rs
@@ -254,14 +254,8 @@ fn index(&self) -> usize {
 /// Create a USB `IdTable` with its alias for modpost.
 #[macro_export]
 macro_rules! usb_device_table {
-    ($table_name:ident, $module_table_name:ident, $id_info_type: ty, $table_data: expr) => {
-        const $table_name: $crate::device_id::IdArray<
-            $crate::usb::DeviceId,
-            $id_info_type,
-            { $table_data.len() },
-        > = $crate::device_id::IdArray::new($table_data);
-
-        $crate::module_device_table!("usb", $module_table_name, $table_name);
+    ($($tt:tt)*) => {
+        $crate::module_device_table!("usb", $crate::usb::DeviceId, $($tt)*);
     };
 }
 

-- 
2.54.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox