Netdev List


* Re: [PATCH] net: meth: check skb allocation in meth_init_rx_ring()
From: Andrew Lunn @ 2026-06-22  8:01 UTC (permalink / raw)
  To: Pavan Chebbi
  Cc: Haoxiang Li, andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel, stable
In-Reply-To: <CALs4sv2dr2QsFU_DUDNAMgr4MDxHcRrHqer+Kdm7dP+4TUT0eg@mail.gmail.com>

On Mon, Jun 22, 2026 at 11:27:41AM +0530, Pavan Chebbi wrote:
> On Mon, Jun 22, 2026 at 10:20 AM Haoxiang Li <haoxiang_li2024@163.com> wrote:
> >
> > meth_init_rx_ring() does not check the return value of alloc_skb().
> > If the allocation fails, the NULL skb is passed to skb_reserve() and
> > then dereferenced through skb->head.
> >
> > Add check for alloc_skb() to prevent potential null pointer dereference.
> >
> > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
> > ---
> >  drivers/net/ethernet/sgi/meth.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/sgi/meth.c b/drivers/net/ethernet/sgi/meth.c
> > index f7c3a5a766b7..ceff3cc937ad 100644
> > --- a/drivers/net/ethernet/sgi/meth.c
> > +++ b/drivers/net/ethernet/sgi/meth.c
> > @@ -228,6 +228,9 @@ static int meth_init_rx_ring(struct meth_private *priv)
> >
> >         for (i = 0; i < RX_RING_ENTRIES; i++) {
> >                 priv->rx_skbs[i] = alloc_skb(METH_RX_BUFF_SIZE, 0);
> > +               if (!priv->rx_skbs[i])
> > +                       return -ENOMEM;
> > +
> 
> I think the fix is not complete. The caller meth_open() will not free
> any successfully allocated skbs if the function ever returns -ENOMEM.
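
A minimal userspace sketch of the unwind being asked for here. It assumes nothing about the real driver beyond what is quoted above: fake_alloc(), fake_free(), rx_ring_init() and RING_ENTRIES are hypothetical stand-ins for alloc_skb(), the skb free routine, meth_init_rx_ring() and RX_RING_ENTRIES, and malloc() replaces the skb allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define RING_ENTRIES 16	/* hypothetical stand-in for RX_RING_ENTRIES */

/* Test hook: the allocator fails once this many calls have succeeded.
 * A negative value means "never fail".
 */
static int fail_after = -1;
static int live_buffers;	/* buffers currently allocated */

/* Hypothetical stand-in for alloc_skb(): may return NULL. */
static void *fake_alloc(size_t size)
{
	void *p;

	if (fail_after == 0)
		return NULL;
	if (fail_after > 0)
		fail_after--;
	p = malloc(size);
	if (p)
		live_buffers++;
	return p;
}

static void fake_free(void *p)
{
	if (p) {
		free(p);
		live_buffers--;
	}
}

/* Fill the ring. On failure, free every entry allocated so far so the
 * caller is not left with a half-initialised ring.
 */
static int rx_ring_init(void *ring[RING_ENTRIES])
{
	int i;

	for (i = 0; i < RING_ENTRIES; i++) {
		ring[i] = fake_alloc(64);
		if (!ring[i])
			goto err_unwind;
	}
	return 0;

err_unwind:
	while (--i >= 0) {
		fake_free(ring[i]);
		ring[i] = NULL;
	}
	return -ENOMEM;
}
```

Whether the unwind belongs in the init function or in the caller is a design choice; doing it in the init function keeps the error path next to the allocation it undoes.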

There is also the question, does anybody care? Are SGI machines still
used? This is a Fast Ethernet driver, written in 2003. It has no
Maintainer. Maybe it would be better to just remove the driver?

At least drop the Fixes: tag, it does not fit the Stable rules.

https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html

    Andrew

---
pw-bot: cr


* [PATCH net] net: usb: kalmia: bound RX frame length in kalmia_rx_fixup()
From: Maoyi Xie @ 2026-06-22  8:01 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-usb, netdev, linux-kernel, stable

kalmia_rx_fixup() computes usb_packet_length = skb->len - (2 *
KALMIA_HEADER_LENGTH) as a u16, guarded only by a pre-loop check that
skb->len is at least KALMIA_HEADER_LENGTH, which is 6. A device can
deliver a short bulk-IN frame with skb->len in the 6 to 11 range, or
leave a short trailing remainder on a later loop iteration. Either case
underflows usb_packet_length to about 65530.

That bypasses the usb_packet_length < ether_packet_length truncation path.
The device-supplied ether_packet_length, a le16 up to 65535 read from
header_start[2], then drives a memcmp() and the following skb_trim() and
skb_pull() past the end of the rx buffer. The rx buffer is hard_mtu * 10,
which is 14000 bytes. That is an out of bounds read.

Require both the start and end framing headers to be present before
subtracting them, on every loop iteration.
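
The arithmetic above can be checked with a small userspace sketch. HDR_LEN mirrors KALMIA_HEADER_LENGTH (6, as stated above); the two helper names are hypothetical and exist only for illustration, they are not driver functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HDR_LEN 6u	/* KALMIA_HEADER_LENGTH per the description above */

/* Unchecked subtraction, truncated to 16 bits the way a u16 variable
 * stores it. Wraps around whenever len < 2 * HDR_LEN.
 */
static uint16_t payload_len_unchecked(uint32_t len)
{
	return (uint16_t)(len - 2u * HDR_LEN);
}

/* The guard: both framing headers must fit in the remaining data. */
static bool has_both_headers(uint32_t len)
{
	return len >= 2u * HDR_LEN;
}
```

For len = 6 the unchecked result is 65530 and for len = 11 it is 65535, which is why the truncation comparison against ether_packet_length never triggers for short frames.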

Fixes: d40261236e8e ("net/usb: Add Samsung Kalmia driver for Samsung GT-B3730")
Cc: stable@vger.kernel.org
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
---
I asked about this on linux-usb on 2026-06-15 and got no reply, so I
am sending the fix.

 drivers/net/usb/kalmia.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/usb/kalmia.c b/drivers/net/usb/kalmia.c
index ee9c48f7f68f..0dd0a30c3db4 100644
--- a/drivers/net/usb/kalmia.c
+++ b/drivers/net/usb/kalmia.c
@@ -276,6 +276,14 @@ kalmia_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 				"Received header: %6phC. Package length: %i\n",
 				header_start, skb->len - KALMIA_HEADER_LENGTH);

+		/* both framing headers must be present before we subtract
+		 * them, otherwise usb_packet_length underflows and the
+		 * device-supplied ether_packet_length drives an out of bounds
+		 * access below
+		 */
+		if (skb->len < 2 * KALMIA_HEADER_LENGTH)
+			return 0;
+
 		/* subtract start header and end header */
 		usb_packet_length = skb->len - (2 * KALMIA_HEADER_LENGTH);
 		ether_packet_length = get_unaligned_le16(&header_start[2]);
-- 
2.34.1


* Re: [PATCH net] net: ti: icssg-prueth: fix XDP_TX from the AF_XDP zero-copy RX path
From: Meghana Malladi @ 2026-06-22  8:05 UTC (permalink / raw)
  To: David Carlier, danishanwar, rogerq, andrew+netdev, netdev
  Cc: davem, edumazet, kuba, pabeni, horms, hawk, john.fastabend, sdf,
	ast, daniel, bpf, linux-arm-kernel, linux-kernel, stable
In-Reply-To: <20260620213756.87499-1-devnexen@gmail.com>

Hi David,

Thanks for the fix.

On 6/21/26 03:07, David Carlier wrote:
> On XDP_TX from the zero-copy RX path, emac_run_xdp() converts the xsk
> buffer via xdp_convert_zc_to_xdp_frame(), which clones the data into a
> fresh MEM_TYPE_PAGE_ORDER0 page that is not DMA mapped. Transmitting it
> as PRUETH_TX_BUFF_TYPE_XDP_TX derives the DMA address with
> page_pool_get_dma_addr(), reading an uninitialized page->dma_addr, so
> the device DMAs from a bogus address (corrupt TX, or an IOMMU fault).
> 
> Pick the TX buffer type from the frame's memory type: keep
> PRUETH_TX_BUFF_TYPE_XDP_TX for page_pool frames and use
> PRUETH_TX_BUFF_TYPE_XDP_NDO for the cloned zero-copy frame. The
> completion path already unmaps PRUETH_SWDATA_XDPF buffers.
> 
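
The selection rule described in the quoted text can be sketched in isolation as follows. The enums are hypothetical mirrors of the kernel names mentioned above with illustrative values, not the kernel definitions.

```c
#include <assert.h>

/* Hypothetical mirrors of the memory types and TX buffer types named
 * in the commit message; the numeric values are illustrative only.
 */
enum mem_type { MEM_PAGE_ORDER0, MEM_PAGE_POOL };
enum tx_buff_type { TX_BUFF_XDP_TX, TX_BUFF_XDP_NDO };

/* Only a page_pool-backed frame already carries a DMA address, so only
 * that case may use the XDP_TX buffer type. Everything else is treated
 * like an NDO frame, which is mapped at transmit time.
 */
static enum tx_buff_type pick_tx_buff_type(enum mem_type mt)
{
	return mt == MEM_PAGE_POOL ? TX_BUFF_XDP_TX : TX_BUFF_XDP_NDO;
}
```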

Is it safe to unconditionally unmap the buffer when the TX buffer type
is PRUETH_TX_BUFF_TYPE_XDP_TX? In that case the DMA mapping is done with
rx_chn->dma_dev, whereas the completion path unmaps with
tx_chn->dma_dev unconditionally.

> Fixes: 7a64bb388df3 ("net: ti: icssg-prueth: Add AF_XDP zero copy for RX")
> Cc: stable@vger.kernel.org
> Signed-off-by: David Carlier <devnexen@gmail.com>
> ---
>   drivers/net/ethernet/ti/icssg/icssg_common.c | 13 ++++++++++++-
>   1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ti/icssg/icssg_common.c b/drivers/net/ethernet/ti/icssg/icssg_common.c
> index 82ddef9c17d5..302e700ea17d 100644
> --- a/drivers/net/ethernet/ti/icssg/icssg_common.c
> +++ b/drivers/net/ethernet/ti/icssg/icssg_common.c
> @@ -804,6 +804,7 @@ EXPORT_SYMBOL_GPL(emac_xmit_xdp_frame);
>    */
>   static u32 emac_run_xdp(struct prueth_emac *emac, struct xdp_buff *xdp, u32 *len)
>   {
> +	enum prueth_tx_buff_type tx_buff_type;
>   	struct net_device *ndev = emac->ndev;
>   	struct netdev_queue *netif_txq;
>   	int cpu = smp_processor_id();
> @@ -826,11 +827,21 @@ static u32 emac_run_xdp(struct prueth_emac *emac, struct xdp_buff *xdp, u32 *len
>   			goto drop;
>   		}
>   
> +		/* In AF_XDP zero-copy mode xdp_convert_buff_to_frame()
> +		 * clones the xsk buffer into a fresh MEM_TYPE_PAGE_ORDER0
> +		 * page that is not DMA mapped. Such a frame must be mapped
> +		 * via the NDO path; only a page pool-backed frame already
> +		 * carries a usable page_pool DMA address.
> +		 */
> +		tx_buff_type = xdpf->mem_type == MEM_TYPE_PAGE_POOL ?
> +				PRUETH_TX_BUFF_TYPE_XDP_TX :
> +				PRUETH_TX_BUFF_TYPE_XDP_NDO;
> +
>   		q_idx = cpu % emac->tx_ch_num;
>   		netif_txq = netdev_get_tx_queue(ndev, q_idx);
>   		__netif_tx_lock(netif_txq, cpu);
>   		result = emac_xmit_xdp_frame(emac, xdpf, q_idx,
> -					     PRUETH_TX_BUFF_TYPE_XDP_TX);
> +					     tx_buff_type);
>   		__netif_tx_unlock(netif_txq);
>   		if (result == ICSSG_XDP_CONSUMED) {
>   			ndev->stats.tx_dropped++;


* Re: [Intel-wired-lan] [PATCH iwl-next v5 2/4] igc: move autoneg-enabled settings into igc_handle_autoneg_enabled()
From: Kadosh, MoriyaX @ 2026-06-22  8:08 UTC (permalink / raw)
  To: Ruinskiy, Dima, KhaiWenTan, anthony.l.nguyen, przemyslaw.kitszel,
	andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: intel-wired-lan, netdev, linux-kernel, faizal.abdul.rahim,
	hong.aun.looi, hector.blanco.alcaine, khai.wen.tan, Faizal Rahim,
	Aleksandr Loktionov
In-Reply-To: <4d8d9eaa-d9bb-4589-a37d-31d0da584335@intel.com>



On 14/06/2026 10:17, Ruinskiy, Dima wrote:
> On 08/05/2026 0:47, KhaiWenTan wrote:
>> From: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
>>
>> Move the advertised link modes and flow control configuration from
>> igc_ethtool_set_link_ksettings() into igc_handle_autoneg_enabled().
>>
>> No functional change.
>>
>> Reviewed-by: Looi Hong Aun <hong.aun.looi@intel.com>
>> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
>> Signed-off-by: Khai Wen Tan <khai.wen.tan@linux.intel.com>
>> ---
>>   drivers/net/ethernet/intel/igc/igc_ethtool.c | 72 ++++++++++++--------
>>   1 file changed, 44 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/igc/igc_ethtool.c b/drivers/net/ethernet/intel/igc/igc_ethtool.c
>> index 0122009bedd0..cfcbf2fdad6e 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_ethtool.c
>> +++ b/drivers/net/ethernet/intel/igc/igc_ethtool.c
>> @@ -2000,6 +2000,49 @@ static int igc_ethtool_get_link_ksettings(struct net_device *netdev,
>>       return 0;
>>   }
>> +/**
>> + * igc_handle_autoneg_enabled - Configure autonegotiation advertisement
>> + * @adapter: private driver structure
>> + * @cmd: ethtool link ksettings from user
>> + *
>> + * Records advertised speeds and flow control settings when autoneg
>> + * is enabled.
>> + */
>> +static void igc_handle_autoneg_enabled(struct igc_adapter *adapter,
>> +                       const struct ethtool_link_ksettings *cmd)
>> +{
>> +    struct igc_hw *hw = &adapter->hw;
>> +    u16 advertised = 0;
>> +
>> +    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> +                          2500baseT_Full))
>> +        advertised |= ADVERTISE_2500_FULL;
>> +
>> +    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> +                          1000baseT_Full))
>> +        advertised |= ADVERTISE_1000_FULL;
>> +
>> +    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> +                          100baseT_Full))
>> +        advertised |= ADVERTISE_100_FULL;
>> +
>> +    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> +                          100baseT_Half))
>> +        advertised |= ADVERTISE_100_HALF;
>> +
>> +    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> +                          10baseT_Full))
>> +        advertised |= ADVERTISE_10_FULL;
>> +
>> +    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> +                          10baseT_Half))
>> +        advertised |= ADVERTISE_10_HALF;
>> +
>> +    hw->phy.autoneg_advertised = advertised;
>> +    if (adapter->fc_autoneg)
>> +        hw->fc.requested_mode = igc_fc_default;
>> +}
>> +
>>   static int
>>   igc_ethtool_set_link_ksettings(struct net_device *netdev,
>>                      const struct ethtool_link_ksettings *cmd)
>> @@ -2007,7 +2050,6 @@ igc_ethtool_set_link_ksettings(struct net_device *netdev,
>>       struct igc_adapter *adapter = netdev_priv(netdev);
>>       struct net_device *dev = adapter->netdev;
>>       struct igc_hw *hw = &adapter->hw;
>> -    u16 advertised = 0;
>>       /* When adapter in resetting mode, autoneg/speed/duplex
>>        * cannot be changed
>> @@ -2032,34 +2074,8 @@ igc_ethtool_set_link_ksettings(struct net_device *netdev,
>>       while (test_and_set_bit(__IGC_RESETTING, &adapter->state))
>>           usleep_range(1000, 2000);
>> -    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> -                          2500baseT_Full))
>> -        advertised |= ADVERTISE_2500_FULL;
>> -
>> -    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> -                          1000baseT_Full))
>> -        advertised |= ADVERTISE_1000_FULL;
>> -
>> -    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> -                          100baseT_Full))
>> -        advertised |= ADVERTISE_100_FULL;
>> -
>> -    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> -                          100baseT_Half))
>> -        advertised |= ADVERTISE_100_HALF;
>> -
>> -    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> -                          10baseT_Full))
>> -        advertised |= ADVERTISE_10_FULL;
>> -
>> -    if (ethtool_link_ksettings_test_link_mode(cmd, advertising,
>> -                          10baseT_Half))
>> -        advertised |= ADVERTISE_10_HALF;
>> -
>>       if (cmd->base.autoneg == AUTONEG_ENABLE) {
>> -        hw->phy.autoneg_advertised = advertised;
>> -        if (adapter->fc_autoneg)
>> -            hw->fc.requested_mode = igc_fc_default;
>> +        igc_handle_autoneg_enabled(adapter, cmd);
>>       } else {
>>           netdev_info(dev, "Force mode currently not supported\n");
>>       }
> Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Tested-by: Moriya Kadosh <moriyax.kadosh@intel.com>
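
For readers skimming the quoted patch, the helper's core mapping can also be expressed as a table. This is a userspace sketch only: the link_mode enum, the ADV_* flags and their values, and advertised_from_modes() are hypothetical illustrations, not the igc or ethtool definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical link-mode bit positions, for illustration only. */
enum link_mode {
	LM_10_HALF, LM_10_FULL, LM_100_HALF,
	LM_100_FULL, LM_1000_FULL, LM_2500_FULL,
};

/* Hypothetical advertise flags, for illustration only. */
#define ADV_10_HALF	0x0001
#define ADV_10_FULL	0x0002
#define ADV_100_HALF	0x0004
#define ADV_100_FULL	0x0008
#define ADV_1000_FULL	0x0020
#define ADV_2500_FULL	0x0080

static const struct {
	enum link_mode mode;
	uint16_t adv;
} adv_map[] = {
	{ LM_10_HALF,   ADV_10_HALF },
	{ LM_10_FULL,   ADV_10_FULL },
	{ LM_100_HALF,  ADV_100_HALF },
	{ LM_100_FULL,  ADV_100_FULL },
	{ LM_1000_FULL, ADV_1000_FULL },
	{ LM_2500_FULL, ADV_2500_FULL },
};

/* Translate a bitmap of requested link modes into advertise flags. */
static uint16_t advertised_from_modes(uint32_t modes)
{
	uint16_t adv = 0;
	size_t i;

	for (i = 0; i < sizeof(adv_map) / sizeof(adv_map[0]); i++)
		if (modes & (1u << adv_map[i].mode))
			adv |= adv_map[i].adv;
	return adv;
}
```

The patch itself keeps the explicit if-chain, which matches the existing driver style; the table form is just a compact way to read the same mapping.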


* Re: [PATCH net] net: usb: kalmia: bound RX frame length in kalmia_rx_fixup()
From: Andrew Lunn @ 2026-06-22  8:09 UTC (permalink / raw)
  To: Maoyi Xie
  Cc: Oliver Neukum, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, linux-usb, netdev, linux-kernel,
	stable
In-Reply-To: <178211531778.2216480.12637613349790980750@maoyixie.com>

On Mon, Jun 22, 2026 at 04:01:57PM +0800, Maoyi Xie wrote:
> kalmia_rx_fixup() computes usb_packet_length = skb->len - (2 *
> KALMIA_HEADER_LENGTH) as a u16, guarded only by a pre-loop check that
> skb->len is at least KALMIA_HEADER_LENGTH, which is 6. A device can
> deliver a short bulk-IN frame with skb->len in the 6 to 11 range, or
> leave a short trailing remainder on a later loop iteration. Either case
> underflows usb_packet_length to about 65530.
> 
> That bypasses the usb_packet_length < ether_packet_length truncation path.
> The device-supplied ether_packet_length, a le16 up to 65535 read from
> header_start[2], then drives a memcmp() and the following skb_trim() and
> skb_pull() past the end of the rx buffer. The rx buffer is hard_mtu * 10,
> which is 14000 bytes. That is an out of bounds read.
> 
> Require both the start and end framing headers to be present before
> subtracting them, on every loop iteration.
> 
> Fixes: d40261236e8e ("net/usb: Add Samsung Kalmia driver for Samsung GT-B3730")
> Cc: stable@vger.kernel.org
> Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew


* Re: [Intel-wired-lan] [PATCH iwl-next v5 1/4] igc: remove unused autoneg_failed field
From: Kadosh, MoriyaX @ 2026-06-22  8:10 UTC (permalink / raw)
  To: Ruinskiy, Dima, KhaiWenTan, anthony.l.nguyen, przemyslaw.kitszel,
	andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: intel-wired-lan, netdev, linux-kernel, faizal.abdul.rahim,
	hong.aun.looi, hector.blanco.alcaine, khai.wen.tan, Faizal Rahim,
	Aleksandr Loktionov
In-Reply-To: <7d4b2a62-231a-4f61-8561-5c26d6ed3125@intel.com>



On 14/06/2026 10:16, Ruinskiy, Dima wrote:
> On 08/05/2026 0:47, KhaiWenTan wrote:
>> From: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
>>
>> autoneg_failed in struct igc_mac_info is never set in the igc driver.
>> Remove the field and the dead code checking it in
>> igc_config_fc_after_link_up().
>>
>> The field originates from the e1000/e1000e fiber/serdes forced-link
>> path, where MAC-level autoneg timeout sets it to signal the flow-control
>> code to force pause. igc supports only copper, so it never needs to set
>> this field.
>>
>> Reviewed-by: Looi Hong Aun <hong.aun.looi@intel.com>
>> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
>> Signed-off-by: Khai Wen Tan <khai.wen.tan@linux.intel.com>
>> ---
>>   drivers/net/ethernet/intel/igc/igc_hw.h  |  1 -
>>   drivers/net/ethernet/intel/igc/igc_mac.c | 16 +---------------
>>   2 files changed, 1 insertion(+), 16 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/igc/igc_hw.h b/drivers/net/ethernet/intel/igc/igc_hw.h
>> index be8a49a86d09..86ab8f566f44 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_hw.h
>> +++ b/drivers/net/ethernet/intel/igc/igc_hw.h
>> @@ -92,7 +92,6 @@ struct igc_mac_info {
>>       bool asf_firmware_present;
>>       bool arc_subsystem_valid;
>> -    bool autoneg_failed;
>>       bool get_link_status;
>>   };
>> diff --git a/drivers/net/ethernet/intel/igc/igc_mac.c b/drivers/net/ethernet/intel/igc/igc_mac.c
>> index 7ac6637f8db7..142beb9ae557 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_mac.c
>> +++ b/drivers/net/ethernet/intel/igc/igc_mac.c
>> @@ -438,28 +438,14 @@ void igc_config_collision_dist(struct igc_hw *hw)
>>    * Checks the status of auto-negotiation after link up to ensure that the
>>    * speed and duplex were not forced.  If the link needed to be forced, then
>>    * flow control needs to be forced also.  If auto-negotiation is enabled
>> - * and did not fail, then we configure flow control based on our link
>> - * partner.
>> + * then we configure flow control based on our link partner.
>>    */
>>   s32 igc_config_fc_after_link_up(struct igc_hw *hw)
>>   {
>>       u16 mii_status_reg, mii_nway_adv_reg, mii_nway_lp_ability_reg;
>> -    struct igc_mac_info *mac = &hw->mac;
>>       u16 speed, duplex;
>>       s32 ret_val = 0;
>> -    /* Check for the case where we have fiber media and auto-neg failed
>> -     * so we had to force link.  In this case, we need to force the
>> -     * configuration of the MAC to match the "fc" parameter.
>> -     */
>> -    if (mac->autoneg_failed)
>> -        ret_val = igc_force_mac_fc(hw);
>> -
>> -    if (ret_val) {
>> -        hw_dbg("Error forcing flow control settings\n");
>> -        goto out;
>> -    }
>> -
>>       /* In auto-neg, we need to check and see if Auto-Neg has completed,
>>        * and if so, how the PHY and link partner has flow control
>>        * configured.
> Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Tested-by: Moriya Kadosh <moriyax.kadosh@intel.com>


* [PATCH iwl-net] ice: clear the default forwarding VSI rule when releasing a VSI
From: Petr Oros @ 2026-06-22  8:10 UTC (permalink / raw)
  To: netdev
  Cc: Petr Oros, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jacob Keller, Michal Swiatkowski, intel-wired-lan, linux-kernel

When a VSI is configured as the switch's default forwarding VSI
(ICE_SW_LKUP_DFLT) and is then torn down, the rule is left behind in
the switch. ice_vsi_release() no longer removes it, and the SR-IOV VF
free path (ice_free_vfs() -> ice_free_vf_res() -> ice_vf_vsi_release()
-> ice_vsi_release()) does not disable promiscuous mode either, which
only happens on VF reset in ice_vf_clear_all_promisc_modes().

A trusted VF that enters unicast promiscuous mode becomes the default
forwarding VSI (this is the default mode, when the PF does not have VF
true-promiscuous mode enabled). If the VFs are then destroyed without
the VF first leaving promiscuous mode, the ICE_SW_LKUP_DFLT rule for
the now-freed VSI is leaked. When VFs are recreated, a VSI reuses the
freed hw_vsi_id. If it is assigned a different VSI handle than the
leaked rule holds, ice_set_dflt_vsi() does not recognize it as
already-default, and ice_add_update_vsi_list() folds the dangling
(freed) handle into a VSI list, which the firmware rejects. The VSI
handle assigned on re-creation varies, so the failure is intermittent
rather than every cycle.

Reproduce by repeatedly running the cycle below on the two ports of the
same card, where $VF0 and $VF1 are the netdevs of vf 15 once they
appear. The VF must be brought up so iavf actually pushes the unicast
promiscuous request, and the rule must settle before the VFs are torn
down again:

  echo 16 > /sys/class/net/$PF0/device/sriov_numvfs
  echo 16 > /sys/class/net/$PF1/device/sriov_numvfs
  ip link set $PF0 vf 15 trust on
  ip link set $PF1 vf 15 trust on
  ip link set $VF0 up
  ip link set $VF1 up
  ip link set $VF0 promisc on
  ip link set $VF1 promisc on
  sleep 1
  echo 0 > /sys/class/net/$PF0/device/sriov_numvfs
  echo 0 > /sys/class/net/$PF1/device/sriov_numvfs

Within a few cycles the ice PF and iavf VF log:

  Failed to set VSI 25 as the default forwarding VSI, error -22
  Turning on/off promiscuous mode for VF 63 failed, error: -22
  PF returned error -53 (IAVF_ERR_ADMIN_QUEUE_ERROR) to our request 14

This cleanup used to live in ice_vsi_release() but was dropped by the
referenced refactor. Restore it. Clear the default forwarding VSI rule
in ice_vsi_release() when this VSI owns it, which covers every teardown
path.
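
The leak and the fix can be modelled with a toy state machine. This is a sketch under stated assumptions: struct sw_state, NO_VSI and all four helpers are hypothetical, and the model only tracks which VSI handle the default forwarding rule points at, not the real switch rule lists.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_VSI (-1)	/* hypothetical marker: no default forwarding rule */

/* Toy model of the switch: remembers which handle owns the default rule. */
struct sw_state {
	int dflt_vsi_handle;
};

static bool is_dflt_vsi(const struct sw_state *sw, int handle)
{
	return sw->dflt_vsi_handle == handle;
}

static void set_dflt_vsi(struct sw_state *sw, int handle)
{
	sw->dflt_vsi_handle = handle;
}

static void clear_dflt_vsi(struct sw_state *sw)
{
	sw->dflt_vsi_handle = NO_VSI;
}

/* Release a VSI. With clear_rule == false the default rule survives the
 * release and keeps pointing at a freed handle; with clear_rule == true
 * the rule is removed when this VSI owns it.
 */
static void vsi_release(struct sw_state *sw, int handle, bool clear_rule)
{
	if (clear_rule && is_dflt_vsi(sw, handle))
		clear_dflt_vsi(sw);
	/* remaining teardown is out of scope for this model */
}
```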

Fixes: 6624e780a577 ("ice: split ice_vsi_setup into smaller functions")
Signed-off-by: Petr Oros <poros@redhat.com>
---
 drivers/net/ethernet/intel/ice/ice_lib.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 2717cc31bff8fe..408464434506ef 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2872,6 +2872,9 @@ int ice_vsi_release(struct ice_vsi *vsi)
 		return -ENODEV;
 	pf = vsi->back;

+	if (ice_is_vsi_dflt_vsi(vsi))
+		ice_clear_dflt_vsi(vsi);
+
 	if (test_bit(ICE_FLAG_RSS_ENA, pf->flags))
 		ice_rss_clean(vsi);

-- 
2.53.0


* Re: [Intel-wired-lan] [PATCH iwl-next v5 3/4] igc: replace goto out with direct returns in igc_config_fc_after_link_up()
From: Kadosh, MoriyaX @ 2026-06-22  8:11 UTC (permalink / raw)
  To: Ruinskiy, Dima, KhaiWenTan, anthony.l.nguyen, przemyslaw.kitszel,
	andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: intel-wired-lan, netdev, linux-kernel, faizal.abdul.rahim,
	hong.aun.looi, hector.blanco.alcaine, khai.wen.tan, Faizal Rahim
In-Reply-To: <58af982a-1531-43f8-934c-e83b45111b1f@intel.com>



On 14/06/2026 10:17, Ruinskiy, Dima wrote:
> On 08/05/2026 0:47, KhaiWenTan wrote:
>> From: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
>>
>> The out: label only returns ret_val with no cleanup. The kernel coding
>> style guide states: "If there is no cleanup needed then just return
>> directly." (Documentation/process/coding-style.rst, section 7).
>>
>> This improves readability ahead of a subsequent patch that introduces a
>> new goto label in this function.
>>
>> No functional change.
>>
>> Reviewed-by: Looi Hong Aun <hong.aun.looi@intel.com>
>> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
>> Signed-off-by: Khai Wen Tan <khai.wen.tan@linux.intel.com>
>> ---
>>   drivers/net/ethernet/intel/igc/igc_mac.c | 15 +++++++--------
>>   1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/igc/igc_mac.c b/drivers/net/ethernet/intel/igc/igc_mac.c
>> index 142beb9ae557..0a3d3f357505 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_mac.c
>> +++ b/drivers/net/ethernet/intel/igc/igc_mac.c
>> @@ -458,15 +458,15 @@ s32 igc_config_fc_after_link_up(struct igc_hw *hw)
>>       ret_val = hw->phy.ops.read_reg(hw, PHY_STATUS,
>>                          &mii_status_reg);
>>       if (ret_val)
>> -        goto out;
>> +        return ret_val;
>>       ret_val = hw->phy.ops.read_reg(hw, PHY_STATUS,
>>                          &mii_status_reg);
>>       if (ret_val)
>> -        goto out;
>> +        return ret_val;
>>       if (!(mii_status_reg & MII_SR_AUTONEG_COMPLETE)) {
>>           hw_dbg("Copper PHY and Auto Neg has not completed.\n");
>> -        goto out;
>> +        return ret_val;
>>       }
>>       /* The AutoNeg process has completed, so we now need to
>> @@ -478,11 +478,11 @@ s32 igc_config_fc_after_link_up(struct igc_hw *hw)
>>       ret_val = hw->phy.ops.read_reg(hw, PHY_AUTONEG_ADV,
>>                          &mii_nway_adv_reg);
>>       if (ret_val)
>> -        goto out;
>> +        return ret_val;
>>       ret_val = hw->phy.ops.read_reg(hw, PHY_LP_ABILITY,
>>                          &mii_nway_lp_ability_reg);
>>       if (ret_val)
>> -        goto out;
>> +        return ret_val;
>>       /* Two bits in the Auto Negotiation Advertisement Register
>>        * (Address 4) and two bits in the Auto Negotiation Base
>>        * Page Ability Register (Address 5) determine flow control
>> @@ -598,7 +598,7 @@ s32 igc_config_fc_after_link_up(struct igc_hw *hw)
>>       ret_val = hw->mac.ops.get_speed_and_duplex(hw, &speed, &duplex);
>>       if (ret_val) {
>>           hw_dbg("Error getting link speed and duplex\n");
>> -        goto out;
>> +        return ret_val;
>>       }
>>       if (duplex == HALF_DUPLEX)
>> @@ -610,10 +610,9 @@ s32 igc_config_fc_after_link_up(struct igc_hw *hw)
>>       ret_val = igc_force_mac_fc(hw);
>>       if (ret_val) {
>>           hw_dbg("Error forcing flow control settings\n");
>> -        goto out;
>> +        return ret_val;
>>       }
>> -out:
>>       return ret_val;
>>   }
> Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Tested-by: Moriya Kadosh <moriyax.kadosh@intel.com>
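
The style point in the quoted commit message can be shown on a reduced example. Everything here is hypothetical: read_reg_stub() stands in for a PHY register read, STATUS_DONE for a status bit, and the two functions are the same logic written with and without a return-only label.

```c
#include <assert.h>

#define STATUS_DONE 0x20	/* hypothetical status bit */

/* Hypothetical register read: fails with -5 when asked to. */
static int read_reg_stub(int fail, int *val)
{
	if (fail)
		return -5;
	*val = STATUS_DONE;
	return 0;
}

/* Before: every exit jumps to a label that only returns. */
static int check_with_goto(int fail)
{
	int val = 0;
	int ret;

	ret = read_reg_stub(fail, &val);
	if (ret)
		goto out;
	if (!(val & STATUS_DONE))
		goto out;
out:
	return ret;
}

/* After: no cleanup is needed, so each exit returns directly. */
static int check_direct(int fail)
{
	int val = 0;
	int ret;

	ret = read_reg_stub(fail, &val);
	if (ret)
		return ret;
	if (!(val & STATUS_DONE))
		return ret;
	return ret;
}
```

Both variants return the same value on every path, which is what "no functional change" means for this kind of conversion.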


* Re: [Intel-wired-lan] [PATCH iwl-next v5 4/4] igc: add support for forcing link speed without autonegotiation
From: Kadosh, MoriyaX @ 2026-06-22  8:12 UTC (permalink / raw)
  To: Ruinskiy, Dima, KhaiWenTan, anthony.l.nguyen, przemyslaw.kitszel,
	andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: intel-wired-lan, netdev, linux-kernel, faizal.abdul.rahim,
	hong.aun.looi, hector.blanco.alcaine, khai.wen.tan, Faizal Rahim
In-Reply-To: <d8f4f16c-adf6-4d99-bb76-09c047ba19eb@intel.com>



On 14/06/2026 10:17, Ruinskiy, Dima wrote:
> On 08/05/2026 0:47, KhaiWenTan wrote:
>> From: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
>>
>> Allow users to force 10/100 Mb/s link speed and duplex via ethtool
>> when autonegotiation is disabled. Previously, the driver rejected
>> these requests with "Force mode currently not supported.".
>>
>> Forcing at 1000 Mb/s and 2500 Mb/s is not supported.
>>
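
A userspace sketch of how the forced-mode control word is composed. The bit values are copied from the igc_defines.h hunk quoted in this mail (GENMASK(10, 8) expanded to 0x700); the CTRL_* and FORCED_* names, the -1 error code and forced_ctrl() itself are hypothetical simplifications, not the driver's identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define CTRL_FD		(1u << 0)	/* full duplex */
#define CTRL_SLU	0x00000040u	/* set link up */
#define CTRL_SPEED_MASK	0x00000700u	/* bits 10:8 */
#define CTRL_SPEED_10	(0u << 8)
#define CTRL_SPEED_100	(1u << 8)
#define CTRL_FRCSPD	0x00000800u	/* force speed */
#define CTRL_FRCDPX	0x00001000u	/* force duplex */

enum forced_mode { FORCED_10H, FORCED_10F, FORCED_100H, FORCED_100F };

/* Compute the control word for a forced speed/duplex setting.
 * Stale speed/duplex bits are cleared first so an earlier setting
 * cannot leak into the new one. Returns 0 on success, -1 for an
 * unknown mode (in which case *out is left untouched).
 */
static int forced_ctrl(uint32_t ctrl, enum forced_mode mode, uint32_t *out)
{
	ctrl &= ~(CTRL_FRCSPD | CTRL_FRCDPX | CTRL_SPEED_MASK | CTRL_FD);
	ctrl |= CTRL_SLU | CTRL_FRCSPD | CTRL_FRCDPX;

	switch (mode) {
	case FORCED_10H:
		ctrl |= CTRL_SPEED_10;
		break;
	case FORCED_10F:
		ctrl |= CTRL_SPEED_10 | CTRL_FD;
		break;
	case FORCED_100H:
		ctrl |= CTRL_SPEED_100;
		break;
	case FORCED_100F:
		ctrl |= CTRL_SPEED_100 | CTRL_FD;
		break;
	default:
		return -1;
	}

	*out = ctrl;
	return 0;
}
```

Starting from a zeroed register, 100 Mb/s full duplex yields 0x1941 and 10 Mb/s half duplex yields 0x1840 under these definitions.
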
>> Reviewed-by: Looi Hong Aun <hong.aun.looi@intel.com>
>> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
>> Signed-off-by: Khai Wen Tan <khai.wen.tan@linux.intel.com>
>> ---
>>   drivers/net/ethernet/intel/igc/igc_base.c    |  35 ++++-
>>   drivers/net/ethernet/intel/igc/igc_defines.h |   9 +-
>>   drivers/net/ethernet/intel/igc/igc_ethtool.c | 138 ++++++++++++++-----
>>   drivers/net/ethernet/intel/igc/igc_hw.h      |   9 ++
>>   drivers/net/ethernet/intel/igc/igc_mac.c     |  12 ++
>>   drivers/net/ethernet/intel/igc/igc_main.c    |   2 +-
>>   drivers/net/ethernet/intel/igc/igc_phy.c     |  65 ++++++++-
>>   drivers/net/ethernet/intel/igc/igc_phy.h     |   1 +
>>   8 files changed, 220 insertions(+), 51 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/igc/igc_base.c b/drivers/net/ethernet/intel/igc/igc_base.c
>> index 1613b562d17c..ab9120a3127f 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_base.c
>> +++ b/drivers/net/ethernet/intel/igc/igc_base.c
>> @@ -114,11 +114,35 @@ static s32 igc_setup_copper_link_base(struct igc_hw *hw)
>>       u32 ctrl;
>>       ctrl = rd32(IGC_CTRL);
>> -    ctrl |= IGC_CTRL_SLU;
>> -    ctrl &= ~(IGC_CTRL_FRCSPD | IGC_CTRL_FRCDPX);
>> -    wr32(IGC_CTRL, ctrl);
>> -
>> -    ret_val = igc_setup_copper_link(hw);
>> +    ctrl &= ~(IGC_CTRL_FRCSPD | IGC_CTRL_FRCDPX |
>> +          IGC_CTRL_SPEED_MASK | IGC_CTRL_FD);
>> +
>> +    if (hw->mac.autoneg_enabled) {
>> +        ctrl |= IGC_CTRL_SLU;
>> +        wr32(IGC_CTRL, ctrl);
>> +        ret_val = igc_setup_copper_link(hw);
>> +    } else {
>> +        ctrl |= IGC_CTRL_SLU | IGC_CTRL_FRCSPD | IGC_CTRL_FRCDPX;
>> +
>> +        switch (hw->mac.forced_speed_duplex) {
>> +        case IGC_FORCED_10H:
>> +            ctrl |= IGC_CTRL_SPEED_10;
>> +            break;
>> +        case IGC_FORCED_10F:
>> +            ctrl |= IGC_CTRL_SPEED_10 | IGC_CTRL_FD;
>> +            break;
>> +        case IGC_FORCED_100H:
>> +            ctrl |= IGC_CTRL_SPEED_100;
>> +            break;
>> +        case IGC_FORCED_100F:
>> +            ctrl |= IGC_CTRL_SPEED_100 | IGC_CTRL_FD;
>> +            break;
>> +        default:
>> +            return -IGC_ERR_CONFIG;
>> +        }
>> +        wr32(IGC_CTRL, ctrl);
>> +        ret_val = igc_setup_copper_link(hw);
>> +    }
>>       return ret_val;
>>   }
>> @@ -443,6 +467,7 @@ static const struct igc_phy_operations igc_phy_ops_base = {
>>       .reset            = igc_phy_hw_reset,
>>       .read_reg        = igc_read_phy_reg_gpy,
>>       .write_reg        = igc_write_phy_reg_gpy,
>> +    .force_speed_duplex    = igc_force_speed_duplex,
>>   };
>>   const struct igc_info igc_base_info = {
>> diff --git a/drivers/net/ethernet/intel/igc/igc_defines.h b/drivers/net/ethernet/intel/igc/igc_defines.h
>> index 9482ab11f050..3f504751c2d9 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_defines.h
>> +++ b/drivers/net/ethernet/intel/igc/igc_defines.h
>> @@ -129,10 +129,13 @@
>>   #define IGC_ERR_SWFW_SYNC        13
>>   /* Device Control */
>> +#define IGC_CTRL_FD        BIT(0)  /* Full Duplex */
>>   #define IGC_CTRL_RST        0x04000000  /* Global reset */
>> -
>>   #define IGC_CTRL_PHY_RST    0x80000000  /* PHY Reset */
>>   #define IGC_CTRL_SLU        0x00000040  /* Set link up (Force Link) */
>> +#define IGC_CTRL_SPEED_MASK    GENMASK(10, 8)
>> +#define IGC_CTRL_SPEED_10    FIELD_PREP(IGC_CTRL_SPEED_MASK, 0)
>> +#define IGC_CTRL_SPEED_100    FIELD_PREP(IGC_CTRL_SPEED_MASK, 1)
>>   #define IGC_CTRL_FRCSPD        0x00000800  /* Force Speed */
>>   #define IGC_CTRL_FRCDPX        0x00001000  /* Force Duplex */
>>   #define IGC_CTRL_VME        0x40000000  /* IEEE VLAN mode enable */
>> @@ -673,6 +676,10 @@
>>   #define IGC_GEN_POLL_TIMEOUT    1920
>>   /* PHY Control Register */
>> +#define MII_CR_SPEED_MASK    (BIT(6) | BIT(13))
>> +#define MII_CR_SPEED_10        0x0000    /* SSM=0, SSL=0: 10 Mb/s */
>> +#define MII_CR_SPEED_100    BIT(13)    /* SSM=0, SSL=1: 100 Mb/s */
>> +#define MII_CR_DUPLEX_EN    BIT(8)    /* 0 = Half Duplex, 1 = Full Duplex */
>>   #define MII_CR_RESTART_AUTO_NEG    0x0200  /* Restart auto negotiation */
>>   #define MII_CR_POWER_DOWN    0x0800  /* Power down */
>>   #define MII_CR_AUTO_NEG_EN    0x1000  /* Auto Neg Enable */
>> diff --git a/drivers/net/ethernet/intel/igc/igc_ethtool.c b/drivers/net/ethernet/intel/igc/igc_ethtool.c
>> index cfcbf2fdad6e..b103836a895f 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_ethtool.c
>> +++ b/drivers/net/ethernet/intel/igc/igc_ethtool.c
>> @@ -1914,44 +1914,58 @@ static int igc_ethtool_get_link_ksettings(struct net_device *netdev,
>>       ethtool_link_ksettings_add_link_mode(cmd, supported, TP);
>>       ethtool_link_ksettings_add_link_mode(cmd, advertising, TP);
>> -    /* advertising link modes */
>> -    if (hw->phy.autoneg_advertised & ADVERTISE_10_HALF)
>> -        ethtool_link_ksettings_add_link_mode(cmd, advertising, 10baseT_Half);
>> -    if (hw->phy.autoneg_advertised & ADVERTISE_10_FULL)
>> -        ethtool_link_ksettings_add_link_mode(cmd, advertising, 10baseT_Full);
>> -    if (hw->phy.autoneg_advertised & ADVERTISE_100_HALF)
>> -        ethtool_link_ksettings_add_link_mode(cmd, advertising, 100baseT_Half);
>> -    if (hw->phy.autoneg_advertised & ADVERTISE_100_FULL)
>> -        ethtool_link_ksettings_add_link_mode(cmd, advertising, 100baseT_Full);
>> -    if (hw->phy.autoneg_advertised & ADVERTISE_1000_FULL)
>> -        ethtool_link_ksettings_add_link_mode(cmd, advertising, 1000baseT_Full);
>> -    if (hw->phy.autoneg_advertised & ADVERTISE_2500_FULL)
>> -        ethtool_link_ksettings_add_link_mode(cmd, advertising, 2500baseT_Full);
>> -
>>       /* set autoneg settings */
>>       ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg);
>> -    ethtool_link_ksettings_add_link_mode(cmd, advertising, Autoneg);
>> +    if (hw->mac.autoneg_enabled) {
>> +        ethtool_link_ksettings_add_link_mode(cmd, advertising, Autoneg);
>> +        cmd->base.autoneg = AUTONEG_ENABLE;
>> +
>> +        /* advertising link modes only apply when autoneg is on */
>> +        if (hw->phy.autoneg_advertised & ADVERTISE_10_HALF)
>> +            ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> +                                 10baseT_Half);
>> +        if (hw->phy.autoneg_advertised & ADVERTISE_10_FULL)
>> +            ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> +                                 10baseT_Full);
>> +        if (hw->phy.autoneg_advertised & ADVERTISE_100_HALF)
>> +            ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> +                                 100baseT_Half);
>> +        if (hw->phy.autoneg_advertised & ADVERTISE_100_FULL)
>> +            ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> +                                 100baseT_Full);
>> +        if (hw->phy.autoneg_advertised & ADVERTISE_1000_FULL)
>> +            ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> +                                 1000baseT_Full);
>> +        if (hw->phy.autoneg_advertised & ADVERTISE_2500_FULL)
>> +            ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> +                                 2500baseT_Full);
>> +
>> +        /* Set pause flow control advertising */
>> +        switch (hw->fc.requested_mode) {
>> +        case igc_fc_full:
>> +            ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> +                                 Pause);
>> +            break;
>> +        case igc_fc_rx_pause:
>> +            ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> +                                 Pause);
>> +            ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> +                                 Asym_Pause);
>> +            break;
>> +        case igc_fc_tx_pause:
>> +            ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> +                                 Asym_Pause);
>> +            break;
>> +        default:
>> +            break;
>> +        }
>> +    } else {
>> +        cmd->base.autoneg = AUTONEG_DISABLE;
>> +    }
>> -    /* Set pause flow control settings */
>> +    /* Pause is always supported */
>>       ethtool_link_ksettings_add_link_mode(cmd, supported, Pause);
>> -    switch (hw->fc.requested_mode) {
>> -    case igc_fc_full:
>> -        ethtool_link_ksettings_add_link_mode(cmd, advertising, Pause);
>> -        break;
>> -    case igc_fc_rx_pause:
>> -        ethtool_link_ksettings_add_link_mode(cmd, advertising, Pause);
>> -        ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> -                             Asym_Pause);
>> -        break;
>> -    case igc_fc_tx_pause:
>> -        ethtool_link_ksettings_add_link_mode(cmd, advertising,
>> -                             Asym_Pause);
>> -        break;
>> -    default:
>> -        break;
>> -    }
>> -
>>       status = pm_runtime_suspended(&adapter->pdev->dev) ?
>>            0 : rd32(IGC_STATUS);
>> @@ -1983,7 +1997,6 @@ static int igc_ethtool_get_link_ksettings(struct net_device *netdev,
>>           cmd->base.duplex = DUPLEX_UNKNOWN;
>>       }
>>       cmd->base.speed = speed;
>> -    cmd->base.autoneg = AUTONEG_ENABLE;
>>       /* MDI-X => 2; MDI =>1; Invalid =>0 */
>>       if (hw->phy.media_type == igc_media_type_copper)
>> @@ -2000,6 +2013,37 @@ static int igc_ethtool_get_link_ksettings(struct net_device *netdev,
>>       return 0;
>>   }
>> +/**
>> + * igc_handle_autoneg_disabled - Configure forced speed/duplex settings
>> + * @adapter: private driver structure
>> + * @speed: requested speed (must be SPEED_10 or SPEED_100)
>> + * @duplex: requested duplex
>> + *
>> + * Records forced speed/duplex when autoneg is disabled.
>> + * Caller must validate speed before calling this function.
>> + */
>> +static void igc_handle_autoneg_disabled(struct igc_adapter *adapter, u32 speed,
>> +                    u8 duplex)
>> +{
>> +    struct igc_mac_info *mac = &adapter->hw.mac;
>> +
>> +    switch (speed) {
>> +    case SPEED_10:
>> +        mac->forced_speed_duplex = (duplex == DUPLEX_FULL) ?
>> +            IGC_FORCED_10F : IGC_FORCED_10H;
>> +        break;
>> +    case SPEED_100:
>> +        mac->forced_speed_duplex = (duplex == DUPLEX_FULL) ?
>> +            IGC_FORCED_100F : IGC_FORCED_100H;
>> +        break;
>> +    default:
>> +        WARN_ONCE(1, "Unsupported speed %u\n", speed);
>> +        return;
>> +    }
>> +
>> +    mac->autoneg_enabled = false;
>> +}
>> +
>>   /**
>>    * igc_handle_autoneg_enabled - Configure autonegotiation advertisement
>>    * @adapter: private driver structure
>> @@ -2038,6 +2082,7 @@ static void igc_handle_autoneg_enabled(struct igc_adapter *adapter,
>>                             10baseT_Half))
>>           advertised |= ADVERTISE_10_HALF;
>> +    hw->mac.autoneg_enabled = true;
>>       hw->phy.autoneg_advertised = advertised;
>>       if (adapter->fc_autoneg)
>>           hw->fc.requested_mode = igc_fc_default;
>> @@ -2059,6 +2104,12 @@ igc_ethtool_set_link_ksettings(struct net_device *netdev,
>>           return -EINVAL;
>>       }
>> +    if (cmd->base.autoneg != AUTONEG_ENABLE &&
>> +        cmd->base.autoneg != AUTONEG_DISABLE) {
>> +        netdev_info(dev, "Unsupported autoneg setting\n");
>> +        return -EINVAL;
>> +    }
>> +
>>       /* MDI setting is only allowed when autoneg enabled because
>>        * some hardware doesn't allow MDI setting when speed or
>>        * duplex is forced.
>> @@ -2071,14 +2122,25 @@ igc_ethtool_set_link_ksettings(struct net_device *netdev,
>>           }
>>       }
>> +    if (cmd->base.autoneg == AUTONEG_DISABLE) {
>> +        if (cmd->base.speed != SPEED_10 && cmd->base.speed != SPEED_100) {
>> +            netdev_info(dev, "Unsupported speed for forced link\n");
>> +            return -EINVAL;
>> +        }
>> +        if (cmd->base.duplex != DUPLEX_HALF && cmd->base.duplex != DUPLEX_FULL) {
>> +            netdev_info(dev, "Duplex must be half or full for forced link\n");
>> +            return -EINVAL;
>> +        }
>> +    }
>> +
>>       while (test_and_set_bit(__IGC_RESETTING, &adapter->state))
>>           usleep_range(1000, 2000);
>> -    if (cmd->base.autoneg == AUTONEG_ENABLE) {
>> +    if (cmd->base.autoneg == AUTONEG_ENABLE)
>>           igc_handle_autoneg_enabled(adapter, cmd);
>> -    } else {
>> -        netdev_info(dev, "Force mode currently not supported\n");
>> -    }
>> +    else
>> +        igc_handle_autoneg_disabled(adapter, cmd->base.speed,
>> +                        cmd->base.duplex);
>>       /* MDI-X => 2; MDI => 1; Auto => 3 */
>>       if (cmd->base.eth_tp_mdix_ctrl) {
>> diff --git a/drivers/net/ethernet/intel/igc/igc_hw.h b/drivers/net/ethernet/intel/igc/igc_hw.h
>> index 86ab8f566f44..62aaee55668a 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_hw.h
>> +++ b/drivers/net/ethernet/intel/igc/igc_hw.h
>> @@ -73,6 +73,13 @@ struct igc_info {
>>   extern const struct igc_info igc_base_info;
>> +enum igc_forced_speed_duplex {
>> +    IGC_FORCED_10H,
>> +    IGC_FORCED_10F,
>> +    IGC_FORCED_100H,
>> +    IGC_FORCED_100F,
>> +};
>> +
>>   struct igc_mac_info {
>>       struct igc_mac_operations ops;
>> @@ -93,6 +100,8 @@ struct igc_mac_info {
>>       bool arc_subsystem_valid;
>>       bool get_link_status;
>> +    bool autoneg_enabled;
>> +    enum igc_forced_speed_duplex forced_speed_duplex;
>>   };
>>   struct igc_nvm_operations {
>> diff --git a/drivers/net/ethernet/intel/igc/igc_mac.c b/drivers/net/ethernet/intel/igc/igc_mac.c
>> index 0a3d3f357505..d6f3f6618469 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_mac.c
>> +++ b/drivers/net/ethernet/intel/igc/igc_mac.c
>> @@ -446,6 +446,17 @@ s32 igc_config_fc_after_link_up(struct igc_hw *hw)
>>       u16 speed, duplex;
>>       s32 ret_val = 0;
>> +    /* Without autoneg, flow control capability is not exchanged with the
>> +     * link partner. IEEE 802.3 prohibits flow control in half-duplex mode.
>> +     */
>> +    if (!hw->mac.autoneg_enabled) {
>> +        if (hw->mac.forced_speed_duplex == IGC_FORCED_10H ||
>> +            hw->mac.forced_speed_duplex == IGC_FORCED_100H)
>> +            hw->fc.current_mode = igc_fc_none;
>> +
>> +        goto force_fc;
>> +    }
>> +
>>       /* In auto-neg, we need to check and see if Auto-Neg has completed,
>>        * and if so, how the PHY and link partner has flow control
>>        * configured.
>> @@ -607,6 +618,7 @@ s32 igc_config_fc_after_link_up(struct igc_hw *hw)
>>       /* Now we call a subroutine to actually force the MAC
>>        * controller to use the correct flow control settings.
>>        */
>> +force_fc:
>>       ret_val = igc_force_mac_fc(hw);
>>       if (ret_val) {
>>           hw_dbg("Error forcing flow control settings\n");
>> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
>> index 72bc5128d8b8..437e1d1ef1e4 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_main.c
>> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
>> @@ -7298,7 +7298,7 @@ static int igc_probe(struct pci_dev *pdev,
>>       /* Initialize link properties that are user-changeable */
>>       adapter->fc_autoneg = true;
>>       hw->phy.autoneg_advertised = 0xaf;
>> -
>> +    hw->mac.autoneg_enabled = true;
>>       hw->fc.requested_mode = igc_fc_default;
>>       hw->fc.current_mode = igc_fc_default;
>> diff --git a/drivers/net/ethernet/intel/igc/igc_phy.c b/drivers/net/ethernet/intel/igc/igc_phy.c
>> index 6c4d204aecfa..4cf737fb3b21 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_phy.c
>> +++ b/drivers/net/ethernet/intel/igc/igc_phy.c
>> @@ -494,12 +494,20 @@ s32 igc_setup_copper_link(struct igc_hw *hw)
>>       s32 ret_val = 0;
>>       bool link;
>> -    /* Setup autoneg and flow control advertisement and perform
>> -     * autonegotiation.
>> -     */
>> -    ret_val = igc_copper_link_autoneg(hw);
>> -    if (ret_val)
>> -        goto out;
>> +    if (hw->mac.autoneg_enabled) {
>> +        /* Setup autoneg and flow control advertisement and perform
>> +         * autonegotiation.
>> +         */
>> +        ret_val = igc_copper_link_autoneg(hw);
>> +        if (ret_val)
>> +            goto out;
>> +    } else {
>> +        ret_val = hw->phy.ops.force_speed_duplex(hw);
>> +        if (ret_val) {
>> +            hw_dbg("Error Forcing Speed/Duplex\n");
>> +            goto out;
>> +        }
>> +    }
>>       /* Check link status. Wait up to 100 microseconds for link to become
>>        * valid.
>> @@ -778,3 +786,48 @@ u16 igc_read_phy_fw_version(struct igc_hw *hw)
>>       return gphy_version;
>>   }
>> +
>> +/**
>> + * igc_force_speed_duplex - Force PHY speed and duplex settings
>> + * @hw: pointer to the HW structure
>> + *
>> + * Programs the GPY PHY control register to disable autonegotiation
>> + * and force the speed/duplex indicated by hw->mac.forced_speed_duplex.
>> + */
>> +s32 igc_force_speed_duplex(struct igc_hw *hw)
>> +{
>> +    struct igc_phy_info *phy = &hw->phy;
>> +    u16 phy_ctrl;
>> +    s32 ret_val;
>> +
>> +    ret_val = phy->ops.read_reg(hw, PHY_CONTROL, &phy_ctrl);
>> +    if (ret_val)
>> +        return ret_val;
>> +
>> +    phy_ctrl &= ~(MII_CR_SPEED_MASK | MII_CR_DUPLEX_EN |
>> +              MII_CR_AUTO_NEG_EN | MII_CR_RESTART_AUTO_NEG);
>> +
>> +    switch (hw->mac.forced_speed_duplex) {
>> +    case IGC_FORCED_10H:
>> +        phy_ctrl |= MII_CR_SPEED_10;
>> +        break;
>> +    case IGC_FORCED_10F:
>> +        phy_ctrl |= MII_CR_SPEED_10 | MII_CR_DUPLEX_EN;
>> +        break;
>> +    case IGC_FORCED_100H:
>> +        phy_ctrl |= MII_CR_SPEED_100;
>> +        break;
>> +    case IGC_FORCED_100F:
>> +        phy_ctrl |= MII_CR_SPEED_100 | MII_CR_DUPLEX_EN;
>> +        break;
>> +    default:
>> +        return -IGC_ERR_CONFIG;
>> +    }
>> +
>> +    ret_val = phy->ops.write_reg(hw, PHY_CONTROL, phy_ctrl);
>> +    if (ret_val)
>> +        return ret_val;
>> +
>> +    hw->mac.get_link_status = true;
>> +    return 0;
>> +}
>> diff --git a/drivers/net/ethernet/intel/igc/igc_phy.h b/drivers/net/ethernet/intel/igc/igc_phy.h
>> index 832a7e359f18..d37a89174826 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_phy.h
>> +++ b/drivers/net/ethernet/intel/igc/igc_phy.h
>> @@ -18,5 +18,6 @@ void igc_power_down_phy_copper(struct igc_hw *hw);
>>   s32 igc_write_phy_reg_gpy(struct igc_hw *hw, u32 offset, u16 data);
>>   s32 igc_read_phy_reg_gpy(struct igc_hw *hw, u32 offset, u16 *data);
>>   u16 igc_read_phy_fw_version(struct igc_hw *hw);
>> +s32 igc_force_speed_duplex(struct igc_hw *hw);
>>   #endif
> Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Tested-by: Moriya Kadosh <moriyax.kadosh@intel.com>
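
For readers following the quoted igc_force_speed_duplex(), the register
arithmetic can be sketched in userspace. This is a minimal sketch and not
the driver code: the bit values are the standard IEEE 802.3 Clause 22
control-register bits, and the names CR_*, enum forced_mode and
build_forced_ctrl() are hypothetical stand-ins for the driver's MII_CR_*
macros and enum igc_forced_speed_duplex.

```c
#include <assert.h>
#include <stdint.h>

/* IEEE 802.3 Clause 22 control register (register 0) bits. */
#define CR_SPEED_MSB   0x0040u /* speed select, MSB (1000 Mb/s when set alone) */
#define CR_DUPLEX      0x0100u /* 0 = half duplex, 1 = full duplex */
#define CR_RESTART_AN  0x0200u /* restart auto-negotiation */
#define CR_AN_ENABLE   0x1000u /* auto-negotiation enable */
#define CR_SPEED_LSB   0x2000u /* speed select, LSB (100 Mb/s when set alone) */
#define CR_SPEED_MASK  (CR_SPEED_MSB | CR_SPEED_LSB)

/* Hypothetical mirror of the four forced modes in the quoted patch. */
enum forced_mode { FORCED_10H, FORCED_10F, FORCED_100H, FORCED_100F };

/*
 * Clear speed, duplex and autoneg bits in *ctrl, then set the bits for
 * the requested forced mode. All other bits (e.g. power down) are kept.
 * Returns 0 on success, -1 for an unknown mode (*ctrl is left untouched).
 */
static int build_forced_ctrl(uint16_t *ctrl, enum forced_mode mode)
{
	uint16_t v;

	assert(ctrl != 0);
	v = *ctrl & (uint16_t)~(CR_SPEED_MASK | CR_DUPLEX |
				CR_AN_ENABLE | CR_RESTART_AN);

	switch (mode) {
	case FORCED_10H:
		break; /* 10 Mb/s half duplex is the all-zero encoding */
	case FORCED_10F:
		v |= CR_DUPLEX;
		break;
	case FORCED_100H:
		v |= CR_SPEED_LSB;
		break;
	case FORCED_100F:
		v |= CR_SPEED_LSB | CR_DUPLEX;
		break;
	default:
		return -1;
	}

	*ctrl = v;
	return 0;
}
```

The read-modify-write shape matters here: only the speed, duplex and
autoneg bits are rewritten, so unrelated bits such as power down survive.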

^ permalink raw reply

* Re: [PATCH net] net: dst: block BH in ipip6_tunnel_xmit
From: Eric Dumazet @ 2026-06-22  8:13 UTC (permalink / raw)
  To: yuan.gao
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Yue Haibing, Kuniyuki Iwashima, Thorsten Blum, Kyle Zeng,
	Kees Cook, netdev, linux-kernel
In-Reply-To: <20260622033118.244651-1-yuan.gao@ucloud.cn>

On Sun, Jun 21, 2026 at 8:31 PM yuan.gao <yuan.gao@ucloud.cn> wrote:
>
> Similar to commit 1378817486d6 ("tipc: block BH before using dst_cache"),
> the dst cache helper functions must be invoked with local BH disabled.
>
> This ensures proper synchronization and fixes a potential race condition
> on SMP systems.
>
> Signed-off-by: yuan.gao <yuan.gao@ucloud.cn>
> ---

All ndo_start_xmit() methods already run with BH blocked, can you give
us a stack trace when this would not be enforced?

You forgot a Fixes: tag.

^ permalink raw reply

* Re: [PATCH net,v2 00/14] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2026-06-22  8:16 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260620222738.112506-1-pablo@netfilter.org>

Hi,

Sashiko reports two issues, one in:

- netfilter: flowtable: fix offloaded ct timeout never being extended
  which is real for net/sched/act_ct.c, this was a preexisting issue,
  we can follow up on it.

- netfilter: nf_conntrack_expect: use conntrack GC to reap expectations
  I already planned to follow up on this and a few more subtle issues
  (including one related patch I have withdrawn because it is
   incomplete).

Please apply, thanks.

On Sun, Jun 21, 2026 at 12:27:24AM +0200, Pablo Neira Ayuso wrote:
> This is v2, dropping two patches that need a bit more work,
> uncovered by sashiko. I have revisited the wording of this cover
> letter to refine it.
> 
> -o-
> 
> Hi,
>  
> The following patchset contains Netfilter fixes for net. This batches
> fixes for real crashes with trivial/correctness fixes. There is also
> a possible race when removing an expectation.
>  
> 1) Fix the incorrect flowtable timeout extension for entries in
>    hw offload, from Adrian Bente. This is correcting a defect in
>    the functionality, no crash.
>  
> 2) Hold reference to device under the fake dst in br_netfilter,
>    from Haoze Xie. This is fixing a possible UaF if the device
>    is removed while packet is sitting in nfqueue.
>  
> 3) Reject template conntrack in xt_cluster, otherwise access to
>    uninitialized conntrack fields is possible, leading to a WARN_ON
>    due to an unset layer 3 protocol. From Wyatt Feng.
>
> 4) Make sure the IPv6 tunnel header is in the linear skb data
>    area before pulling. While at it, remove incomplete NEXTHDR_DEST
>    support. From Lorenzo Bianconi. This possibly leads to a crash
>    if the IPv4 header is not in the linear area.
>
> 5) Use test_bit_acquire in ipset hash set to avoid reordering
>    of subsequent memory accesses. This addresses an LLM-related
>    report, no crash has been observed. From Jozsef Kadlecsik.
>  
> 6) Use test_bit_acquire in ipset bitmap set too, for the same
>    reason as in the previous patch, from Jozsef Kadlecsik.
>  
> 7) Call kfree_rcu() after rcu_assign_pointer() to address a
>    possible UaF if kfree_rcu() runs immediately, which to my
>    understanding never happens. Never observed in practice,
>    reported by LLM. Also from Jozsef Kadlecsik.
>
> 8) Use disable_delayed_work_sync() instead of cancel_delayed_work_sync()
>    so that the ipset GC handler cannot re-queue work, as reported by LLM.
>    From Jozsef Kadlecsik. This is for correctness.
>
> 9) Restore the check in nft_payload for a payload offset exceeding
>    2^16. From Florian Westphal. This fixes a silent truncation,
>    not a big deal, but better be assertive and reject it.
>  
> 10) Validate NFT_META_BRI_IIFHWADDR can only run from bridge
>     prerouting. From Florian Westphal. Harmless but it could allow
>     to read bytes from skb->cb.
>  
> 11) Zero out destination hardware address during the flowtable
>     path setup, also from Florian. This is a correctness fix, an LLM
>     points out that an infoleak is possible, but the topology to
>     achieve it is not clear.
>
> 12) Skip IPv4 options if present when building the IPv4 reject reply.
>     Otherwise bytes in the IPv4 options header can be sent back to
>     origin where the ICMP header is expected. Again from
>     Florian Westphal.
>  
> 13) Replace timer API for expectation by GC worker approach. This
>     is implicitly fixing a race between nf_ct_remove_expectations()
>     which might fail to remove the expectation due to timer_del()
>     returning false because timer has expired and callback is
>     being run concurrently. This fix is addressing a crash that has
>     been already reported with a reproducer.
> 
> 14) Check if br_vlan_get_pvid_rcu() fails, otherwise a stack
>     infoleak of 4 bytes is possible. From Florian Westphal.
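
The silent truncation mentioned in item 9 can be illustrated in
userspace: a 32-bit offset stored into a 16-bit field keeps only its low
16 bits. This is a minimal sketch of the failure mode and of the
reject-instead approach, not the nft_payload code; the function names
store_offset_truncating() and store_offset_checked() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Failure mode: the upper 16 bits of the offset are silently dropped. */
static uint16_t store_offset_truncating(uint32_t offset)
{
	return (uint16_t)offset;
}

/*
 * Assertive variant: refuse offsets that do not fit in 16 bits.
 * Returns 0 and writes *out on success, -ERANGE otherwise (*out is
 * left untouched).
 */
static int store_offset_checked(uint32_t offset, uint16_t *out)
{
	assert(out != 0);
	if (offset > UINT16_MAX)
		return -ERANGE;

	*out = (uint16_t)offset;
	return 0;
}
```

With the unchecked variant an offset of 65540 ends up as 4, so a rule
would match a different part of the packet than the one requested.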
> 
> Please, pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf.git nf-26-06-21
> 
> Thanks.
> 
> ----------------------------------------------------------------
> 
> The following changes since commit 96e7f9122aae0ed000ee321f324b812a447906d9:
> 
>   eth: fbnic: take netif_addr_lock_bh() around rx mode address programming (2026-06-18 18:36:26 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf.git tags/nf-26-06-21
> 
> for you to fetch changes up to 27dd2997746d54ebc079bb13161cc1bdd401d4a6:
> 
>   netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak (2026-06-21 00:18:37 +0200)
> 
> ----------------------------------------------------------------
> netfilter pull request 26-06-21
> 
> ----------------------------------------------------------------
> Adrian Bente (1):
>       netfilter: flowtable: fix offloaded ct timeout never being extended
> 
> Florian Westphal (5):
>       netfilter: nft_payload: reject offsets exceeding 65535 bytes
>       netfilter: nft_meta_bridge: add validate callback for get operations
>       netfilter: nft_flow_offload: zero device address for non-ether case
>       netfilter: nf_reject: skip iphdr options when looking for icmp header
>       netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak
> 
> Haoze Xie (1):
>       netfilter: nf_queue: pin bridge device while NFQUEUE holds fake dst
> 
> Jozsef Kadlecsik (4):
>       netfilter: ipset: Don't use test_bit() in lockless RCU readers in hash types
>       netfilter: ipset: Don't use test_bit() in lockless RCU readers in bitmap types
>       netfilter: ipset: fix order of kfree_rcu() and rcu_assign_pointer()
>       netfilter: ipset: make sure gc is properly stopped
> 
> Lorenzo Bianconi (1):
>       netfilter: flowtable: fix and simplify IP6IP6 tunnel handling
> 
> Pablo Neira Ayuso (1):
>       netfilter: nf_conntrack_expect: use conntrack GC to reap expectations
> 
> Wyatt Feng (1):
>       netfilter: xt_cluster: reject template conntracks in hash match
> 
>  include/net/netfilter/nf_conntrack_expect.h        |  16 ++-
>  include/net/netfilter/nf_queue.h                   |   1 +
>  include/net/netfilter/nft_meta.h                   |   2 +
>  include/uapi/linux/netfilter/nf_conntrack_common.h |   1 +
>  net/bridge/netfilter/nft_meta_bridge.c             |  23 +++-
>  net/ipv4/netfilter/nf_reject_ipv4.c                |   2 +-
>  net/ipv6/ip6_tunnel.c                              |   7 +
>  net/netfilter/ipset/ip_set_bitmap_gen.h            |   4 +-
>  net/netfilter/ipset/ip_set_bitmap_ip.c             |   2 +-
>  net/netfilter/ipset/ip_set_bitmap_ipmac.c          |   2 +-
>  net/netfilter/ipset/ip_set_bitmap_port.c           |   2 +-
>  net/netfilter/ipset/ip_set_core.c                  |   4 +-
>  net/netfilter/ipset/ip_set_hash_gen.h              |  12 +-
>  net/netfilter/nf_conntrack_core.c                  |  33 ++++-
>  net/netfilter/nf_conntrack_expect.c                | 145 ++++++++++-----------
>  net/netfilter/nf_conntrack_h323_main.c             |   4 +-
>  net/netfilter/nf_conntrack_helper.c                |  10 +-
>  net/netfilter/nf_conntrack_netlink.c               |  22 ++--
>  net/netfilter/nf_conntrack_sip.c                   |  13 +-
>  net/netfilter/nf_flow_table_core.c                 |  13 +-
>  net/netfilter/nf_flow_table_ip.c                   |  80 +++---------
>  net/netfilter/nf_flow_table_path.c                 |   4 +-
>  net/netfilter/nf_queue.c                           |  14 ++
>  net/netfilter/nfnetlink_queue.c                    |   3 +
>  net/netfilter/nft_ct.c                             |   3 +-
>  net/netfilter/nft_meta.c                           |   5 +-
>  net/netfilter/nft_payload.c                        |  16 ++-
>  net/netfilter/xt_cluster.c                         |   2 +-
>  .../selftests/net/netfilter/nft_flowtable.sh       |   8 +-
>  29 files changed, 254 insertions(+), 199 deletions(-)
> 

^ permalink raw reply

* Re: [PATCH 1/2] fs: Add bpf_sock_read_xattr() kfunc to read socket xattrs
From: Christian Brauner @ 2026-06-22  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christian Brauner, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Alexander Viro,
	Jan Kara, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	linux-fsdevel, netdev, bpf, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu,
	Yonghong Song, Jiri Olsa
In-Reply-To: <DJDJX62AS415.2BVILN08QK149@gmail.com>

> lgtm.
> How do you want to route it? Thought vfs tree for the next merge window?

Yes, thank you for looking!


^ permalink raw reply

* Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
From: Beniamino Galvani @ 2026-06-22  8:32 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Ralf Lici, netdev, Daniel Gröber, Antonio Quartulli,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-kernel
In-Reply-To: <87a4tab1vs.fsf@toke.dk>

On Thu, Jun 04, 2026 at 08:23:51PM +0200, Toke Høiland-Jørgensen wrote:
> Ralf Lici <ralf@mandelbit.com> writes:
> 
> > This commit introduces the core start_xmit processing flow: validate,
> > select action, translate, and forward. It centralizes action resolution
> > in the dispatch layer and keeps per-direction translation logic separate
> > from device glue. The result is a single data-path entry point with
> > explicit control over drop/forward/emit behavior.
> >
> > Signed-off-by: Ralf Lici <ralf@mandelbit.com>
> 
> This is very cool! Going quickly through the series, this seems like
> thorough work that will be cool to have available in the kernel, so
> thanks for doing this! I'll be quite happy to retire my barebones
> BPF-based implementation once this lands :)

Hi,

speaking as a maintainer of NetworkManager, I would also like to see
this feature in the kernel!

In NetworkManager currently we are using a BPF program [1] to
implement the CLAT, but that approach comes with limitations: for
example, we can't fragment v4->v6 packets if needed, and it's not
possible to recompute checksums in certain cases (e.g. for v4->v6 UDP
packets with zero checksum, and for fragmented ICMP). systemd-networkd
is also adding CLAT support via BPF [2], with a fallback to userspace
for the cases that can't be handled in kernel.

It would be very useful to have a native in-kernel CLAT that solves
the limitations of BPF-based solutions, and can be used by different
tools without having to re-implement everything from scratch.

Beniamino

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/1.57.2-dev/src/core/bpf/clat.bpf.c
[2] https://github.com/systemd/systemd/pull/41412

^ permalink raw reply

* [PATCH net V3 0/3] net/mlx5e: Fix crashes in dynamic per-channel stats and HV VHCA agent
From: Tariq Toukan @ 2026-06-22  8:36 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Cosmin Ratiu, Eran Ben Elisha, Feng Liu, Haiyang Zhang,
	Lama Kayal, Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Nimrod Oren, Saeed Mahameed, Tariq Toukan, Gal Pressman,
	Alexei Lazar, Simon Horman, Carolina Jubran, Kees Cook,
	Eran Ben Elisha, Saeed Mahameed

Hi,

Since per-channel stats were converted to be allocated and published
lazily at first channel open in commit fa691d0c9c08 ("net/mlx5e:
Allocate per-channel stats dynamically at first usage"),
priv->channel_stats[] and priv->stats_nch are filled in
incrementally during interface bring-up. This opened a window in
which the various stats readers - most of them reachable from
userspace via netlink/netdev stats queries - can race with
mlx5e_open_channel() on another CPU and observe partially
initialized state. The HV VHCA stats agent, which is created
before the channels are opened, hits related problems of its own.

This series by Feng fixes the resulting crashes.

Regards,
Tariq

V3:
- Rebase on current net.

V2:
https://lore.kernel.org/all/20260617140127.573117-1-tariqt@nvidia.com/

Feng Liu (3):
  net/mlx5e: Fix HV VHCA stats zero-sized buffer allocation
  net/mlx5e: Fix HV VHCA stats agent registration race
  net/mlx5e: Fix publication race for priv->channel_stats[]

 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 12 ++++++
 .../mellanox/mlx5/core/en/hv_vhca_stats.c     | 38 +++++++++++++------
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 14 ++++---
 .../ethernet/mellanox/mlx5/core/en_stats.c    |  9 +++--
 .../ethernet/mellanox/mlx5/core/ipoib/ipoib.c |  3 +-
 .../ethernet/mellanox/mlx5/core/lib/hv_vhca.c |  8 +++-
 .../ethernet/mellanox/mlx5/core/lib/hv_vhca.h |  6 ++-
 7 files changed, 63 insertions(+), 27 deletions(-)

base-commit: d07d80b6a129a44538cda1549b7acf95154fb197
-- 
2.44.0

^ permalink raw reply

* [PATCH net V3 1/3] net/mlx5e: Fix HV VHCA stats zero-sized buffer allocation
From: Tariq Toukan @ 2026-06-22  8:36 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Cosmin Ratiu, Eran Ben Elisha, Feng Liu, Haiyang Zhang,
	Lama Kayal, Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Nimrod Oren, Saeed Mahameed, Tariq Toukan, Gal Pressman,
	Alexei Lazar, Simon Horman, Carolina Jubran, Kees Cook,
	Eran Ben Elisha, Saeed Mahameed
In-Reply-To: <20260622083646.593220-1-tariqt@nvidia.com>

From: Feng Liu <feliu@nvidia.com>

mlx5e_hv_vhca_stats_create() is called from mlx5e_nic_enable(),
before mlx5e_open(). At that point priv->stats_nch is still zero,
because it is only ever incremented in mlx5e_channel_stats_alloc(),
which is reached only from mlx5e_open_channel().

mlx5e_hv_vhca_stats_buf_size() therefore returns 0, and
kvzalloc(0, GFP_KERNEL) returns ZERO_SIZE_PTR ((void *)16) rather
than NULL. The "if (!buf)" guard does not catch this, and
mlx5e_hv_vhca_stats_create() completes "successfully" with
priv->stats_agent.buf set to ZERO_SIZE_PTR.

Once channels are opened (priv->stats_nch > 0) and the hypervisor
enables stats reporting, mlx5e_hv_vhca_stats_work() recomputes
buf_len using the new non-zero stats_nch and calls
memset(buf, 0, buf_len) on ZERO_SIZE_PTR, faulting at address 0x10.

Allocate the buffer based on priv->max_nch, which is set in
mlx5e_priv_init() and is the upper bound on stats_nch:

  - Add a separate helper mlx5e_hv_vhca_stats_buf_max_size() that
    returns sizeof(per_ring_stats) * max(max_nch, stats_nch), and
    use it for the kvzalloc() in mlx5e_hv_vhca_stats_create().
  - Keep mlx5e_hv_vhca_stats_buf_size() (which returns based on
    stats_nch) for the worker's active payload size, so the wire
    format (block->rings = stats_nch) and the amount of data filled
    by mlx5e_hv_vhca_fill_stats() are unchanged.

The max(max_nch, stats_nch) guard handles the rare case where
mlx5e_attach_netdev() recomputes max_nch downward across a
detach/resume cycle while priv->stats_nch persists (mlx5e_detach_netdev
does not call mlx5e_priv_cleanup, so stats_nch is only reset when
the netdev is destroyed). Without the guard, the worker could compute
buf_len from stats_nch and overrun the smaller buffer allocated based
on the reduced max_nch.

This mirrors the existing mlx5e pattern of preallocating arrays of
size max_nch (e.g. priv->channel_stats) and lazily populating
entries up to stats_nch on demand.
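
The failure mode and the sizing rule described above can be modelled in
userspace. This is a minimal sketch under stated assumptions, not the
mlx5e code: model_zalloc() is a hypothetical allocator that only mimics
the zero-size behaviour described in this message, and buf_size() and
buf_max_size() are hypothetical stand-ins for the two helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Same sentinel value as quoted in the commit message. */
#define ZERO_SIZE_PTR ((void *)16)

/* Hypothetical model: size 0 yields the sentinel, never NULL. */
static void *model_zalloc(size_t size)
{
	if (size == 0)
		return ZERO_SIZE_PTR;
	return calloc(1, size);
}

/* Free only real allocations; the sentinel owns no memory. */
static void model_free(void *p)
{
	if (p && p != ZERO_SIZE_PTR)
		free(p);
}

/* Active payload size: follows the number of published channels. */
static size_t buf_size(size_t per_ring, unsigned int stats_nch)
{
	return per_ring * stats_nch;
}

/* Allocation size: covers whichever of the two counts is larger. */
static size_t buf_max_size(size_t per_ring, unsigned int max_nch,
			   unsigned int stats_nch)
{
	unsigned int n = max_nch > stats_nch ? max_nch : stats_nch;

	return per_ring * n;
}
```

In this model a buffer allocated with stats_nch == 0 passes a "!buf"
check while owning no memory, whereas sizing by the larger of the two
counts always leaves room for the payload the worker computes later.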

Fixes: fa691d0c9c08 ("net/mlx5e: Allocate per-channel stats dynamically at first usage")
Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c    | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
index 195863b2c013..06cbd49d4e98 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
@@ -54,6 +54,12 @@ static int mlx5e_hv_vhca_stats_buf_size(struct mlx5e_priv *priv)
 		priv->stats_nch);
 }

+static int mlx5e_hv_vhca_stats_buf_max_size(struct mlx5e_priv *priv)
+{
+	return (sizeof(struct mlx5e_hv_vhca_per_ring_stats) *
+		max(priv->max_nch, priv->stats_nch));
+}
+
 static void mlx5e_hv_vhca_stats_work(struct work_struct *work)
 {
 	struct mlx5e_hv_vhca_stats_agent *sagent;
@@ -122,7 +128,7 @@ static void mlx5e_hv_vhca_stats_cleanup(struct mlx5_hv_vhca_agent *agent)

 void mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv)
 {
-	int buf_len = mlx5e_hv_vhca_stats_buf_size(priv);
+	int buf_len = mlx5e_hv_vhca_stats_buf_max_size(priv);
 	struct mlx5_hv_vhca_agent *agent;

 	priv->stats_agent.buf = kvzalloc(buf_len, GFP_KERNEL);
-- 
2.44.0

^ permalink raw reply related

* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Jani Nikula @ 2026-06-22  8:37 UTC (permalink / raw)
  To: Kaitao Cheng, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König
  Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, linux-kernel, cgroups,
	linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
	dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, chengkaitao
In-Reply-To: <20260622040533.29824-1-kaitao.cheng@linux.dev>

On Mon, 22 Jun 2026, Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> Add *_mutable() iterator variants for list, hlist and llist.  The new
> helpers are variadic and support both forms.  In the common case, the
> caller omits the temporary cursor and the macro creates a unique internal
> cursor with typeof(pos) and __UNIQUE_ID().  If a loop really needs an
> explicit temporary cursor, the caller can still pass it and the helper
> keeps the existing *_safe() behaviour.
>
> For example, a call site may use the shorter form:
>
>   list_for_each_entry_mutable(pos, head, member)
>
> or keep the explicit temporary cursor form:
>
>   list_for_each_entry_mutable(pos, tmp, head, member)

I'm unconvinced it's a good idea to allow two forms with macro trickery,
*especially* when it's not the last argument you can omit. I think it's
a footgun.

IMO stick with the first form only, and there'll always be the _safe
variant that can be used when the temp pointer is needed.


BR,
Jani.


-- 
Jani Nikula, Intel

^ permalink raw reply

* [PATCH net V3 2/3] net/mlx5e: Fix HV VHCA stats agent registration race
From: Tariq Toukan @ 2026-06-22  8:36 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Cosmin Ratiu, Eran Ben Elisha, Feng Liu, Haiyang Zhang,
	Lama Kayal, Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Nimrod Oren, Saeed Mahameed, Tariq Toukan, Gal Pressman,
	Alexei Lazar, Simon Horman, Carolina Jubran, Kees Cook,
	Eran Ben Elisha, Saeed Mahameed
In-Reply-To: <20260622083646.593220-1-tariqt@nvidia.com>

From: Feng Liu <feliu@nvidia.com>

mlx5e_hv_vhca_stats_create() registers the stats agent through
mlx5_hv_vhca_agent_create(). The helper publishes the agent in
hv_vhca->agents[type] under agents_lock and immediately schedules an
asynchronous control invalidation on the HV VHCA workqueue before
returning to mlx5e.

The asynchronous invalidation invokes the control agent's invalidate
callback, which reads the hypervisor control block and forwards the
command to mlx5e_hv_vhca_stats_control(). That callback may either:

  - call cancel_delayed_work_sync(&priv->stats_agent.work), or
  - call queue_delayed_work(priv->wq, &sagent->work, sagent->delay).

However, the delayed_work and priv->stats_agent.agent are only
initialized after mlx5_hv_vhca_agent_create() returns to mlx5e:

    agent = mlx5_hv_vhca_agent_create(...);   /* publish + invalidate */
    ...
    priv->stats_agent.agent = agent;          /* too late */
    INIT_DELAYED_WORK(&priv->stats_agent.work, ...); /* too late */

If the asynchronous control path runs before the two assignments
above, it can:

  - Operate on an uninitialized delayed_work whose timer.function is
    NULL. queue_delayed_work() calls add_timer() unconditionally, so
    when the timer expires the timer softirq invokes a NULL function
    pointer.
  - Re-initialize the timer later through INIT_DELAYED_WORK() while
    the timer is already enqueued in the timer wheel, corrupting the
    hlist (entry.pprev cleared while the previous bucket node still
    points at this entry).
  - Dereference a NULL sagent->agent inside
    mlx5_hv_vhca_agent_write() once the worker eventually runs
    mlx5e_hv_vhca_stats_work().

Fix this by:

  - Initializing priv->stats_agent.work before invoking
    mlx5_hv_vhca_agent_create(), so the work is always in a valid
    state when the control callback observes it.
  - Adding a struct mlx5_hv_vhca_agent **ctx_update out-parameter
    to mlx5_hv_vhca_agent_create(). The helper writes the agent
    pointer to *ctx_update before publishing into hv_vhca->agents[]
    and triggering the agents_update flow, so any callback
    subsequently invoked from that flow already sees a valid
    priv->stats_agent.agent. This avoids having the control
    callback participate in agent initialization.
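
For illustration only (not part of the patch): a minimal userspace
sketch of the ordering described above. All names here (struct agent,
struct stats_ctx, agent_create(), stats_create()) are hypothetical
stand-ins rather than the mlx5 API, and the direct callback call merely
models the worst case where the asynchronous invalidation runs before
the create helper returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real mlx5 types. */
struct agent {
	void (*control)(struct agent *agent);
	void *priv;
};

struct stats_ctx {
	struct agent *agent;	/* back-pointer the callback relies on */
	int work_ready;		/* models INIT_DELAYED_WORK() having run */
	int callback_saw_valid_state;
};

static struct agent the_agent;		/* static storage instead of an allocation */
static struct agent *published_agent;	/* models hv_vhca->agents[type] */

/* Control callback: may run as soon as the agent is published. */
static void stats_control(struct agent *agent)
{
	struct stats_ctx *ctx = agent->priv;

	ctx->callback_saw_valid_state = ctx->work_ready && ctx->agent == agent;
}

/*
 * Write the caller's back-pointer first, then publish, then trigger the
 * callback. Calling it directly models the invalidation racing ahead of
 * the return to the caller.
 */
static struct agent *agent_create(void (*control)(struct agent *),
				  void *priv, struct agent **ctx_update)
{
	struct agent *agent = &the_agent;

	agent->control = control;
	agent->priv = priv;

	if (ctx_update)
		*ctx_update = agent;

	published_agent = agent;
	published_agent->control(published_agent);

	return agent;
}

/* Returns 1 when the callback observed fully initialized state. */
static int stats_create(struct stats_ctx *ctx)
{
	ctx->work_ready = 1;	/* initialize the work before publishing */
	agent_create(stats_control, ctx, &ctx->agent);

	return ctx->callback_saw_valid_state;
}
```

In this model, moving either the work_ready assignment or the
*ctx_update store to after the publication step would make the
callback observe incomplete state.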

While at it, clear priv->stats_agent.{agent,buf} after teardown and
on the agent_create() failure path. Without this, an enable/disable
cycle hitting an early-return in create can lead to a UAF or
double-destroy of stale pointers from the previous cycle.

Fixes: cef35af34d6d ("net/mlx5e: Add mlx5e HV VHCA stats agent")
Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../mellanox/mlx5/core/en/hv_vhca_stats.c     | 22 ++++++++++++-------
 .../ethernet/mellanox/mlx5/core/lib/hv_vhca.c |  8 +++++--
 .../ethernet/mellanox/mlx5/core/lib/hv_vhca.h |  6 +++--
 3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
index 06cbd49d4e98..2e495442a547 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
@@ -73,7 +73,7 @@ static void mlx5e_hv_vhca_stats_work(struct work_struct *work)
 	sagent = container_of(dwork, struct mlx5e_hv_vhca_stats_agent, work);
 	priv = container_of(sagent, struct mlx5e_priv, stats_agent);
 	buf_len = mlx5e_hv_vhca_stats_buf_size(priv);
-	agent = sagent->agent;
+	agent = READ_ONCE(sagent->agent);
 	buf = sagent->buf;
 
 	memset(buf, 0, buf_len);
@@ -135,11 +135,14 @@ void mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv)
 	if (!priv->stats_agent.buf)
 		return;
 
+	INIT_DELAYED_WORK(&priv->stats_agent.work, mlx5e_hv_vhca_stats_work);
+
 	agent = mlx5_hv_vhca_agent_create(priv->mdev->hv_vhca,
 					  MLX5_HV_VHCA_AGENT_STATS,
 					  mlx5e_hv_vhca_stats_control, NULL,
 					  mlx5e_hv_vhca_stats_cleanup,
-					  priv);
+					  priv,
+					  &priv->stats_agent.agent);
 
 	if (IS_ERR_OR_NULL(agent)) {
 		if (IS_ERR(agent))
@@ -148,18 +151,21 @@ void mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv)
 				    agent);
 
 		kvfree(priv->stats_agent.buf);
-		return;
+		priv->stats_agent.buf = NULL;
 	}
-
-	priv->stats_agent.agent = agent;
-	INIT_DELAYED_WORK(&priv->stats_agent.work, mlx5e_hv_vhca_stats_work);
 }
 
 void mlx5e_hv_vhca_stats_destroy(struct mlx5e_priv *priv)
 {
-	if (IS_ERR_OR_NULL(priv->stats_agent.agent))
+	struct mlx5_hv_vhca_agent *agent;
+
+	agent = READ_ONCE(priv->stats_agent.agent);
+	if (IS_ERR_OR_NULL(agent))
 		return;
 
-	mlx5_hv_vhca_agent_destroy(priv->stats_agent.agent);
+	mlx5_hv_vhca_agent_destroy(agent);
 	kvfree(priv->stats_agent.buf);
+
+	WRITE_ONCE(priv->stats_agent.agent, NULL);
+	priv->stats_agent.buf = NULL;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
index d6dc7bce855e..305752dab7bd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
@@ -190,7 +190,7 @@ mlx5_hv_vhca_control_agent_create(struct mlx5_hv_vhca *hv_vhca)
 	return mlx5_hv_vhca_agent_create(hv_vhca, MLX5_HV_VHCA_AGENT_CONTROL,
 					 NULL,
 					 mlx5_hv_vhca_control_agent_invalidate,
-					 NULL, NULL);
+					 NULL, NULL, NULL);
 }
 
 static void mlx5_hv_vhca_control_agent_destroy(struct mlx5_hv_vhca_agent *agent)
@@ -256,7 +256,8 @@ mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
 			  void (*invalidate)(struct mlx5_hv_vhca_agent*,
 					     u64 block_mask),
 			  void (*cleaup)(struct mlx5_hv_vhca_agent *agent),
-			  void *priv)
+			  void *priv,
+			  struct mlx5_hv_vhca_agent **ctx_update)
 {
 	struct mlx5_hv_vhca_agent *agent;
 
@@ -284,6 +285,9 @@ mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
 	agent->invalidate = invalidate;
 	agent->cleanup   = cleaup;
 
+	if (ctx_update)
+		WRITE_ONCE(*ctx_update, agent);
+
 	mutex_lock(&hv_vhca->agents_lock);
 	hv_vhca->agents[type] = agent;
 	mutex_unlock(&hv_vhca->agents_lock);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
index f240ffe5116c..8b3974cf0ee4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
@@ -43,7 +43,8 @@ mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
 			  void (*invalidate)(struct mlx5_hv_vhca_agent*,
 					     u64 block_mask),
 			  void (*cleanup)(struct mlx5_hv_vhca_agent *agent),
-			  void *context);
+			  void *context,
+			  struct mlx5_hv_vhca_agent **ctx_update);
 
 void mlx5_hv_vhca_agent_destroy(struct mlx5_hv_vhca_agent *agent);
 int mlx5_hv_vhca_agent_write(struct mlx5_hv_vhca_agent *agent,
@@ -84,7 +85,8 @@ mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
 			  void (*invalidate)(struct mlx5_hv_vhca_agent*,
 					     u64 block_mask),
 			  void (*cleanup)(struct mlx5_hv_vhca_agent *agent),
-			  void *context)
+			  void *context,
+			  struct mlx5_hv_vhca_agent **ctx_update)
 {
 	return NULL;
 }
-- 
2.44.0


^ permalink raw reply related

* [PATCH net V3 3/3] net/mlx5e: Fix publication race for priv->channel_stats[]
From: Tariq Toukan @ 2026-06-22  8:36 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Cosmin Ratiu, Eran Ben Elisha, Feng Liu, Haiyang Zhang,
	Lama Kayal, Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Nimrod Oren, Saeed Mahameed, Tariq Toukan, Gal Pressman,
	Alexei Lazar, Simon Horman, Carolina Jubran, Kees Cook,
	Eran Ben Elisha, Saeed Mahameed
In-Reply-To: <20260622083646.593220-1-tariqt@nvidia.com>

From: Feng Liu <feliu@nvidia.com>

mlx5e_channel_stats_alloc() publishes a new entry to
priv->channel_stats[] and then increments priv->stats_nch as a
publication token, but neither store carries any memory barrier:

	priv->channel_stats[ix] = kvzalloc_node(...);
	if (!priv->channel_stats[ix])
		return -ENOMEM;
	priv->stats_nch++;

Concurrent readers compute the loop bound from priv->stats_nch and
then dereference priv->channel_stats[i] using plain accesses, e.g.

	for (i = 0; i < priv->stats_nch; i++) {
		struct mlx5e_channel_stats *cs = priv->channel_stats[i];
		... cs->rq.packets ...
	}

On weakly-ordered architectures (ARM, PowerPC, RISC-V) the writes to
channel_stats[ix] and stats_nch may become visible to other CPUs out
of program order. A reader can observe stats_nch == N while still
seeing channel_stats[N-1] == NULL, leading to a NULL pointer
dereference in the channel_stats loop.

This has been observed in production on BlueField-3 DPUs (arm64),
where ovs-vswitchd queries netdev statistics over netlink during NIC
bringup, racing mlx5e_open_channel() -> mlx5e_channel_stats_alloc()
on another CPU:

  Unable to handle kernel NULL pointer dereference at virtual address 0x840
  Hardware name: BlueField-3 DPU
  pc : mlx5e_fold_sw_stats64+0x30/0x180 [mlx5_core]
  Call trace:
   mlx5e_fold_sw_stats64+0x30/0x180 [mlx5_core]
   dev_get_stats+0x50/0xc0
   ovs_vport_get_stats+0x38/0xac [openvswitch]
   ovs_vport_cmd_fill_info+0x194/0x290 [openvswitch]
   ovs_vport_cmd_get+0xbc/0x10c [openvswitch]
   genl_family_rcv_msg_doit+0xd0/0x160
   genl_rcv_msg+0xec/0x1f0
   netlink_rcv_skb+0x64/0x130
   genl_rcv+0x40/0x60
   netlink_unicast+0x2fc/0x370
   netlink_sendmsg+0x1dc/0x454
   ...
   __arm64_sys_sendmsg+0x2c/0x40

Add mlx5e_stats_nch_write() and mlx5e_stats_nch_read() helpers in en.h
that wrap the smp_store_release()/smp_load_acquire() pair on stats_nch.
The release/acquire pair establishes the contract:

  stats_nch == N  =>  channel_stats[0..N-1] are visible and non-NULL.

Publish the stats_nch increment via mlx5e_stats_nch_write() in the
writer (mlx5e_channel_stats_alloc()), and read stats_nch via
mlx5e_stats_nch_read() in all readers: mlx5e RX/TX queue stats,
mlx5e_get_base_stats(), ethtool channels stats, IPoIB stats, the
sw_stats fold and the HV VHCA stats agent.
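
For illustration only (not part of the patch): a userspace analogue of
the release/acquire contract, written with C11 atomics in place of
smp_store_release()/smp_load_acquire(). The struct layout, MAX_NCH,
the static stats_pool and fold_packets() are assumptions made for this
sketch, and it assumes a single, serialized writer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_NCH 4	/* hypothetical bound for the sketch */

struct channel_stats {
	unsigned long packets;
};

struct priv {
	struct channel_stats *channel_stats[MAX_NCH];
	_Atomic unsigned short stats_nch;
};

static struct channel_stats stats_pool[MAX_NCH];	/* stands in for kvzalloc_node() */

static unsigned short stats_nch_read(struct priv *priv)
{
	/* Pairs with the release store in stats_nch_write(). */
	return atomic_load_explicit(&priv->stats_nch, memory_order_acquire);
}

static void stats_nch_write(struct priv *priv, unsigned short n)
{
	/* Pairs with the acquire load in stats_nch_read(). */
	atomic_store_explicit(&priv->stats_nch, n, memory_order_release);
}

/* Writer: store the pointer first, then publish the new count. */
static int channel_stats_alloc(struct priv *priv, int ix)
{
	unsigned short n = atomic_load_explicit(&priv->stats_nch,
						memory_order_relaxed);

	if (ix < 0 || ix >= MAX_NCH || n >= MAX_NCH)
		return -1;

	priv->channel_stats[ix] = &stats_pool[ix];
	stats_nch_write(priv, (unsigned short)(n + 1));

	return 0;
}

/* Reader: sample the count once, then only touch entries below it. */
static unsigned long fold_packets(struct priv *priv)
{
	unsigned short nch = stats_nch_read(priv);
	unsigned long total = 0;
	int i;

	for (i = 0; i < nch; i++)
		total += priv->channel_stats[i]->packets;

	return total;
}
```

Sampling the count once per reader also keeps the loop bound stable if
the writer publishes another entry while the loop is running.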

Fixes: fa691d0c9c08 ("net/mlx5e: Allocate per-channel stats dynamically at first usage")
Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       | 12 ++++++++++++
 .../ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c | 10 ++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 14 ++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  9 +++++----
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |  3 ++-
 5 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 2270e2e550dd..d507289096c2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -987,6 +987,18 @@ struct mlx5e_priv {
 	struct ethtool_fec_hist_range *fec_ranges;
 };
 
+static inline u16 mlx5e_stats_nch_read(const struct mlx5e_priv *priv)
+{
+	/* Pairs with smp_store_release in mlx5e_stats_nch_write(). */
+	return smp_load_acquire(&priv->stats_nch);
+}
+
+static inline void mlx5e_stats_nch_write(struct mlx5e_priv *priv, u16 n)
+{
+	/* Pairs with smp_load_acquire in mlx5e_stats_nch_read(). */
+	smp_store_release(&priv->stats_nch, n);
+}
+
 struct mlx5e_dev {
 	struct net_device *netdev;
 	struct devlink_port dl_port;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
index 2e495442a547..9747d7736d37 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
@@ -33,9 +33,10 @@ mlx5e_hv_vhca_fill_ring_stats(struct mlx5e_priv *priv, int ch,
 static void mlx5e_hv_vhca_fill_stats(struct mlx5e_priv *priv, void *data,
 				     int buf_len)
 {
+	u16 nch = mlx5e_stats_nch_read(priv);
 	int ch, i = 0;
 
-	for (ch = 0; ch < priv->stats_nch; ch++) {
+	for (ch = 0; ch < nch; ch++) {
 		void *buf = data + i;
 
 		if (WARN_ON_ONCE(buf +
@@ -50,8 +51,9 @@ static void mlx5e_hv_vhca_fill_stats(struct mlx5e_priv *priv, void *data,
 
 static int mlx5e_hv_vhca_stats_buf_size(struct mlx5e_priv *priv)
 {
-	return (sizeof(struct mlx5e_hv_vhca_per_ring_stats) *
-		priv->stats_nch);
+	u16 nch = mlx5e_stats_nch_read(priv);
+
+	return sizeof(struct mlx5e_hv_vhca_per_ring_stats) * nch;
 }
 
 static int mlx5e_hv_vhca_stats_buf_max_size(struct mlx5e_priv *priv)
@@ -106,7 +108,7 @@ static void mlx5e_hv_vhca_stats_control(struct mlx5_hv_vhca_agent *agent,
 	sagent = &priv->stats_agent;
 
 	block->version = MLX5_HV_VHCA_STATS_VERSION;
-	block->rings   = priv->stats_nch;
+	block->rings   = mlx5e_stats_nch_read(priv);
 
 	if (!block->command) {
 		cancel_delayed_work_sync(&priv->stats_agent.work);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 775f0c6e55c9..aa8610cedaa8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2773,7 +2773,7 @@ static int mlx5e_channel_stats_alloc(struct mlx5e_priv *priv, int ix, int cpu)
 						GFP_KERNEL, cpu_to_node(cpu));
 	if (!priv->channel_stats[ix])
 		return -ENOMEM;
-	priv->stats_nch++;
+	mlx5e_stats_nch_write(priv, priv->stats_nch + 1);
 
 	return 0;
 }
@@ -4040,9 +4040,10 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
 
 void mlx5e_fold_sw_stats64(struct mlx5e_priv *priv, struct rtnl_link_stats64 *s)
 {
+	u16 nch = mlx5e_stats_nch_read(priv);
 	int i;
 
-	for (i = 0; i < priv->stats_nch; i++) {
+	for (i = 0; i < nch; i++) {
 		struct mlx5e_channel_stats *channel_stats = priv->channel_stats[i];
 		struct mlx5e_rq_stats *xskrq_stats = &channel_stats->xskrq;
 		struct mlx5e_rq_stats *rq_stats = &channel_stats->rq;
@@ -5488,7 +5489,7 @@ static void mlx5e_get_queue_stats_rx(struct net_device *dev, int i,
 	struct mlx5e_rq_stats *xskrq_stats;
 	struct mlx5e_rq_stats *rq_stats;
 
-	if (mlx5e_is_uplink_rep(priv) || !priv->stats_nch)
+	if (mlx5e_is_uplink_rep(priv) || !mlx5e_stats_nch_read(priv))
 		return;
 
 	channel_stats = priv->channel_stats[i];
@@ -5512,7 +5513,7 @@ static void mlx5e_get_queue_stats_tx(struct net_device *dev, int i,
 	struct mlx5e_priv *priv = netdev_priv(dev);
 	struct mlx5e_sq_stats *sq_stats;
 
-	if (!priv->stats_nch)
+	if (!mlx5e_stats_nch_read(priv))
 		return;
 
 	/* no special case needed for ptp htb etc since txq2sq_stats is kept up
@@ -5538,6 +5539,7 @@ static void mlx5e_get_base_stats(struct net_device *dev,
 				 struct netdev_queue_stats_tx *tx)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
+	u16 nch = mlx5e_stats_nch_read(priv);
 	struct mlx5e_ptp *ptp_channel;
 	int i, tc;
 
@@ -5549,7 +5551,7 @@ static void mlx5e_get_base_stats(struct net_device *dev,
 		rx->hw_gro_wire_packets = 0;
 		rx->hw_gro_wire_bytes = 0;
 
-		for (i = priv->channels.params.num_channels; i < priv->stats_nch; i++) {
+		for (i = priv->channels.params.num_channels; i < nch; i++) {
 			struct netdev_queue_stats_rx rx_i = {0};
 
 			mlx5e_get_queue_stats_rx(dev, i, &rx_i);
@@ -5585,7 +5587,7 @@ static void mlx5e_get_base_stats(struct net_device *dev,
 	tx->stop = 0;
 	tx->wake = 0;
 
-	for (i = 0; i < priv->stats_nch; i++) {
+	for (i = 0; i < nch; i++) {
 		struct mlx5e_channel_stats *channel_stats = priv->channel_stats[i];
 
 		/* handle two cases:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 7f33261ba655..de38b60806c2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -515,6 +515,7 @@ static void mlx5e_stats_update_stats_rq_page_pool(struct mlx5e_channel *c)
 static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
 {
 	struct mlx5e_sw_stats *s = &priv->stats.sw;
+	u16 nch = mlx5e_stats_nch_read(priv);
 	int i;
 
 	memset(s, 0, sizeof(*s));
@@ -522,7 +523,7 @@ static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
 	for (i = 0; i < priv->channels.num; i++) /* for active channels only */
 		mlx5e_stats_update_stats_rq_page_pool(priv->channels.c[i]);
 
-	for (i = 0; i < priv->stats_nch; i++) {
+	for (i = 0; i < nch; i++) {
 		struct mlx5e_channel_stats *channel_stats =
 			priv->channel_stats[i];
 
@@ -2614,7 +2615,7 @@ static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(ptp) { return; }
 
 static MLX5E_DECLARE_STATS_GRP_OP_NUM_STATS(channels)
 {
-	int max_nch = priv->stats_nch;
+	int max_nch = mlx5e_stats_nch_read(priv);
 
 	return (NUM_RQ_STATS * max_nch) +
 	       (NUM_CH_STATS * max_nch) +
@@ -2627,8 +2628,8 @@ static MLX5E_DECLARE_STATS_GRP_OP_NUM_STATS(channels)
 
 static MLX5E_DECLARE_STATS_GRP_OP_FILL_STRS(channels)
 {
+	int max_nch = mlx5e_stats_nch_read(priv);
 	bool is_xsk = priv->xsk.ever_used;
-	int max_nch = priv->stats_nch;
 	int i, j, tc;
 
 	for (i = 0; i < max_nch; i++)
@@ -2660,8 +2661,8 @@ static MLX5E_DECLARE_STATS_GRP_OP_FILL_STRS(channels)
 
 static MLX5E_DECLARE_STATS_GRP_OP_FILL_STATS(channels)
 {
+	int max_nch = mlx5e_stats_nch_read(priv);
 	bool is_xsk = priv->xsk.ever_used;
-	int max_nch = priv->stats_nch;
 	int i, j, tc;
 
 	for (i = 0; i < max_nch; i++)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
index 0a6003fe60e9..674bed721e63 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
@@ -135,10 +135,11 @@ void mlx5i_cleanup(struct mlx5e_priv *priv)
 
 static void mlx5i_grp_sw_update_stats(struct mlx5e_priv *priv)
 {
+	u16 nch = mlx5e_stats_nch_read(priv);
 	struct rtnl_link_stats64 s = {};
 	int i, j;
 
-	for (i = 0; i < priv->stats_nch; i++) {
+	for (i = 0; i < nch; i++) {
 		struct mlx5e_channel_stats *channel_stats;
 		struct mlx5e_rq_stats *rq_stats;
 
-- 
2.44.0


^ permalink raw reply related

* Re: [PATCH 1/2] Protect skb pointer used by two different kernel instances
From: Eric Dumazet @ 2026-06-22  8:38 UTC (permalink / raw)
  To: Selvamani.Rajagopal
  Cc: Parthiban Veerasooran, Andrew Lunn, Piergiorgio Beruto,
	David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	linux-kernel, Andrew Lunn
In-Reply-To: <20260621-fix-race-condition-and-crash-v1-1-87e290d9357f@onsemi.com>

On Sun, Jun 21, 2026 at 9:33 PM Selvamani Rajagopal via B4 Relay
<devnull+Selvamani.Rajagopal.onsemi.com@kernel.org> wrote:
>
> From: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
>
> The threaded IRQ uses waiting_tx_skb. The transmit path also uses
> this pointer without any mutual exclusion. As a result, an skb
> can be leaked, particularly when the threaded IRQ runs in the
> middle of the transmit path, near skb_linearize().
>
> Fixes: b542d13fab0f ("net: ethernet: oa_tc6: Interrupt is active low, level triggered.")
> Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
> ---

OK but please use "net: ethernet: oa_tc6:" prefix in the patch title.

^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: David Laight @ 2026-06-22  8:42 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König, David Howells, Simona Vetter, Randy Dunlap,
	Luca Ceresoli, Philipp Stanner, linux-block, linux-kernel,
	cgroups, linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf,
	netdev, dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622040533.29824-2-kaitao.cheng@linux.dev>

On Mon, 22 Jun 2026 12:05:31 +0800
Kaitao Cheng <kaitao.cheng@linux.dev> wrote:

> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> The list_for_each*_safe() helpers are used when the loop body may
> remove the current entry.  Their API exposes the temporary cursor at
> every call site, even though most users only need it for the iterator
> implementation and never reference it in the loop body.
> 
> Add *_mutable() variants for list and hlist iteration.  The new helpers
> support both forms: callers may keep passing an explicit temporary cursor
> when they need to inspect or reset it, or omit it and let the helper use
> a unique internal cursor.

I'm not really sure 'mutable' means anything either.
It is possible to make it valid for the loop body (or even other threads)
to delete arbitrary list items - but that needs significant extra overhead.

It might be worth doing something that doesn't need the extra variable,
but there is little point doing all the churn just to rename things.
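
As a rough userspace sketch of an iterator that does not need the
extra variable at the call site (struct node, the list helpers and
for_each_node_removable() are hypothetical names, not proposed API):
the lookahead cursor is declared in the for-init clause, so it is
scoped to the loop. The fixed cursor name is a simplification; nested
loops would shadow it, which is what a unique-identifier helper would
avoid.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, just enough for the sketch. */
struct node {
	struct node *next, *prev;
	int value;
};

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = NULL;	/* poison, so a stale walk would fault */
	n->prev = NULL;
}

/*
 * Hypothetical iterator: the lookahead cursor lives in the for-init
 * clause, so callers only name pos and head.
 */
#define for_each_node_removable(pos, head)				\
	for (struct node *next__ = ((pos) = (head)->next)->next;	\
	     (pos) != (head);						\
	     (pos) = next__, next__ = (pos)->next)

/* Removes every odd-valued node and returns how many were removed. */
static int remove_odd(struct node *head)
{
	struct node *pos;
	int removed = 0;

	for_each_node_removable(pos, head) {
		if (pos->value & 1) {
			list_del(pos);
			removed++;
		}
	}

	return removed;
}
```

This only covers removal of the current entry; removing the lookahead
entry from inside the loop body would still break the walk.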

> 
> This makes call sites that only mutate the list through the current entry
> less noisy, while keeping the existing *_safe() helpers available for
> compatibility.
> 
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>  1 file changed, 231 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index 09d979976b3b..1081def7cea9 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -7,6 +7,7 @@
>  #include <linux/stddef.h>
>  #include <linux/poison.h>
>  #include <linux/const.h>
> +#include <linux/args.h>
>  
>  #include <asm/barrier.h>
>  
> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>  #define list_for_each_prev(pos, head) \
>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>  
> -/**
> - * list_for_each_safe - iterate over a list safe against removal of list entry
> - * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> +/*
> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>   */
>  #define list_for_each_safe(pos, n, head) \
>  	for (pos = (head)->next, n = pos->next; \
>  	     !list_is_head(pos, (head)); \
>  	     pos = n, n = pos->next)
>  
> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\

Use auto

> +	     !list_is_head(pos, (head));				\
> +	     pos = tmp, tmp = pos->next)
> +
> +#define __list_for_each_mutable1(pos, head)				\
> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> +
> +#define __list_for_each_mutable2(pos, next, head)			\
> +	list_for_each_safe(pos, next, head)
> +
>  /**
> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
> + * list_for_each_mutable - iterate over a list safe against entry removal
>   * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> + * @...:	either (head) or (next, head)
> + *
> + * next:	another &struct list_head to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your list.
> + */
> +#define list_for_each_mutable(pos, ...)					\
> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> +		(pos, __VA_ARGS__)

The variable argument count logic really just slows down compilation.
Maybe there aren't enough copies of this code to make that significant.
But just because you can do it doesn't mean it is a good idea.
I'm also not sure it really adds anything to the readability.

And, if you are going to make the middle argument optional there is
no need to change the macro name.
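
To make the dispatch concrete, here is a simplified userspace sketch
of selecting a macro body by argument count. ARG_COUNT(), CONCAT() and
for_range() are hypothetical names for this example and are not the
kernel's COUNT_ARGS()/CONCATENATE() implementations.

```c
#include <assert.h>

/* Simplified counter: handles one to three arguments only. */
#define ARG_COUNT_(_1, _2, _3, n, ...) n
#define ARG_COUNT(...) ARG_COUNT_(__VA_ARGS__, 3, 2, 1, 0)

#define CONCAT_(a, b) a##b
#define CONCAT(a, b) CONCAT_(a, b)

/* Two forms: (i, end) and (i, start, end); the middle argument is optional. */
#define for_range_1(i, end)		for ((i) = 0; (i) < (end); (i)++)
#define for_range_2(i, start, end)	for ((i) = (start); (i) < (end); (i)++)

#define for_range(i, ...) \
	CONCAT(for_range_, ARG_COUNT(__VA_ARGS__))(i, __VA_ARGS__)

/* Short form: the second positional argument is the end bound. */
static int sum_short_form(int end)
{
	int i, sum = 0;

	for_range(i, end)
		sum += i;

	return sum;
}

/* Long form: the second positional argument is now the start bound. */
static int sum_long_form(int start, int end)
{
	int i, sum = 0;

	for_range(i, start, end)
		sum += i;

	return sum;
}
```

Note how the meaning of the second positional argument depends on how
many arguments follow it, which a reader cannot tell from the macro
name alone.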

	David



^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Christian König @ 2026-06-22  8:51 UTC (permalink / raw)
  To: Kaitao Cheng, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt
  Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, linux-kernel, cgroups,
	linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
	dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622040533.29824-2-kaitao.cheng@linux.dev>

On 6/22/26 06:05, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> The list_for_each*_safe() helpers are used when the loop body may
> remove the current entry.  Their API exposes the temporary cursor at
> every call site, even though most users only need it for the iterator
> implementation and never reference it in the loop body.
> 
> Add *_mutable() variants for list and hlist iteration.  The new helpers
> support both forms: callers may keep passing an explicit temporary cursor
> when they need to inspect or reset it, or omit it and let the helper use
> a unique internal cursor.

That sounds like a bad idea to me. The macro should really do one job, and do it as well as it can.

> This makes call sites that only mutate the list through the current entry
> less noisy, while keeping the existing *_safe() helpers available for
> compatibility.

The existing *_safe() helpers can be used perfectly well for code that really needs the separate variable for the next entry.

Regards,
Christian.


> 
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>  1 file changed, 231 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index 09d979976b3b..1081def7cea9 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -7,6 +7,7 @@
>  #include <linux/stddef.h>
>  #include <linux/poison.h>
>  #include <linux/const.h>
> +#include <linux/args.h>
>  
>  #include <asm/barrier.h>
>  
> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>  #define list_for_each_prev(pos, head) \
>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>  
> -/**
> - * list_for_each_safe - iterate over a list safe against removal of list entry
> - * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> +/*
> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>   */
>  #define list_for_each_safe(pos, n, head) \
>  	for (pos = (head)->next, n = pos->next; \
>  	     !list_is_head(pos, (head)); \
>  	     pos = n, n = pos->next)
>  
> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\
> +	     !list_is_head(pos, (head));				\
> +	     pos = tmp, tmp = pos->next)
> +
> +#define __list_for_each_mutable1(pos, head)				\
> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> +
> +#define __list_for_each_mutable2(pos, next, head)			\
> +	list_for_each_safe(pos, next, head)
> +
>  /**
> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
> + * list_for_each_mutable - iterate over a list safe against entry removal
>   * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> + * @...:	either (head) or (next, head)
> + *
> + * next:	another &struct list_head to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your list.
> + */
> +#define list_for_each_mutable(pos, ...)					\
> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> +		(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_prev_safe is an old interface, use list_for_each_prev_mutable instead.
>   */
>  #define list_for_each_prev_safe(pos, n, head) \
>  	for (pos = (head)->prev, n = pos->prev; \
>  	     !list_is_head(pos, (head)); \
>  	     pos = n, n = pos->prev)
>  
> +#define __list_for_each_prev_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->prev)->prev;		\
> +	     !list_is_head(pos, (head));				\
> +	     pos = tmp, tmp = pos->prev)
> +
> +#define __list_for_each_prev_mutable1(pos, head)			\
> +	__list_for_each_prev_mutable_internal(pos, __UNIQUE_ID(prev), head)
> +
> +#define __list_for_each_prev_mutable2(pos, prev, head)			\
> +	list_for_each_prev_safe(pos, prev, head)
> +
> +/**
> + * list_for_each_prev_mutable - iterate over a list backwards safe against entry removal
> + * @pos:	the &struct list_head to use as a loop cursor.
> + * @...:	either (head) or (prev, head)
> + *
> + * prev:	another &struct list_head to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your list.
> + */
> +#define list_for_each_prev_mutable(pos, ...)				\
> +	CONCATENATE(__list_for_each_prev_mutable, COUNT_ARGS(__VA_ARGS__)) \
> +		(pos, __VA_ARGS__)
> +
>  /**
>   * list_count_nodes - count nodes in the list
>   * @head:	the head for your list.
> @@ -895,12 +940,8 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	for (; !list_entry_is_head(pos, head, member);			\
>  	     pos = list_prev_entry(pos, member))
>  
> -/**
> - * list_for_each_entry_safe - iterate over list of given type safe against removal of list entry
> - * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> +/*
> + * list_for_each_entry_safe is an old interface, use list_for_each_entry_mutable instead.
>   */
>  #define list_for_each_entry_safe(pos, n, head, member)			\
>  	for (pos = list_first_entry(head, typeof(*pos), member),	\
> @@ -908,15 +949,36 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	     !list_entry_is_head(pos, head, member); 			\
>  	     pos = n, n = list_next_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_internal(pos, tmp, head, member)	\
> +	for (typeof(pos) tmp = list_next_entry(pos =			\
> +		list_first_entry(head, typeof(*pos), member), member);	\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_next_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable2(pos, head, member)		\
> +	__list_for_each_entry_mutable_internal(pos, __UNIQUE_ID(next), head, member)
> +
> +#define __list_for_each_entry_mutable3(pos, next, head, member)		\
> +	list_for_each_entry_safe(pos, next, head, member)
> +
>  /**
> - * list_for_each_entry_safe_continue - continue list iteration safe against removal
> + * list_for_each_entry_mutable - iterate over a list safe against entry removal
>   * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> + * @...:	either (head, member) or (next, head, member)
>   *
> - * Iterate over list of given type, continuing after current point,
> - * safe against removal of list entry.
> + * next:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + */
> +#define list_for_each_entry_mutable(pos, ...)				\
> +	CONCATENATE(__list_for_each_entry_mutable, COUNT_ARGS(__VA_ARGS__)) \
> +		(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_entry_safe_continue is an old interface,
> + * use list_for_each_entry_mutable_continue instead.
>   */
>  #define list_for_each_entry_safe_continue(pos, n, head, member) 		\
>  	for (pos = list_next_entry(pos, member), 				\
> @@ -924,30 +986,79 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	     !list_entry_is_head(pos, head, member);				\
>  	     pos = n, n = list_next_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_continue_internal(pos, tmp, head, member) \
> +	for (typeof(pos) tmp = list_next_entry(pos =			\
> +		list_next_entry(pos, member), member);			\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_next_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable_continue2(pos, head, member)	\
> +	__list_for_each_entry_mutable_continue_internal(pos,		\
> +		__UNIQUE_ID(next), head, member)
> +
> +#define __list_for_each_entry_mutable_continue3(pos, next, head, member) \
> +	list_for_each_entry_safe_continue(pos, next, head, member)
> +
>  /**
> - * list_for_each_entry_safe_from - iterate over list from current point safe against removal
> + * list_for_each_entry_mutable_continue - continue list iteration safe against removal
>   * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> + * @...:	either (head, member) or (next, head, member)
>   *
> - * Iterate over list of given type from current point, safe against
> - * removal of list entry.
> + * next:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + *
> + * Iterate over list of given type, continuing after current point,
> + * safe against removal of list entry.
> + */
> +#define list_for_each_entry_mutable_continue(pos, ...)			\
> +	CONCATENATE(__list_for_each_entry_mutable_continue,		\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_entry_safe_from is an old interface,
> + * use list_for_each_entry_mutable_from instead.
>   */
>  #define list_for_each_entry_safe_from(pos, n, head, member) 			\
>  	for (n = list_next_entry(pos, member);					\
>  	     !list_entry_is_head(pos, head, member);				\
>  	     pos = n, n = list_next_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_from_internal(pos, tmp, head, member) \
> +	for (typeof(pos) tmp = list_next_entry(pos, member);		\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_next_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable_from2(pos, head, member)		\
> +	__list_for_each_entry_mutable_from_internal(pos,		\
> +		__UNIQUE_ID(next), head, member)
> +
> +#define __list_for_each_entry_mutable_from3(pos, next, head, member)	\
> +	list_for_each_entry_safe_from(pos, next, head, member)
> +
>  /**
> - * list_for_each_entry_safe_reverse - iterate backwards over list safe against removal
> + * list_for_each_entry_mutable_from - iterate over list from current point safe against removal
>   * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> + * @...:	either (head, member) or (next, head, member)
>   *
> - * Iterate backwards over list of given type, safe against removal
> - * of list entry.
> + * next:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + *
> + * Iterate over list of given type from current point, safe against
> + * removal of list entry.
> + */
> +#define list_for_each_entry_mutable_from(pos, ...)			\
> +	CONCATENATE(__list_for_each_entry_mutable_from,			\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_entry_safe_reverse is an old interface,
> + * use list_for_each_entry_mutable_reverse instead.
>   */
>  #define list_for_each_entry_safe_reverse(pos, n, head, member)		\
>  	for (pos = list_last_entry(head, typeof(*pos), member),		\
> @@ -955,6 +1066,37 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	     !list_entry_is_head(pos, head, member); 			\
>  	     pos = n, n = list_prev_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_reverse_internal(pos, tmp, head, member) \
> +	for (typeof(pos) tmp = list_prev_entry(pos =			\
> +		list_last_entry(head, typeof(*pos), member), member);	\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_prev_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable_reverse2(pos, head, member)	\
> +	__list_for_each_entry_mutable_reverse_internal(pos,		\
> +		__UNIQUE_ID(prev), head, member)
> +
> +#define __list_for_each_entry_mutable_reverse3(pos, prev, head, member)	\
> +	list_for_each_entry_safe_reverse(pos, prev, head, member)
> +
> +/**
> + * list_for_each_entry_mutable_reverse - iterate backwards over list safe against removal
> + * @pos:	the type * to use as a loop cursor.
> + * @...:	either (head, member) or (prev, head, member)
> + *
> + * prev:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + *
> + * Iterate backwards over list of given type, safe against removal
> + * of list entry.
> + */
> +#define list_for_each_entry_mutable_reverse(pos, ...)			\
> +	CONCATENATE(__list_for_each_entry_mutable_reverse,		\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
>  /**
>   * list_safe_reset_next - reset a stale list_for_each_entry_safe loop
>   * @pos:	the loop cursor used in the list_for_each_entry_safe loop
> @@ -1189,6 +1331,31 @@ static inline void hlist_splice_init(struct hlist_head *from,
>  	for (pos = (head)->first; pos && ({ n = pos->next; 1; }); \
>  	     pos = n)
>  
> +#define __hlist_for_each_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->first) ? pos->next : NULL; \
> +	     pos;							\
> +	     pos = tmp, tmp = pos ? pos->next : NULL)
> +
> +#define __hlist_for_each_mutable1(pos, head)				\
> +	__hlist_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> +
> +#define __hlist_for_each_mutable2(pos, next, head)			\
> +	hlist_for_each_safe(pos, next, head)
> +
> +/**
> + * hlist_for_each_mutable - iterate over a hlist safe against entry removal
> + * @pos:	the &struct hlist_node to use as a loop cursor.
> + * @...:	either (head) or (next, head)
> + *
> + * next:	another &struct hlist_node to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your hlist.
> + */
> +#define hlist_for_each_mutable(pos, ...)				\
> +	CONCATENATE(__hlist_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> +		(pos, __VA_ARGS__)
> +
>  #define hlist_entry_safe(ptr, type, member) \
>  	({ typeof(ptr) ____ptr = (ptr); \
>  	   ____ptr ? hlist_entry(____ptr, type, member) : NULL; \
> @@ -1224,18 +1391,44 @@ static inline void hlist_splice_init(struct hlist_head *from,
>  	for (; pos;							\
>  	     pos = hlist_entry_safe((pos)->member.next, typeof(*(pos)), member))
>  
> -/**
> - * hlist_for_each_entry_safe - iterate over list of given type safe against removal of list entry
> - * @pos:	the type * to use as a loop cursor.
> - * @n:		a &struct hlist_node to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the hlist_node within the struct.
> +/*
> + * hlist_for_each_entry_safe is an old interface, use hlist_for_each_entry_mutable instead.
>   */
>  #define hlist_for_each_entry_safe(pos, n, head, member) 		\
>  	for (pos = hlist_entry_safe((head)->first, typeof(*pos), member);\
>  	     pos && ({ n = pos->member.next; 1; });			\
>  	     pos = hlist_entry_safe(n, typeof(*pos), member))
>  
> +#define __hlist_for_each_entry_mutable_internal(pos, tmp, head, member)	\
> +	for (struct hlist_node *tmp = (pos =				\
> +		hlist_entry_safe((head)->first, typeof(*pos), member)) ? \
> +		pos->member.next : NULL;				\
> +	     pos;							\
> +	     pos = hlist_entry_safe((tmp), typeof(*pos), member),	\
> +		tmp = pos ? pos->member.next : NULL)
> +
> +#define __hlist_for_each_entry_mutable2(pos, head, member)		\
> +	__hlist_for_each_entry_mutable_internal(pos,			\
> +		__UNIQUE_ID(next), head, member)
> +
> +#define __hlist_for_each_entry_mutable3(pos, next, head, member)	\
> +	hlist_for_each_entry_safe(pos, next, head, member)
> +
> +/**
> + * hlist_for_each_entry_mutable - iterate over hlist safe against entry removal
> + * @pos:	the type * to use as a loop cursor.
> + * @...:	either (head, member) or (next, head, member)
> + *
> + * next:	a &struct hlist_node to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your hlist.
> + * member:	the name of the hlist_node within the struct.
> + */
> +#define hlist_for_each_entry_mutable(pos, ...)				\
> +	CONCATENATE(__hlist_for_each_entry_mutable,			\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
>  /**
>   * hlist_count_nodes - count nodes in the hlist
>   * @head:	the head for your hlist.
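
For readers following the quoted hunks: the new *_mutable macros choose between a caller-supplied and an internal temporary cursor by counting their arguments. Below is a minimal userspace sketch of that dispatch. It assumes a GNU-compatible preprocessor (for __COUNTER__), and every name in it (for_each_mutable, UNIQUE, drop_even, struct item) is made up for illustration; it is not the kernel's implementation.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };
struct item { int val; struct list_head node; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Hypothetical stand-ins for COUNT_ARGS, CONCATENATE and __UNIQUE_ID. */
#define COUNT_ARGS_(_1, _2, n, ...) n
#define COUNT_ARGS(...) COUNT_ARGS_(__VA_ARGS__, 2, 1, 0)
#define CONCAT_(a, b) a##b
#define CONCAT(a, b) CONCAT_(a, b)
#define UNIQUE(prefix) CONCAT(prefix, __COUNTER__)

/* One-argument form: the temporary cursor lives inside the for statement. */
#define for_each_mutable_internal(pos, tmp, head)			\
	for (struct list_head *tmp = (pos = (head)->next)->next;	\
	     pos != (head);						\
	     pos = tmp, tmp = pos->next)
#define for_each_mutable_n1(pos, head)					\
	for_each_mutable_internal(pos, UNIQUE(tmp_), head)

/* Two-argument form: the caller supplies the temporary cursor. */
#define for_each_mutable_n2(pos, n, head)				\
	for (pos = (head)->next, n = pos->next; pos != (head);		\
	     pos = n, n = pos->next)

/* Pick _n1 or _n2 based on how many arguments follow pos. */
#define for_each_mutable(pos, ...)					\
	CONCAT(for_each_mutable_n, COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)

/* Unlink every even-valued item while iterating; return how many remain. */
static int drop_even(struct list_head *head)
{
	struct list_head *pos, *n;
	int left = 0;

	for_each_mutable(pos, head) {		/* internal cursor */
		struct item *it = (struct item *)
			((char *)pos - offsetof(struct item, node));

		if (it->val % 2 == 0)
			list_del(pos);
	}
	for_each_mutable(pos, n, head)		/* explicit cursor */
		left++;
	return left;
}
```

The unique name is generated once, when the argument is expanded, and then reused in all three clauses of the for statement, which is what lets the one-argument form hide the cursor from the caller.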



* RE: [Intel-wired-lan] [PATCH 1/2] igc: Wait for MAC passthrough after reset
From: Loktionov, Aleksandr @ 2026-06-22  8:54 UTC (permalink / raw)
  To: kao, acelan, Ruinskiy, Dima
  Cc: Nguyen, Anthony L, Kitszel, Przemyslaw, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <ajiHH-RaHUjgraMh@acelan-Precision-5480>



> -----Original Message-----
> From: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
> Sent: Monday, June 22, 2026 3:58 AM
> To: Ruinskiy, Dima <dima.ruinskiy@intel.com>
> Cc: Loktionov, Aleksandr <aleksandr.loktionov@intel.com>; Nguyen,
> Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Andrew Lunn <andrew+netdev@lunn.ch>;
> David S. Miller <davem@davemloft.net>; Eric Dumazet
> <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>; Paolo Abeni
> <pabeni@redhat.com>; intel-wired-lan@lists.osuosl.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [Intel-wired-lan] [PATCH 1/2] igc: Wait for MAC
> passthrough after reset
> 
> On Thu, Jun 18, 2026 at 11:51:35AM +0300, Ruinskiy, Dima wrote:
> > On 18/06/2026 10:55, Loktionov, Aleksandr wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On
> > > > Behalf Of Chia-Lin Kao (AceLan) via Intel-wired-lan
> > > > Sent: Thursday, June 18, 2026 9:33 AM
> > > > To: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel,
> > > > Przemyslaw <przemyslaw.kitszel@intel.com>
> > > > Cc: Andrew Lunn <andrew+netdev@lunn.ch>; David S. Miller
> > > > <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> > > > Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>;
> > > > intel-wired-lan@lists.osuosl.org; netdev@vger.kernel.org;
> > > > linux-kernel@vger.kernel.org
> > > > Subject: [Intel-wired-lan] [PATCH 1/2] igc: Wait for MAC
> > > > passthrough after reset
> > > >
> > > > Some systems support MAC passthrough for dock Ethernet controllers
> > > > by having firmware rewrite the receive address registers after the
> > > > controller reset completes.
> > > >
> > > > igc resets the controller before reading RAL0/RAH0, so that reset
> > > > can restore the controller native MAC address temporarily. If the
> > > > driver reads the registers immediately, it can race the firmware
> > > > rewrite and keep the native dock MAC instead of the host
> > > > passthrough MAC.
> > > >
> > > > For LMVP devices, poll RAL0/RAH0 after reset and before reading
> > > > the MAC address. Stop once the address registers change to another
> > > > valid Ethernet address, allowing firmware a bounded window to
> > > > complete the passthrough update.
> > > >
> Hi Aleksandr and Dima,
> 
> Let me answer your questions below.
> 
> > > Good day, Chia-Lin
> > >
> > > It'd be great if you could share more details on how to reproduce
> > > the issue.
> > >
> > > What exact hardware setup is affected (dock model, NIC, system)?
> We've observed this issue for a long time. We first encountered it in
> 2021 on Lenovo ThinkPad P15 Gen 2 laptops (types 20YQ, 20YR) and added a
> 600ms delay.
> Recently, we encountered the same issue on Dell machines, too, and
> increased the delay to 1000ms.
> And now, the issue occurs again.
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1942999
> https://lore.kernel.org/lkml/20210702045120.22855-2-aaron.ma@canonical.com/
> https://bugs.launchpad.net/ubuntu/+source/linux-oem-6.17/+bug/2143197
> 
> > > Which firmware/BIOS version?
> It is not limited to a single firmware or BIOS version, nor to a single
> hardware model or brand.
> 
> > > How often does the race trigger?
> It may happen when the dock cable is re-plugged.
> With the mainline kernel, it's easy to reproduce the issue by
> re-plugging the dock cable.
> 
> > > Do you have a way to reliably reproduce it?
> Yes, I can find some machines to reproduce the issue reliably.
> 
> > >
> > > Also, what is the observed behavior vs. expected behavior? For
> > > example, which MAC address is seen and which one should be used?
> Here are the debugging logs. fc:4c:ea:ae:a1:e3 is the MAC address of
> the machine, and c4:d6:d3:83:75:d1 is the MAC of the dock.
> 
> It gets the correct passthrough MAC address after bootup and the first
> re-plug at 40s, and fails to update the MAC address in time after a
> couple of re-plugs.
> 
> [    0.689873] igc 0000:70:00.0: MAC debug before reset_hw:
> RAL0=0xaeea4cfc RAH0=0x8000e3a1 RAR0=fc:4c:ea:ae:a1:e3 valid=1
> [    0.755187] igc 0000:70:00.0: MAC debug after reset_hw:
> RAL0=0x83d3d6c4 RAH0=0x8000d175 RAR0=c4:d6:d3:83:75:d1 valid=1
> [    0.755576] igc 0000:70:00.0: MAC debug:
> eth_platform_get_mac_address ret=-19, reading RAR0/NVM fallback
> [    0.755582] igc 0000:70:00.0: MAC debug: read_mac_addr ret=0
> addr=fc:4c:ea:ae:a1:e3 perm_addr=fc:4c:ea:ae:a1:e3
> [    4.687730] igc 0000:70:00.0: MAC debug firmware: fwnode=<none>
> props(mac=0 local=0 address=0) fwnode_ret=-19
> fwnode_mac=00:00:00:00:00:00 device_ret=-2
> device_mac=00:00:00:00:00:00 is_tbt=0 external=0 hotplug_bridge=0
> [    4.687739] igc 0000:70:00.0: MAC debug before reset_hw:
> RAL0=0xaeea4cfc RAH0=0x8000e3a1 RAR0=fc:4c:ea:ae:a1:e3 valid=1
> [    4.748545] igc 0000:70:00.0: MAC debug after reset_hw:
> RAL0=0x83d3d6c4 RAH0=0x8000d175 RAR0=c4:d6:d3:83:75:d1 valid=1
> [    4.748937] igc 0000:70:00.0: MAC debug:
> eth_platform_get_mac_address ret=-19, reading RAR0/NVM fallback
> [    4.748944] igc 0000:70:00.0: MAC debug: read_mac_addr ret=0
> addr=fc:4c:ea:ae:a1:e3 perm_addr=fc:4c:ea:ae:a1:e3
> [   40.892715] igc 0000:70:00.0: MAC debug firmware: fwnode=<none>
> props(mac=0 local=0 address=0) fwnode_ret=-19
> fwnode_mac=00:00:00:00:00:00 device_ret=-2
> device_mac=00:00:00:00:00:00 is_tbt=0 external=0 hotplug_bridge=0
> [   40.892724] igc 0000:70:00.0: MAC debug before reset_hw:
> RAL0=0x83d3d6c4 RAH0=0x8000d175 RAR0=c4:d6:d3:83:75:d1 valid=1
> [   40.953524] igc 0000:70:00.0: MAC debug after reset_hw:
> RAL0=0x83d3d6c4 RAH0=0x8000d175 RAR0=c4:d6:d3:83:75:d1 valid=1
> [   40.953933] igc 0000:70:00.0: MAC debug:
> eth_platform_get_mac_address ret=-19, reading RAR0/NVM fallback
> [   40.953941] igc 0000:70:00.0: MAC debug: read_mac_addr ret=0
> addr=c4:d6:d3:83:75:d1 perm_addr=c4:d6:d3:83:75:d1
> ...
> [  307.387282] igc 0000:70:00.0: MAC poll change at 700ms:
> RAL0=0xaeea4cfc RAH0=0x8000e3a1 RAR0=fc:4c:ea:ae:a1:e3 valid=1
> prev=c4:d6:d3:83:75:d1
> [  328.826084] igc 0000:38:00.0: MAC poll change at 1000ms:
> RAL0=0xaeea4cfc RAH0=0x8000e3a1 RAR0=fc:4c:ea:ae:a1:e3 valid=1
> prev=c4:d6:d3:83:75:d1
> [  429.070519] igc 0000:38:00.0: MAC poll change at 1100ms:
> RAL0=0xaeea4cfc RAH0=0x8000e3a1 RAR0=fc:4c:ea:ae:a1:e3 valid=1
> prev=c4:d6:d3:83:75:d1
> [  466.509571] igc 0000:70:00.0: MAC poll change at 1000ms:
> RAL0=0xaeea4cfc RAH0=0x8000e3a1 RAR0=fc:4c:ea:ae:a1:e3 valid=1
> prev=c4:d6:d3:83:75:d1
> 

Please include this information in the commit message so that users can grep for the error and find the fix.
Exact bash commands for reproduction would also help administrators decide whether they need to patch their OS.

Thank you
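
To make the polling idea from the commit message concrete, here is a rough userspace sketch of a bounded poll on RAL0/RAH0. The register layout (MAC bytes stored low byte first, bit 31 of RAH0 as the "valid" flag) is inferred from the log lines quoted above. All function and type names are hypothetical, and the register reader is a fake standing in for firmware, not igc driver code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RAH_AV (1u << 31)	/* assumed "address valid" bit (valid=1 in the logs) */

struct rar { uint32_t ral, rah; };

/* Unpack RAL0/RAH0 into six MAC bytes, low byte first. */
static void rar_to_mac(struct rar r, uint8_t mac[6])
{
	mac[0] = r.ral & 0xff;
	mac[1] = (r.ral >> 8) & 0xff;
	mac[2] = (r.ral >> 16) & 0xff;
	mac[3] = (r.ral >> 24) & 0xff;
	mac[4] = r.rah & 0xff;
	mac[5] = (r.rah >> 8) & 0xff;
}

/* Valid unicast address: not multicast and not all zeroes. */
static bool mac_is_valid(const uint8_t mac[6])
{
	static const uint8_t zero[6];

	return !(mac[0] & 1) && memcmp(mac, zero, 6) != 0;
}

typedef struct rar (*rar_read_fn)(void *ctx);

/*
 * Poll up to max_polls times. Return the 1-based poll on which a valid
 * address different from prev showed up, or -1 if the window ran out.
 * out always holds the last address read.
 */
static int poll_mac_change(rar_read_fn rd, void *ctx, const uint8_t prev[6],
			   int max_polls, uint8_t out[6])
{
	for (int i = 1; i <= max_polls; i++) {
		struct rar r = rd(ctx);

		rar_to_mac(r, out);
		if ((r.rah & RAH_AV) && mac_is_valid(out) &&
		    memcmp(out, prev, 6) != 0)
			return i;
		/* a real driver would sleep here between polls */
	}
	return -1;
}

/* Fake firmware: dock MAC for the first few reads, then the host MAC. */
struct fake_fw { int reads_until_rewrite; };

static struct rar fake_read(void *ctx)
{
	struct fake_fw *fw = ctx;

	if (fw->reads_until_rewrite > 0) {
		fw->reads_until_rewrite--;
		return (struct rar){ 0x83d3d6c4, 0x8000d175 };	/* dock */
	}
	return (struct rar){ 0xaeea4cfc, 0x8000e3a1 };		/* host */
}
```

The poll count is the bound: if firmware never rewrites the registers, the caller falls back to whatever address was read last instead of waiting forever.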

...



* Re: [patch V2 18/25] timekeeping: Prepare for cross timestamps on arbitrary clock IDs
From: David Woodhouse @ 2026-06-22  8:55 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Miroslav Lichvar, John Stultz, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, thomas.weissschuh, Arthur Kiyanovski,
	Rodolfo Giometti, Vincent Donnefort, Marc Zyngier, Oliver Upton,
	kvmarm, Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Peter Hilber, Michael S. Tsirkin, virtualization,
	linux-wireless, linux-sound, Vadim Fedorenko
In-Reply-To: <20260529195557.846634842@kernel.org>


On Fri, 2026-05-29 at 22:01 +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@kernel.org>
> 
> PTP device system crosstime stamps support only CLOCK_REALTIME, which is
> meaningless for AUX clocks. The PTP core hands in the clock ID already, so
> prepare the core code to honor it.
> 
>  - Add a new sys_systime field to struct system_device_crosststamp which
>    aliases the sys_realtime field. Once all users are converted
>    sys_realtime can be removed.
> 
>  - Prepare get_device_system_crosststamp() and the related code for it by
>    switching to sys_systime and providing the initial changes to utilize
>    different time keepers.
> 
> No functional change intended.

We ended up with ktime_get_snapshot_id() also supporting CLOCK_BOOTTIME
and CLOCK_MONOTONIC_RAW, but not get_device_system_crosststamp().
Should we make that consistent?



* Re: [PATCH net v2] amt: don't read the IP source address from a reallocated skb header
From: Taehee Yoo @ 2026-06-22  8:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Michael Bommarito, David S . Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, netdev, linux-kernel
In-Reply-To: <20260621150011.33c2fe80@kernel.org>

On Mon, Jun 22, 2026 at 7:00 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 17 Jun 2026 08:34:43 -0400 Michael Bommarito wrote:
> > amt_update_handler() caches iph = ip_hdr(skb) and then calls
> > pskb_may_pull(). pskb_may_pull() can reallocate the skb head: the new
> > head is allocated and the old one is freed. The cached iph is not
> > refreshed, so the following tunnel lookup reads iph->saddr from the
> > freed head. On an AMT relay this lookup runs for every incoming
> > membership update, before the update's nonce and response MAC are
> > validated.
> >
> > The sibling handlers amt_multicast_data_handler() and
> > amt_membership_query_handler() re-read ip_hdr() after the pull and are
> > not affected; only amt_update_handler() keeps the pre-pull pointer.
>
> Sashikos point out a bunch more of these in AMT:
> https://sashiko.dev/#/patchset/20260617123443.3586930-1-michael.bommarito@gmail.com
> https://netdev-ai.bots.linux.dev/sashiko/#/patchset/20260617123443.3586930-1-michael.bommarito@gmail.com
>
> Let's fix them all with one patch?

Agreed.
Michael, could you please fix the remaining ones Sashiko flagged?

Thanks a lot!
Taehee Yoo
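
For anyone new to this class of bug, the pattern can be shown with a toy userspace buffer: a pointer taken into the head before a "pull" may dangle afterwards, so the header has to be fetched again. The sketch below uses made-up types and helpers (toy_skb, toy_may_pull, toy_update_handler); it only mimics the shape of the problem and is not the AMT code or the real sk_buff API.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy buffer; not the kernel's struct sk_buff. */
struct toy_skb {
	uint8_t *head;	/* start of the linear data */
	size_t len;	/* bytes currently in the linear area */
};

struct toy_iphdr { uint32_t saddr; };

static struct toy_iphdr *toy_ip_hdr(struct toy_skb *skb)
{
	return (struct toy_iphdr *)skb->head;
}

/*
 * Make sure at least need bytes are linear. When the area has to grow,
 * a new head is allocated and the old one is freed, so every pointer
 * obtained from the old head becomes invalid.
 */
static int toy_may_pull(struct toy_skb *skb, size_t need)
{
	uint8_t *nhead;

	if (need <= skb->len)
		return 1;
	nhead = calloc(1, need);
	if (!nhead)
		return 0;
	memcpy(nhead, skb->head, skb->len);
	free(skb->head);
	skb->head = nhead;
	skb->len = need;
	return 1;
}

/* Read the source address only through a pointer fetched after the pull. */
static int toy_update_handler(struct toy_skb *skb, size_t need, uint32_t *saddr)
{
	struct toy_iphdr *iph;

	if (!toy_may_pull(skb, need))
		return -1;
	iph = toy_ip_hdr(skb);	/* points into the current head */
	*saddr = iph->saddr;
	return 0;
}
```

An alternative with the same effect is to copy the needed field into a local variable before the pull, so nothing points into the old head afterwards.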

> --
> pw-bot: cr


* Re: [PATCH v3 net] net: watchdog: fix refcount tracking races
From: Eric Dumazet @ 2026-06-22  8:59 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, syzbot+381d82bbf0253710b35d,
	syzbot+3479efbc2821cb2a79f2
In-Reply-To: <a443376e-5187-4268-93b3-58047ef113a8@samsung.com>

On Wed, Jun 17, 2026 at 3:48 AM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> Dear All,
>
> On 11.06.2026 17:27, Eric Dumazet wrote:
> > Blamed commit converted the untracked dev_hold()/dev_put() calls
> > in the watchdog code to use the tracked dev_hold_track()/dev_put_track()
> > (which were later renamed/interfaced to netdev_hold() and netdev_put()).
> >
> > By introducing dev->watchdog_dev_tracker to store the
> > reference tracking information without adding synchronization
> > between netdev_watchdog_up() and dev_watchdog(), it enabled the
> > race condition where this pointer could be overwritten or freed
> > concurrently, leading to the list corruption crash syzbot reported:
> >
> > list_del corruption, ffff888114a18c00->next is NULL
> >  kernel BUG at lib/list_debug.c:52 !
> > Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> > CPU: 1 UID: 0 PID: 91 Comm: kworker/u8:5 Not tainted syzkaller #0 PREEMPT(lazy)
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
> > Workqueue: events_unbound linkwatch_event
> >  RIP: 0010:__list_del_entry_valid_or_report.cold+0x22/0x2a lib/list_debug.c:52
> > Call Trace:
> >  <TASK>
> >   __list_del_entry_valid include/linux/list.h:132 [inline]
> >   __list_del_entry include/linux/list.h:246 [inline]
> >   list_move_tail include/linux/list.h:341 [inline]
> >   ref_tracker_free+0x1a7/0x6c0 lib/ref_tracker.c:329
> >   netdev_tracker_free include/linux/netdevice.h:4491 [inline]
> >   netdev_put include/linux/netdevice.h:4508 [inline]
> >   netdev_put include/linux/netdevice.h:4504 [inline]
> >   netdev_watchdog_down net/sched/sch_generic.c:600 [inline]
> >   dev_deactivate_many+0x28c/0xfe0 net/sched/sch_generic.c:1363
> >   dev_deactivate+0x109/0x1d0 net/sched/sch_generic.c:1397
> >   linkwatch_do_dev net/core/link_watch.c:184 [inline]
> >   linkwatch_do_dev+0xd3/0x120 net/core/link_watch.c:166
> >   __linkwatch_run_queue+0x3a5/0x810 net/core/link_watch.c:240
> >   linkwatch_event+0x8f/0xc0 net/core/link_watch.c:314
> >   process_one_work+0xa0e/0x1980 kernel/workqueue.c:3314
> >   process_scheduled_works kernel/workqueue.c:3397 [inline]
> >   worker_thread+0x5ef/0xe50 kernel/workqueue.c:3478
> >   kthread+0x370/0x450 kernel/kthread.c:436
> >   ret_from_fork+0x69a/0xc80 arch/x86/kernel/process.c:158
> >   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> >
> > This patch has three coordinated parts:
> >
> > 1) Add dev->watchdog_lock and dev->watchdog_ref_held to serialize watchdog operations.
> >
> > 2) Remove netdev_watchdog_up() call from netif_carrier_on():
> >    This ensures netdev_watchdog_up() is only called from process/BH context
> >    (via linkwatch workqueue dev_activate()), allowing us to use
> >    spin_lock_bh() for synchronization.
> >
> > 3) Synchronize watchdog up and watchdog timer:
> >    Protect netdev_watchdog_up() with tx_global_lock and watchdog_lock.
> >    Only allocate a new tracker in netdev_watchdog_up() if one is
> >    not already present.
> >    In dev_watchdog(), ensure we don't release the tracker if the
> >    timer was rescheduled either by dev_watchdog() itself or concurrently
> >    by netdev_watchdog_up().
> >
> > Fixes: f12bf6f3f942 ("net: watchdog: add net device refcount tracker")
> > Reported-by: syzbot+381d82bbf0253710b35d@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/netdev/6a26b751.c25708ab.1b19ef.0013.GAE@google.com/T/#u
> > Tested-by: syzbot+3479efbc2821cb2a79f2@syzkaller.appspotmail.com
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> This patch landed recently in linux-next as commit 8eed5519e496 ("net: watchdog:
> fix refcount tracking races"). In my tests I found that it causes the following
> deadlock during system suspend/resume on QEMU's 64-bit ARM 'virt' machine:
>
> root@target:~# time rtcwake -s10 -mmem
> rtcwake: assuming RTC uses UTC ...
> rtcwake: wakeup from "mem" using /dev/rtc0 at Wed Jun 17 10:46:12 2026
> PM: suspend entry (s2idle)
> Filesystems sync: 0.055 seconds
> Freezing user space processes
> Freezing user space processes completed (elapsed 0.006 seconds)
> OOM killer disabled.
> Freezing remaining freezable tasks
> Freezing remaining freezable tasks completed (elapsed 0.003 seconds)
>
> ============================================
> WARNING: possible recursive locking detected
> 7.1.0-rc7+ #13003 Not tainted
> --------------------------------------------
> rtcwake/254 is trying to acquire lock:
> ffff000006de64e8 (&dev->tx_global_lock){+.-.}-{3:3}, at: netdev_watchdog_up+0x40/0x108
>
> but task is already holding lock:
> ffff000006de64e8 (&dev->tx_global_lock){+.-.}-{3:3}, at: netif_tx_lock+0x1c/0x34
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock(&dev->tx_global_lock);
>   lock(&dev->tx_global_lock);
>
>  *** DEADLOCK ***
>
>  May be due to missing lock nesting notation
>
> 6 locks held by rtcwake/254:
>  #0: ffff0000071ab3e8 (sb_writers#5){.+.+}-{0:0}, at: vfs_write+0x1ec/0x35c
>  #1: ffff00000d22c480 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0xf0/0x1c4
>  #2: ffff0000049162c8 (kn->active#61){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1c4
>  #3: ffffaa79533c03b0 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0x98/0x608
>  #4: ffff000005e3a138 (&dev->mutex){....}-{4:4}, at: device_resume+0xb4/0x254
>  #5: ffff000006de64e8 (&dev->tx_global_lock){+.-.}-{3:3}, at: netif_tx_lock+0x1c/0x34
>
> stack backtrace:
> CPU: 1 UID: 0 PID: 254 Comm: rtcwake Not tainted 7.1.0-rc7+ #13003 PREEMPT
> Hardware name: linux,dummy-virt (DT)
> Call trace:
>  show_stack+0x18/0x24 (C)
>  dump_stack_lvl+0x90/0xd0
>  dump_stack+0x18/0x24
>  print_deadlock_bug+0x260/0x350
>  __lock_acquire+0x11b8/0x225c
>  lock_acquire+0x1c4/0x3f0
>  _raw_spin_lock_bh+0x50/0x68
>  netdev_watchdog_up+0x40/0x108
>  netif_device_attach+0x9c/0xb0
>  virtnet_restore+0x100/0x21c
>  virtio_device_restore_priv+0x11c/0x1d0
>  virtio_device_restore+0x14/0x20
>  virtio_mmio_restore+0x34/0x40
>  platform_pm_resume+0x2c/0x68
>  dpm_run_callback+0xa0/0x240
>  device_resume+0x120/0x254
>  dpm_resume+0x1f8/0x2ec
>  dpm_resume_end+0x18/0x34
>  suspend_devices_and_enter+0x1d0/0x990
>  pm_suspend+0x1ec/0x608
>  state_store+0x8c/0x110
>  kobj_attr_store+0x18/0x2c
>  sysfs_kf_write+0x50/0x7c
>  kernfs_fop_write_iter+0x130/0x1c4
>  vfs_write+0x2b8/0x35c
>  ksys_write+0x6c/0x104
>  __arm64_sys_write+0x1c/0x28
>  invoke_syscall+0x54/0x110
>  el0_svc_common.constprop.0+0x40/0xe8
>  do_el0_svc+0x20/0x2c
>  el0_svc+0x54/0x338
>  el0t_64_sync_handler+0xa0/0xe4
>  el0t_64_sync+0x198/0x19c
>
>
> Reverting $subject on top of linux-next fixes this issue.

Thanks for the report, Marek!

Acquiring tx_global_lock in netdev_watchdog_up() appears unnecessary anyway
because the critical state (timer and refcount tracker) is already
protected by dev->watchdog_lock.

Could you try this patch?

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 3f1c510df850dbdbaf10d483547c7b1f3a5d5482..ef2b4bf51564173751c74fefe17e3913ed2fa056 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -594,9 +594,8 @@ void netdev_watchdog_up(struct net_device *dev)
                return;
        if (dev->watchdog_timeo <= 0)
                dev->watchdog_timeo = 5*HZ;
-       spin_lock_bh(&dev->tx_global_lock);

-       spin_lock(&dev->watchdog_lock);
+       spin_lock_bh(&dev->watchdog_lock);
        if (!mod_timer(&dev->watchdog_timer,
                       round_jiffies(jiffies + dev->watchdog_timeo))) {
                if (!dev->watchdog_ref_held) {
@@ -605,9 +604,7 @@ void netdev_watchdog_up(struct net_device *dev)
                        dev->watchdog_ref_held = true;
                }
        }
-       spin_unlock(&dev->watchdog_lock);
-
-       spin_unlock_bh(&dev->tx_global_lock);
+       spin_unlock_bh(&dev->watchdog_lock);
 }
 EXPORT_SYMBOL_GPL(netdev_watchdog_up);
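
The invariant the watchdog_ref_held flag is meant to keep (one tracked reference per armed timer, dropped only when the timer is really done) can be modelled in a few lines of plain C. This is a single-threaded toy under my own assumptions: the struct, the helpers and the shape of the timer callback are hypothetical, the lock is shown only as comments, and dev_watchdog() itself is not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical model; field names only echo the ones in the diff above. */
struct toy_dev {
	bool timer_pending;	/* stands in for the kernel timer state */
	bool ref_held;		/* mirrors dev->watchdog_ref_held */
	int refcnt;		/* stands in for the tracked device reference */
};

/* Like mod_timer(): arm the timer, report whether it was already pending. */
static bool toy_mod_timer(struct toy_dev *d)
{
	bool was_pending = d->timer_pending;

	d->timer_pending = true;
	return was_pending;
}

static void toy_watchdog_up(struct toy_dev *d)
{
	/* spin_lock_bh(&dev->watchdog_lock) would go here */
	if (!toy_mod_timer(d) && !d->ref_held) {
		d->refcnt++;
		d->ref_held = true;
	}
	/* spin_unlock_bh(&dev->watchdog_lock) would go here */
}

/* Timer callback: drop the reference only if nobody rearmed the timer. */
static void toy_watchdog_fire(struct toy_dev *d, bool rearm)
{
	d->timer_pending = false;	/* the timer has expired */
	if (rearm)
		toy_mod_timer(d);
	if (!d->timer_pending && d->ref_held) {
		d->refcnt--;
		d->ref_held = false;
	}
}
```

In this model a second toy_watchdog_up() on an already armed timer takes no extra reference, and a callback that rearms the timer keeps the existing one, so the count never exceeds one.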
