Netdev List
* Re: [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-06-22 (ice, i40e, e1000e)
From: patchwork-bot+netdevbpf @ 2026-06-25  3:00 UTC
  To: Tony Nguyen; +Cc: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
In-Reply-To: <20260622220059.2471844-1-anthony.l.nguyen@intel.com>

Hello:

This series was applied to netdev/net.git (main)
by Tony Nguyen <anthony.l.nguyen@intel.com>:

On Mon, 22 Jun 2026 15:00:47 -0700 you wrote:
> For ice:
> Dawid changes call to release control VSI during reset to prevent
> leaking it.
> 
> Lukasz fixes flow control error check to check value rather than treat
> it as bitmap values.
> 
> [...]

Here is the summary with links:
  - [net,1/8] ice: fix FDIR CTRL VSI resource leak in ice_reset_all_vfs()
    https://git.kernel.org/netdev/net/c/ebbe8868cf47
  - [net,2/8] ice: fix AQ error code comparison in ice_set_pauseparam()
    https://git.kernel.org/netdev/net/c/2bf7744bc322
  - [net,3/8] ice: fix ice_init_link() error return preventing probe
    https://git.kernel.org/netdev/net/c/eb509638686b
  - [net,4/8] ice: call netif_keep_dst() once when entering switchdev mode
    https://git.kernel.org/netdev/net/c/c0d00c882bc4
  - [net,5/8] ice: dpll: set pointers to NULL after kfree in ice_dpll_deinit_info
    https://git.kernel.org/netdev/net/c/a903afff66d7
  - [net,6/8] ice: dpll: fix memory leak in ice_dpll_init_info error paths
    https://git.kernel.org/netdev/net/c/20da495f2df0
  - [net,7/8] i40e: Fix i40e_debug() to use struct i40e_hw argument
    https://git.kernel.org/netdev/net/c/798f94603eb0
  - [net,8/8] e1000e: Reconfigure PLL clock gate timeout and re-enable K1 on Meteor Lake
    https://git.kernel.org/netdev/net/c/578294b8b60d

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-25  3:01 UTC
  To: David Laight, Christian König, Jani Nikula,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <20260624152324.3def88ce@pumpkin>

On 2026/6/24 22:23, David Laight wrote:
> On Wed, 24 Jun 2026 15:23:47 +0200
> Christian König <christian.koenig@amd.com> wrote:
>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>> On 2026/6/22 16:42, David Laight wrote:
>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>  
>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>
>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>> every call site, even though most users only need it for the iterator
>>>>> implementation and never reference it in the loop body.
>>>>>
>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>> a unique internal cursor.  
>>>>
>>>> I'm not really sure 'mutable' means anything either.
>>>> It is possible to make it valid for the loop body (or even other threads)
>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>
>>>> It might be worth doing something that doesn't need the extra variable,
>>>> but there is little point doing all the churn just to rename things.
>>>>  
>>>>>
>>>>> This makes call sites that only mutate the list through the current entry
>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>> compatibility.
>>>>>
>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>> ---
>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>> --- a/include/linux/list.h
>>>>> +++ b/include/linux/list.h
>>>>> @@ -7,6 +7,7 @@
>>>>>  #include <linux/stddef.h>
>>>>>  #include <linux/poison.h>
>>>>>  #include <linux/const.h>
>>>>> +#include <linux/args.h>
>>>>>  
>>>>>  #include <asm/barrier.h>
>>>>>  
>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>  #define list_for_each_prev(pos, head) \
>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>  
>>>>> -/**
>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> +/*
>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>   */
>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>  	     !list_is_head(pos, (head)); \
>>>>>  	     pos = n, n = pos->next)
>>>>>  
>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>
>>>> Use auto
>>>>  
>>>>> +	     !list_is_head(pos, (head));				\
>>>>> +	     pos = tmp, tmp = pos->next)
>>>>> +
>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>> +
>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>> +	list_for_each_safe(pos, next, head)
>>>>> +
>>>>>  /**
>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> + * @...:	either (head) or (next, head)
>>>>> + *
>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>> + *		the caller.
>>>>> + * head:	the head for your list.
>>>>> + */
>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>> +		(pos, __VA_ARGS__)  
>>>>
>>>> The variable argument count logic really just slows down compilation.
>>>> Maybe there aren't enough copies of this code to make that significant.
>>>> But just because you can do it doesn't mean it is a good idea.
>>>> I'm also not sure it really adds anything to the readability.
>>>>
>>>> And, if you are going to make the middle argument optional there is
>>>> no need to change the macro name.  
>>>
>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>> implementation approach. If we abandon that method, it means we will
>>> inevitably need to add some new macros. If mutable is not a good name,
>>> suggestions for better alternatives would be welcome; coming up with a
>>> suitable name is indeed rather tricky.  
>>
>> I don't think you need to add a new macro for the specific use case where people want to modify the next element of the iteration.
>>
>> If I remember your numbers correctly, that is really a corner case, and continuing to use the existing *_safe() macros for it sounds perfectly fine to me.
> 
> IIRC currently you have a choice of either:
> 	define               Item that can't be deleted
> 	list_for_each()	     The current item.
> 	list_for_each_safe() The next item.
> There is also likely to be code that updates the variables to allow
> for other scenarios.
> 
> Note that if you increase a reference count and release a lock then list_for_each()
> is likely safer than list_for_each_safe() :-)
> 
> list.h has 9 variants of the 'safe' loop.
> The bloat of another 9 is getting excessive.
> 
> It has to be said that this is one of my least favourite type of list...

Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
Andy Shevchenko, Alexei Starovoitov

To make the discussion easier, here is a summary of the approaches that
are currently possible, with brief pros and cons for each, using the
list_for_each_entry* interfaces as examples.

1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
would be used for safe-deletion loops that do not need to expose the
temporary cursor to the caller. See v1 for the code.

Pros: Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: Requires adding a whole set of mutable interfaces, which makes the
      code somewhat redundant.

2. Directly optimize away the temporary cursor in list_for_each_entry_safe
and define it inside the loop instead, changing the interface from four
arguments to three.

Pros: Does not add redundant interfaces.
Cons: (1) Callers that use the traversal variable of
      list_for_each_entry_safe must be updated by hand: the new
      list_for_each_entry_safe would no longer apply there, so those
      loops would need to be open-coded.
      (2) Because the macro arguments change, every
      list_for_each_entry_safe caller would need to be modified and
      merged together, and it is difficult to merge that much code at
      once.

3. Use a variadic macro approach to optimize list_for_each_entry_safe,
so that it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can
      be merged directly.
Cons: (1) Increases compile time.
      (2) Makes the interface harder for users to use.

4. Optimize list_for_each_entry by defining the temporary cursor internally,
so that it also covers what list_for_each_entry_safe does. See v2 for
the code.

Pros: (1) Does not add redundant interfaces.
      (2) The number of externally visible arguments of list_for_each_entry
      remains unchanged, still three.
Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.
      (2) Callers that use the traversal variable of list_for_each_entry
      in special ways must be updated by hand: the new list_for_each_entry
      would no longer apply there, so those loops would need to be
      open-coded. There are 15 such cases in total.

5. Use a variadic macro approach to optimize list_for_each_entry, so that
it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: (1) Increases compile time.
      (2) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.

6. Make no changes, keep the current interfaces as they are, and close this
discussion.
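
For reference, the idea behind approach 2 can be sketched in userspace as
below. This is only an illustration, not code from any posted version of
the series. It is shown on the plain list_head variant, where the caller
goes from three arguments to two. All sketch_* / SKETCH_* names and
struct sketch_item are hypothetical, and the cursor name is built from
__LINE__ here, whereas the quoted patch uses __UNIQUE_ID().

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void sketch_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void sketch_list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void sketch_list_del(struct list_head *entry)
{
	assert(entry->next && entry->prev);
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = NULL;
	entry->prev = NULL;
}

/* Build a cursor name unique per source line (assumption: one loop per line). */
#define SKETCH_CAT_(a, b) a##b
#define SKETCH_CAT(a, b) SKETCH_CAT_(a, b)

#define SKETCH_FOR_EACH_INTERNAL(pos, tmp, head)			\
	for (struct list_head *tmp = ((pos) = (head)->next)->next;	\
	     (pos) != (head);						\
	     (pos) = tmp, tmp = (pos)->next)

/* Caller-facing form: the temporary cursor is declared inside the loop. */
#define sketch_list_for_each_safe(pos, head)				\
	SKETCH_FOR_EACH_INTERNAL(pos, SKETCH_CAT(sketch_tmp_, __LINE__), head)

struct sketch_item {
	int value;
	struct list_head node;
};

/* Queue values 0..5, delete the odd ones while iterating, count the rest. */
static int sketch_demo(void)
{
	struct sketch_item items[6];
	struct list_head head, *pos;
	int i, remaining = 0;

	sketch_list_init(&head);
	for (i = 0; i < 6; i++) {
		items[i].value = i;
		sketch_list_add_tail(&items[i].node, &head);
	}

	sketch_list_for_each_safe(pos, &head) {
		struct sketch_item *item = (struct sketch_item *)
			((char *)pos - offsetof(struct sketch_item, node));

		if (item->value % 2)
			sketch_list_del(pos);	/* next node was saved already */
	}

	for (pos = head.next; pos != &head; pos = pos->next)
		remaining++;
	return remaining;
}
```

The drawback listed under Cons (1) is visible here: the loop body cannot
see or reset the saved cursor, so any caller that depends on it has to
fall back to an open-coded loop.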


Which of the six solutions above do people prefer?

-- 
Thanks
Kaitao Cheng



* [PATCH] ice: propagate ETH56G deskew read errors
From: Pengpeng Hou @ 2026-06-25  3:03 UTC
  To: Tony Nguyen, Przemek Kitszel
  Cc: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Richard Cochran, intel-wired-lan, netdev, linux-kernel, pengpeng

ice_ptp_calc_deskew_eth56g() returns a u32 deskew value, but it also
returns the negative read_poll_timeout() error when the DESKEW valid bit
never appears. That converts the negative error into a large unsigned
deskew contribution, which can then be folded into the RX timestamp
offset and programmed into hardware.

Return the deskew value through an output parameter and propagate the
read error from ice_phy_set_offsets_eth56g() instead of using it as
offset data.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
 drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 27 +++++++++++++++------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
index 8e5f97835954..bd2e31b816a8 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
@@ -1736,17 +1736,21 @@ static u32 ice_ptp_calc_bitslip_eth56g(struct ice_hw *hw, u8 port, u32 bs,
  * @ds: deskew multiplier
  * @rs: RS-FEC enabled
  * @spd: link speed
+ * @deskew: calculated deskew value
  *
- * Return: calculated deskew value
+ * Return: 0 on success, negative error code otherwise
  */
-static u32 ice_ptp_calc_deskew_eth56g(struct ice_hw *hw, u8 port, u32 ds,
-				      bool rs, enum ice_eth56g_link_spd spd)
+static int ice_ptp_calc_deskew_eth56g(struct ice_hw *hw, u8 port, u32 ds,
+				      bool rs, enum ice_eth56g_link_spd spd,
+				      u32 *deskew)
 {
 	u32 deskew_i, deskew_f;
 	int err;
 
-	if (!ds)
+	if (!ds) {
+		*deskew = 0;
 		return 0;
+	}
 
 	read_poll_timeout(ice_read_ptp_reg_eth56g, err,
 			  FIELD_GET(PHY_REG_DESKEW_0_VALID, deskew_i), 500,
@@ -1766,7 +1770,9 @@ static u32 ice_ptp_calc_deskew_eth56g(struct ice_hw *hw, u8 port, u32 ds,
 	deskew_i = FIELD_PREP(ICE_ETH56G_MAC_CFG_RX_OFFSET_INT, deskew_i);
 	/* Shift 3 fractional bits to the end of the integer part */
 	deskew_f <<= ICE_ETH56G_MAC_CFG_FRAC_W - PHY_REG_DESKEW_0_RLEVEL_FRAC_W;
-	return mul_u32_u32_fx_q9(deskew_i | deskew_f, ds);
+	*deskew = mul_u32_u32_fx_q9(deskew_i | deskew_f, ds);
+
+	return 0;
 }
 
 /**
@@ -1789,6 +1795,7 @@ static int ice_phy_set_offsets_eth56g(struct ice_hw *hw, u8 port,
 {
 	u32 rx_offset, tx_offset, bs_ds;
 	bool onestep, sfd;
+	int err;
 
 	onestep = hw->ptp.phy.eth56g.onestep_ena;
 	sfd = hw->ptp.phy.eth56g.sfd_ena;
@@ -1805,11 +1812,15 @@ static int ice_phy_set_offsets_eth56g(struct ice_hw *hw, u8 port,
 	if (sfd)
 		rx_offset = add_u32_u32_fx(rx_offset, cfg->rx_offset.sfd);
 
-	if (spd < ICE_ETH56G_LNK_SPD_40G)
+	if (spd < ICE_ETH56G_LNK_SPD_40G) {
 		bs_ds = ice_ptp_calc_bitslip_eth56g(hw, port, bs_ds, fc, rs,
 						    spd);
-	else
-		bs_ds = ice_ptp_calc_deskew_eth56g(hw, port, bs_ds, rs, spd);
+	} else {
+		err = ice_ptp_calc_deskew_eth56g(hw, port, bs_ds, rs, spd,
+						 &bs_ds);
+		if (err)
+			return err;
+	}
 	rx_offset = add_u32_u32_fx(rx_offset, bs_ds);
 	rx_offset &= ICE_ETH56G_MAC_CFG_RX_OFFSET_INT |
 		     ICE_ETH56G_MAC_CFG_RX_OFFSET_FRAC;
-- 
2.50.1 (Apple Git-155)
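
The failure mode described in the commit message of this patch can be
reproduced in a few lines of userspace C. This is only a sketch under
stated assumptions: sketch_deskew_buggy(), sketch_deskew_fixed(),
sketch_set_offset(), and the value 42 are hypothetical stand-ins, not
driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Old shape: value and error share one unsigned return channel. */
static uint32_t sketch_deskew_buggy(int valid_bit_seen)
{
	if (!valid_bit_seen)
		return -ETIMEDOUT;	/* silently converted to a huge u32 */
	return 42;			/* hypothetical deskew value */
}

/* New shape: errno in the return value, data through an out-parameter. */
static int sketch_deskew_fixed(int valid_bit_seen, uint32_t *deskew)
{
	if (!valid_bit_seen)
		return -ETIMEDOUT;
	*deskew = 42;
	return 0;
}

/* Caller in the style of the patched offset-setting function. */
static int sketch_set_offset(int valid_bit_seen, uint32_t *rx_offset)
{
	uint32_t deskew;
	int err;

	err = sketch_deskew_fixed(valid_bit_seen, &deskew);
	if (err)
		return err;	/* nothing is folded into the offset */
	*rx_offset += deskew;
	return 0;
}
```

With the old shape the caller has no way to tell a timeout from a very
large deskew value; with the new shape the offset is left untouched on
error.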



* [PATCH] net: pch_gbe: return errors from MIIM accesses
From: Pengpeng Hou @ 2026-06-25  3:05 UTC
  To: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: netdev, linux-kernel, pengpeng

pch_gbe_mac_ctrl_miim() polls for the MIIM controller to become ready,
but returns zero on the initial ready timeout and ignores the completion
timeout after issuing the operation. MDIO and PHY helpers can then report
success with zero or stale data.

Make the MIIM helper return an errno and pass read data through an output
parameter. Propagate the error through the MDIO read path, the probe-time
PHY discovery path, and the internal PHY register helpers that already
return an error status.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
 .../net/ethernet/oki-semi/pch_gbe/pch_gbe.h   |  4 +-
 .../ethernet/oki-semi/pch_gbe/pch_gbe_main.c  | 54 ++++++++++++++-----
 .../ethernet/oki-semi/pch_gbe/pch_gbe_phy.c   | 22 +++++---
 3 files changed, 57 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
index 108f312bc542..4bdf0afca462 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
@@ -619,6 +619,6 @@ void pch_gbe_set_ethtool_ops(struct net_device *netdev);
 
 /* pch_gbe_mac.c */
 s32 pch_gbe_mac_force_mac_fc(struct pch_gbe_hw *hw);
-u16 pch_gbe_mac_ctrl_miim(struct pch_gbe_hw *hw, u32 addr, u32 dir, u32 reg,
-			  u16 data);
+int pch_gbe_mac_ctrl_miim(struct pch_gbe_hw *hw, u32 addr, u32 dir, u32 reg,
+			  u16 data, u16 *read_data);
 #endif /* _PCH_GBE_H_ */
diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 62f05f4569b1..61d47b529a0e 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -476,35 +476,48 @@ static void pch_gbe_mac_set_wol_event(struct pch_gbe_hw *hw, u32 wu_evt)
  * @dir:  Operetion. (Write or Read)
  * @reg:  Access register of PHY
  * @data: Write data.
+ * @read_data: Read data.
  *
- * Returns: Read date.
+ * Return: 0 on success, negative error code on failure.
  */
-u16 pch_gbe_mac_ctrl_miim(struct pch_gbe_hw *hw, u32 addr, u32 dir, u32 reg,
-			u16 data)
+int pch_gbe_mac_ctrl_miim(struct pch_gbe_hw *hw, u32 addr, u32 dir, u32 reg,
+			  u16 data, u16 *read_data)
 {
 	struct pch_gbe_adapter *adapter = pch_gbe_hw_to_adapter(hw);
 	unsigned long flags;
 	u32 data_out;
+	int ret;
 
 	spin_lock_irqsave(&hw->miim_lock, flags);
 
-	if (readx_poll_timeout_atomic(ioread32, &hw->reg->MIIM, data_out,
-				      data_out & PCH_GBE_MIIM_OPER_READY, 20, 2000)) {
+	ret = readx_poll_timeout_atomic(ioread32, &hw->reg->MIIM, data_out,
+					data_out & PCH_GBE_MIIM_OPER_READY, 20,
+					2000);
+	if (ret) {
 		netdev_err(adapter->netdev, "pch-gbe.miim won't go Ready\n");
 		spin_unlock_irqrestore(&hw->miim_lock, flags);
-		return 0;	/* No way to indicate timeout error */
+		return ret;
 	}
 	iowrite32(((reg << PCH_GBE_MIIM_REG_ADDR_SHIFT) |
 		  (addr << PCH_GBE_MIIM_PHY_ADDR_SHIFT) |
 		  dir | data), &hw->reg->MIIM);
-	readx_poll_timeout_atomic(ioread32, &hw->reg->MIIM, data_out,
-				  data_out & PCH_GBE_MIIM_OPER_READY, 20, 2000);
+	ret = readx_poll_timeout_atomic(ioread32, &hw->reg->MIIM, data_out,
+					data_out & PCH_GBE_MIIM_OPER_READY, 20,
+					2000);
+	if (ret) {
+		netdev_err(adapter->netdev, "pch-gbe.miim operation timed out\n");
+		spin_unlock_irqrestore(&hw->miim_lock, flags);
+		return ret;
+	}
 	spin_unlock_irqrestore(&hw->miim_lock, flags);
 
 	netdev_dbg(adapter->netdev, "PHY %s: reg=%d, data=0x%04X\n",
 		   dir == PCH_GBE_MIIM_OPER_READ ? "READ" : "WRITE", reg,
 		   dir == PCH_GBE_MIIM_OPER_READ ? data_out : data);
-	return (u16) data_out;
+	if (dir == PCH_GBE_MIIM_OPER_READ && read_data)
+		*read_data = (u16)data_out;
+
+	return 0;
 }
 
 /**
@@ -589,14 +602,20 @@ static int pch_gbe_init_phy(struct pch_gbe_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
 	u32 addr;
-	u16 bmcr, stat;
+	int bmcr, stat;
 
 	/* Discover phy addr by searching addrs in order {1,0,2,..., 31} */
 	for (addr = 0; addr < PCH_GBE_PHY_REGS_LEN; addr++) {
 		adapter->mii.phy_id = (addr == 0) ? 1 : (addr == 1) ? 0 : addr;
 		bmcr = pch_gbe_mdio_read(netdev, adapter->mii.phy_id, MII_BMCR);
+		if (bmcr < 0)
+			return bmcr;
 		stat = pch_gbe_mdio_read(netdev, adapter->mii.phy_id, MII_BMSR);
+		if (stat < 0)
+			return stat;
 		stat = pch_gbe_mdio_read(netdev, adapter->mii.phy_id, MII_BMSR);
+		if (stat < 0)
+			return stat;
 		if (!((bmcr == 0xFFFF) || ((stat == 0) && (bmcr == 0))))
 			break;
 	}
@@ -611,6 +630,8 @@ static int pch_gbe_init_phy(struct pch_gbe_adapter *adapter)
 					   BMCR_ISOLATE);
 		} else {
 			bmcr = pch_gbe_mdio_read(netdev, addr, MII_BMCR);
+			if (bmcr < 0)
+				return bmcr;
 			pch_gbe_mdio_write(netdev, addr, MII_BMCR,
 					   bmcr & ~BMCR_ISOLATE);
 		}
@@ -639,9 +660,15 @@ static int pch_gbe_mdio_read(struct net_device *netdev, int addr, int reg)
 {
 	struct pch_gbe_adapter *adapter = netdev_priv(netdev);
 	struct pch_gbe_hw *hw = &adapter->hw;
+	u16 data;
+	int ret;
+
+	ret = pch_gbe_mac_ctrl_miim(hw, addr, PCH_GBE_HAL_MIIM_READ, reg,
+				    0, &data);
+	if (ret)
+		return ret;
 
-	return pch_gbe_mac_ctrl_miim(hw, addr, PCH_GBE_HAL_MIIM_READ, reg,
-				     (u16) 0);
+	return data;
 }
 
 /**
@@ -657,7 +684,8 @@ static void pch_gbe_mdio_write(struct net_device *netdev,
 	struct pch_gbe_adapter *adapter = netdev_priv(netdev);
 	struct pch_gbe_hw *hw = &adapter->hw;
 
-	pch_gbe_mac_ctrl_miim(hw, addr, PCH_GBE_HAL_MIIM_WRITE, reg, data);
+	pch_gbe_mac_ctrl_miim(hw, addr, PCH_GBE_HAL_MIIM_WRITE, reg, data,
+			      NULL);
 }
 
 /**
diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_phy.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_phy.c
index 3426f6fa2b57..edf3644f7589 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_phy.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_phy.c
@@ -139,9 +139,10 @@ s32 pch_gbe_phy_read_reg_miic(struct pch_gbe_hw *hw, u32 offset, u16 *data)
 			   offset);
 		return -EINVAL;
 	}
-	*data = pch_gbe_mac_ctrl_miim(hw, phy->addr, PCH_GBE_HAL_MIIM_READ,
-				      offset, (u16)0);
-	return 0;
+
+	*data = 0;
+	return pch_gbe_mac_ctrl_miim(hw, phy->addr, PCH_GBE_HAL_MIIM_READ,
+				     offset, 0, data);
 }
 
 /**
@@ -164,9 +165,8 @@ s32 pch_gbe_phy_write_reg_miic(struct pch_gbe_hw *hw, u32 offset, u16 data)
 			   offset);
 		return -EINVAL;
 	}
-	pch_gbe_mac_ctrl_miim(hw, phy->addr, PCH_GBE_HAL_MIIM_WRITE,
-				 offset, data);
-	return 0;
+	return pch_gbe_mac_ctrl_miim(hw, phy->addr, PCH_GBE_HAL_MIIM_WRITE,
+				     offset, data, NULL);
 }
 
 /**
@@ -266,13 +266,19 @@ static int pch_gbe_phy_tx_clk_delay(struct pch_gbe_hw *hw)
 	case PHY_AR803X_ID:
 		netdev_dbg(adapter->netdev,
 			   "Configuring AR803X PHY for 2ns TX clock delay\n");
-		pch_gbe_phy_read_reg_miic(hw, PHY_AR8031_DBG_OFF, &mii_reg);
+		ret = pch_gbe_phy_read_reg_miic(hw, PHY_AR8031_DBG_OFF,
+						&mii_reg);
+		if (ret)
+			break;
 		ret = pch_gbe_phy_write_reg_miic(hw, PHY_AR8031_DBG_OFF,
 						 PHY_AR8031_SERDES);
 		if (ret)
 			break;
 
-		pch_gbe_phy_read_reg_miic(hw, PHY_AR8031_DBG_DAT, &mii_reg);
+		ret = pch_gbe_phy_read_reg_miic(hw, PHY_AR8031_DBG_DAT,
+						&mii_reg);
+		if (ret)
+			break;
 		mii_reg |= PHY_AR8031_SERDES_TX_CLK_DLY;
 		ret = pch_gbe_phy_write_reg_miic(hw, PHY_AR8031_DBG_DAT,
 						 mii_reg);
-- 
2.50.1 (Apple Git-155)
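
Why bmcr and stat change from u16 to int in this patch can be shown with
a small userspace sketch. The names sketch_miim_op(), sketch_mdio_read(),
and sketch_u16_hides_error(), and the register value 0x1140, are
hypothetical; only the return convention mirrors the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the MIIM helper: errno on timeout, data via out-parameter. */
static int sketch_miim_op(int ready, int is_read, uint16_t *read_data)
{
	if (!ready)
		return -ETIMEDOUT;
	if (is_read && read_data)
		*read_data = 0x1140;	/* hypothetical register contents */
	return 0;
}

/* MDIO-style read: negative errno on failure, otherwise the 16-bit value. */
static int sketch_mdio_read(int ready)
{
	uint16_t data;
	int ret;

	ret = sketch_miim_op(ready, 1, &data);
	if (ret)
		return ret;
	return data;
}

/* Returns 1 if a u16 local would make the error look like register data. */
static int sketch_u16_hides_error(void)
{
	uint16_t truncated = (uint16_t)sketch_mdio_read(0);
	int kept = sketch_mdio_read(0);

	return truncated != 0 && kept < 0;
}
```

A caller that stores the result in an int can test for a negative value
before using it, which is the check the patch adds in the PHY discovery
loop.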



* [PATCH v5] net: mvneta_bm: add suspend/resume support to prevent crash after resume
From: Yun Zhou @ 2026-06-25  3:09 UTC
  To: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: netdev, linux-kernel, yun.zhou

The mvneta driver uses the hardware Buffer Manager (BM) for RX buffer
allocation. During suspend, mvneta disables its clock, causing BM to
lose all buffer address state. On resume, mvneta_bm_port_init() re-
attaches the BM pool to the NIC, but BM hardware returns stale/garbage
buffer addresses. When NAPI poll processes these buffers, DMA cache
sync hits an invalid virtual address causing a kernel panic:

 Unable to handle kernel paging request at virtual address b0000080
 PC is at v7_dma_inv_range
 Call trace:
  v7_dma_inv_range from arch_sync_dma_for_cpu+0x94/0x158
  arch_sync_dma_for_cpu from __dma_sync_single_for_cpu+0xc4/0x15c
  __dma_sync_single_for_cpu from mvneta_rx_swbm+0x6c8/0xf48
  mvneta_rx_swbm from mvneta_poll+0x6fc/0x70c
  mvneta_poll from __napi_poll.constprop.0+0x2c/0x1e0
  __napi_poll.constprop.0 from net_rx_action+0x160/0x2c4
  net_rx_action from handle_softirqs+0xd8/0x2b8
  handle_softirqs from run_ksoftirqd+0x30/0x94
  run_ksoftirqd from smpboot_thread_fn+0x100/0x204
  smpboot_thread_fn from kthread+0xf4/0x110
  kthread from ret_from_fork+0x14/0x28

Fix by adding suspend/resume callbacks to the BM driver:

- suspend: drain all buffers (with DMA unmapping), free the BPPE
  regions, and reset pool state to FREE before stopping BM and gating
  the clock.

- resume: enable the clock, reinitialize BM defaults, and restore pool
  read/write pointers and size registers. Pool allocation and buffer
  refill are handled by mvneta_resume() through the normal
  mvneta_bm_port_init() path, which sees pools as FREE and performs
  full initialization identical to probe.

Add a device_link (DL_FLAG_AUTOREMOVE_CONSUMER) in mvneta_probe to
guarantee BM resumes before mvneta and suspends after mvneta. If the
link cannot be created, fall back to SW buffer management to avoid a
potential crash on resume due to unordered PM transitions.

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v5:
  - Call mvneta_bm_pool_disable() per pool before dma_free_coherent()
    in suspend, matching mvneta_bm_pool_destroy() ordering. This also
    ensures ENABLE_MASK is cleared before BM START on resume.
  - Guard dma_free_coherent() with if (bm_pool->virt_addr) to defend
    against a partially-failed mvneta_bm_pool_create() leaving a stale
    pointer.

v4:
  - On device_link_add() failure, fall back to SW buffer management
    (destroy pools, put BM reference, clear bm_priv) instead of merely
    emitting a warning. Without the link, suspend/resume ordering is
    not guaranteed and the original crash can still occur.

v3:
  - Restore per-pool POOL_SIZE_REG, POOL_READ_PTR_REG, and
    POOL_WRITE_PTR_REG in resume, since clock gating loses all BM
    register state.
  - Check device_link_add() return value and emit dev_warn on failure.
  - Replace SIMPLE_DEV_PM_OPS (deprecated) with
    DEFINE_SIMPLE_DEV_PM_OPS and pm_sleep_ptr(), removing the
    #ifdef CONFIG_PM_SLEEP guard.
  - Add dev_warn in suspend if not all buffers could be freed.

v2:
  - Drain buffers via mvneta_bm_bufs_free() in suspend instead of only
    stopping BM and gating the clock. This ensures proper DMA unmapping
    and avoids buffer leaks.
  - Free the BPPE DMA-coherent region in suspend so that resume takes
    the full probe-time initialization path (alloc + fill), eliminating
    the need to modify mvneta_bm_pool_create().
  - Reset pool type to MVNETA_BM_FREE in suspend so mvneta_bm_pool_use()
    correctly re-creates and refills pools on resume.
  - Check clk_prepare_enable() return value in resume.
  - Add device_link between mvneta (consumer) and mvneta_bm (supplier)
    to guarantee correct suspend/resume ordering.

 drivers/net/ethernet/marvell/mvneta.c    | 18 +++++++
 drivers/net/ethernet/marvell/mvneta_bm.c | 63 ++++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 744d6585a949..543e566425c1 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -5678,6 +5678,24 @@ static int mvneta_probe(struct platform_device *pdev)
 					 "use SW buffer management\n");
 				mvneta_bm_put(pp->bm_priv);
 				pp->bm_priv = NULL;
+			} else if (!device_link_add(&pdev->dev,
+						    &pp->bm_priv->pdev->dev,
+						    DL_FLAG_AUTOREMOVE_CONSUMER)) {
+				/*
+				 * Link guarantees BM resumes before mvneta.
+				 * Without it, BM may not be ready when
+				 * mvneta_bm_port_init() runs on resume,
+				 * causing stale buffer addresses and a crash.
+				 * Fall back to SW management to be safe.
+				 */
+				dev_warn(&pdev->dev,
+					 "failed to link to BM, use SW buffer management\n");
+				mvneta_bm_pool_destroy(pp->bm_priv,
+						       pp->pool_long, 1 << pp->id);
+				mvneta_bm_pool_destroy(pp->bm_priv,
+						       pp->pool_short, 1 << pp->id);
+				mvneta_bm_put(pp->bm_priv);
+				pp->bm_priv = NULL;
 			}
 		}
 		/* Set RX packet offset correction for platforms, whose
diff --git a/drivers/net/ethernet/marvell/mvneta_bm.c b/drivers/net/ethernet/marvell/mvneta_bm.c
index 6bb380494919..c23982bfc20b 100644
--- a/drivers/net/ethernet/marvell/mvneta_bm.c
+++ b/drivers/net/ethernet/marvell/mvneta_bm.c
@@ -477,6 +477,68 @@ static void mvneta_bm_remove(struct platform_device *pdev)
 	clk_disable_unprepare(priv->clk);
 }
 
+static int mvneta_bm_suspend(struct device *dev)
+{
+	struct mvneta_bm *priv = dev_get_drvdata(dev);
+	int i;
+
+	/* Drain buffers and free pool resources while BM is still clocked */
+	for (i = 0; i < MVNETA_BM_POOLS_NUM; i++) {
+		struct mvneta_bm_pool *bm_pool = &priv->bm_pools[i];
+		int size_bytes;
+
+		if (bm_pool->type == MVNETA_BM_FREE)
+			continue;
+
+		mvneta_bm_bufs_free(priv, bm_pool, bm_pool->port_map);
+		if (bm_pool->hwbm_pool.buf_num)
+			dev_warn(&priv->pdev->dev,
+				 "pool %d: %d buffers not freed\n",
+				 bm_pool->id, bm_pool->hwbm_pool.buf_num);
+
+		mvneta_bm_pool_disable(priv, bm_pool->id);
+
+		if (bm_pool->virt_addr) {
+			size_bytes = sizeof(u32) * bm_pool->hwbm_pool.size;
+			dma_free_coherent(&priv->pdev->dev, size_bytes,
+					  bm_pool->virt_addr,
+					  bm_pool->phys_addr);
+			bm_pool->virt_addr = NULL;
+		}
+		bm_pool->type = MVNETA_BM_FREE;
+	}
+
+	mvneta_bm_write(priv, MVNETA_BM_COMMAND_REG, MVNETA_BM_STOP_MASK);
+	clk_disable_unprepare(priv->clk);
+	return 0;
+}
+
+static int mvneta_bm_resume(struct device *dev)
+{
+	struct mvneta_bm *priv = dev_get_drvdata(dev);
+	int i, err;
+
+	err = clk_prepare_enable(priv->clk);
+	if (err)
+		return err;
+
+	/* Reinitialize BM hardware; pools are refilled by mvneta_resume() */
+	mvneta_bm_default_set(priv);
+
+	/* Restore pool registers lost during clock gating */
+	for (i = 0; i < MVNETA_BM_POOLS_NUM; i++) {
+		mvneta_bm_write(priv, MVNETA_BM_POOL_READ_PTR_REG(i), 0);
+		mvneta_bm_write(priv, MVNETA_BM_POOL_WRITE_PTR_REG(i), 0);
+		mvneta_bm_write(priv, MVNETA_BM_POOL_SIZE_REG(i),
+				priv->bm_pools[i].hwbm_pool.size);
+	}
+
+	mvneta_bm_write(priv, MVNETA_BM_COMMAND_REG, MVNETA_BM_START_MASK);
+	return 0;
+}
+
+static DEFINE_SIMPLE_DEV_PM_OPS(mvneta_bm_pm_ops, mvneta_bm_suspend, mvneta_bm_resume);
+
 static const struct of_device_id mvneta_bm_match[] = {
 	{ .compatible = "marvell,armada-380-neta-bm" },
 	{ }
@@ -489,6 +551,7 @@ static struct platform_driver mvneta_bm_driver = {
 	.driver = {
 		.name = MVNETA_BM_DRIVER_NAME,
 		.of_match_table = mvneta_bm_match,
+		.pm = pm_sleep_ptr(&mvneta_bm_pm_ops),
 	},
 };
 
-- 
2.43.0
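
The suspend/resume split described in the commit message of this patch
can be modelled in userspace as follows. This is a sketch, not driver
code: every sketch_* name is hypothetical, malloc() stands in for the
DMA-coherent allocation, and buffer draining is reduced to a counter.

```c
#include <assert.h>
#include <stdlib.h>

enum sketch_pool_type { SKETCH_BM_FREE, SKETCH_BM_LONG };

struct sketch_pool {
	enum sketch_pool_type type;
	void *virt_addr;	/* models the BPPE region */
	int size;		/* pool capacity, kept across suspend */
	int buf_num;		/* buffers currently held by the pool */
};

/* Suspend: drain buffers, free the region, mark the pool FREE. */
static void sketch_bm_suspend(struct sketch_pool *pool)
{
	if (pool->type == SKETCH_BM_FREE)
		return;
	pool->buf_num = 0;
	if (pool->virt_addr) {
		free(pool->virt_addr);
		pool->virt_addr = NULL;
	}
	pool->type = SKETCH_BM_FREE;
}

/* Resume only restores hardware defaults; it does not refill the pool. */
static void sketch_bm_resume(struct sketch_pool *pool)
{
	(void)pool;
}

/* Port init: a FREE pool takes the full probe-time path (alloc + fill). */
static int sketch_port_init(struct sketch_pool *pool)
{
	if (pool->type != SKETCH_BM_FREE)
		return 0;	/* already set up */
	pool->virt_addr = malloc(sizeof(unsigned int) * (size_t)pool->size);
	if (!pool->virt_addr)
		return -1;
	pool->type = SKETCH_BM_LONG;
	pool->buf_num = pool->size;
	return 0;
}
```

The ordering guarantee itself (BM suspends after mvneta and resumes
before it) comes from the device link and is not modelled here.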



* Re: [PATCH net v3 01/11] rxrpc: Fix ACKALL packet handling
From: Jeffrey E Altman @ 2026-06-25  3:31 UTC
  To: David Howells, netdev
  Cc: Marc Dionne, Jakub Kicinski, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, linux-afs, linux-kernel, Wyatt Feng,
	Yuan Tan, Yifan Wu, Juefei Pu, Zhengchuan Liang, Xin Liu, Ren Wei,
	stable
In-Reply-To: <20260624163819.3017002-2-dhowells@redhat.com>

On 6/24/2026 12:38 PM, David Howells wrote:

> From: Wyatt Feng <bronzed_45_vested@icloud.com>
>
> rxrpc_input_ackall() accepts ACKALL packets without checking whether the
> call is in a state that can legitimately have outstanding transmit buffers.
> A forged ACKALL can therefore reach a new service call in
> RXRPC_CALL_SERVER_RECV_REQUEST before any reply packets have been queued.
>
> In that state call->tx_top is zero and call->tx_queue is NULL, so
> rxrpc_rotate_tx_window() dereferences a NULL txqueue and triggers a
> null-pointer dereference.
>
> Fix the handling of ACKALL packets by the following means:
>
>   (1) Add two new call states: RXRPC_CALL_CLIENT_PRE_SEND which indicates
>       that the client call is connected, but nothing has been transmitted as
>       yet; and RXRPC_CALL_CLIENT_AWAIT_ACK, which indicates that everything
>       has been transmitted at least once, but we're now waiting for the
>       stuff remaining in the Tx buffer to be ACK'd (retransmissions may
>       still happen).
>
>       The RXRPC_CALL_CLIENT_PRE_SEND state is set when the call is assigned
>       a channel and transitions to RXRPC_CALL_CLIENT_SEND_REQUEST when the
>       first packet is transmitted.
>
>       RXRPC_CALL_CLIENT_AWAIT_REPLY is then narrowed in scope to indicate
>       that all Tx packets have been ACK'd and we're now waiting for the
>       reply to be received.
>
>   (2) As per Wyatt Feng's original patch[1], the ACKALL handler then checks
>       that the call state is one in which there might be stuff in the Tx
>       buffer to ACK, but now this includes AWAIT_ACK rather than
>       AWAIT_REPLY.  ACKALL packets are ignored if received in the wrong
>       state.
>
>       Note that unlike Wyatt Feng's patch, it's no longer necessary to check
>       to see if the Tx buffer exists as the state set now covers this.
>
>   (3) Make the ACKALL handler use call->tx_transmitted rather than
>       call->tx_top as the former is explicitly the highest packet seq number
>       transmitted, whereas the latter has a looser definition.
>
> Thanks to Jeffrey Altman for a description of the history of the ACKALL
> packet[1].

The Link reference should be [2] instead of [1].

> Fixes: b341a0263b1b ("rxrpc: Implement progressive transmission queue struct")
> Reported-by: Yuan Tan <yuantan098@gmail.com>
> Reported-by: Yifan Wu <yifanwucs@gmail.com>
> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
> Reported-by: Xin Liu <bird@lzu.edu.cn>
> Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
> Co-developed-by: David Howells <dhowells@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Ren Wei <n05ec@lzu.edu.cn>
> cc: Marc Dionne <marc.dionne@auristor.com>
> cc: linux-afs@lists.infradead.org
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/r/20260616155749.2125907-2-dhowells@redhat.com/ [1]
> Link: https://lore.kernel.org/r/c0fd4fec-1576-4070-b31e-a37d5506f5ed@auristor.com/ [2]
> ---
>   net/rxrpc/ar-internal.h |  2 ++
>   net/rxrpc/call_event.c  |  5 ++++-
>   net/rxrpc/call_object.c |  2 ++
>   net/rxrpc/conn_client.c |  2 +-
>   net/rxrpc/input.c       | 23 +++++++++++++++++++----
>   net/rxrpc/sendmsg.c     |  3 ++-
>   6 files changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
> index 98f2165159d7..b6ccd8a8199b 100644
> --- a/net/rxrpc/ar-internal.h
> +++ b/net/rxrpc/ar-internal.h
> @@ -650,7 +650,9 @@ enum rxrpc_call_event {
>   enum rxrpc_call_state {
>   	RXRPC_CALL_UNINITIALISED,
>   	RXRPC_CALL_CLIENT_AWAIT_CONN,	/* - client waiting for connection to become available */
> +	RXRPC_CALL_CLIENT_PRE_SEND,	/* - client is connected, but hasn't sent anything yet */
>   	RXRPC_CALL_CLIENT_SEND_REQUEST,	/* - client sending request phase */
> +	RXRPC_CALL_CLIENT_AWAIT_ACK,	/* - client awaiting ACKs of request */
>   	RXRPC_CALL_CLIENT_AWAIT_REPLY,	/* - client awaiting reply */
>   	RXRPC_CALL_CLIENT_RECV_REPLY,	/* - client receiving reply phase */
>   	RXRPC_CALL_SERVER_PREALLOC,	/* - service preallocation */
> diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
> index fec59d9338b9..21be9c86d7a7 100644
> --- a/net/rxrpc/call_event.c
> +++ b/net/rxrpc/call_event.c
> @@ -178,7 +178,7 @@ static void rxrpc_close_tx_phase(struct rxrpc_call *call)
>   
>   	switch (__rxrpc_call_state(call)) {
>   	case RXRPC_CALL_CLIENT_SEND_REQUEST:
> -		rxrpc_set_call_state(call, RXRPC_CALL_CLIENT_AWAIT_REPLY);
> +		rxrpc_set_call_state(call, RXRPC_CALL_CLIENT_AWAIT_ACK);
>   		break;
>   	case RXRPC_CALL_SERVER_SEND_REPLY:
>   		rxrpc_set_call_state(call, RXRPC_CALL_SERVER_AWAIT_ACK);
> @@ -244,6 +244,8 @@ static void rxrpc_transmit_fresh_data(struct rxrpc_call *call, unsigned int limi
>   				break;
>   		} while (req.n < limit && before(seq, send_top));
>   
> +		if (__rxrpc_call_state(call) == RXRPC_CALL_CLIENT_PRE_SEND)
> +			rxrpc_set_call_state(call, RXRPC_CALL_CLIENT_SEND_REQUEST);
>   		if (txb->flags & RXRPC_LAST_PACKET) {
>   			rxrpc_close_tx_phase(call);
>   			tq = NULL;
> @@ -267,6 +269,7 @@ void rxrpc_transmit_some_data(struct rxrpc_call *call, unsigned int limit,
>   		fallthrough;
>   
>   	case RXRPC_CALL_SERVER_SEND_REPLY:
> +	case RXRPC_CALL_CLIENT_PRE_SEND:
>   	case RXRPC_CALL_CLIENT_SEND_REQUEST:
>   		if (!rxrpc_tx_window_space(call))
>   			return;
> diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
> index fcb9d38bb521..817ed9acb91e 100644
> --- a/net/rxrpc/call_object.c
> +++ b/net/rxrpc/call_object.c
> @@ -18,7 +18,9 @@
>   const char *const rxrpc_call_states[NR__RXRPC_CALL_STATES] = {
>   	[RXRPC_CALL_UNINITIALISED]		= "Uninit  ",
>   	[RXRPC_CALL_CLIENT_AWAIT_CONN]		= "ClWtConn",
> +	[RXRPC_CALL_CLIENT_PRE_SEND]		= "ClPreSnd",
>   	[RXRPC_CALL_CLIENT_SEND_REQUEST]	= "ClSndReq",
> +	[RXRPC_CALL_CLIENT_AWAIT_ACK]		= "ClAwtAck",
>   	[RXRPC_CALL_CLIENT_AWAIT_REPLY]		= "ClAwtRpl",
>   	[RXRPC_CALL_CLIENT_RECV_REPLY]		= "ClRcvRpl",
>   	[RXRPC_CALL_SERVER_PREALLOC]		= "SvPrealc",
> diff --git a/net/rxrpc/conn_client.c b/net/rxrpc/conn_client.c
> index 9b757798dedd..48519f0de185 100644
> --- a/net/rxrpc/conn_client.c
> +++ b/net/rxrpc/conn_client.c
> @@ -449,7 +449,7 @@ static void rxrpc_activate_one_channel(struct rxrpc_connection *conn,
>   	trace_rxrpc_connect_call(call);
>   	call->tx_last_sent = ktime_get_real();
>   	rxrpc_start_call_timer(call);
> -	rxrpc_set_call_state(call, RXRPC_CALL_CLIENT_SEND_REQUEST);
> +	rxrpc_set_call_state(call, RXRPC_CALL_CLIENT_PRE_SEND);
>   	wake_up(&call->waitq);
>   }
>   
> diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
> index ce761466b02d..2eedab1b0919 100644
> --- a/net/rxrpc/input.c
> +++ b/net/rxrpc/input.c
> @@ -181,7 +181,8 @@ void rxrpc_congestion_degrade(struct rxrpc_call *call)
>   	if (call->cong_ca_state != RXRPC_CA_SLOW_START &&
>   	    call->cong_ca_state != RXRPC_CA_CONGEST_AVOIDANCE)
>   		return;
> -	if (__rxrpc_call_state(call) == RXRPC_CALL_CLIENT_AWAIT_REPLY)
> +	if (__rxrpc_call_state(call) == RXRPC_CALL_CLIENT_AWAIT_ACK ||
> +	    __rxrpc_call_state(call) == RXRPC_CALL_CLIENT_AWAIT_REPLY)
>   		return;
>   
>   	rtt = ns_to_ktime(call->srtt_us * (NSEC_PER_USEC / 8));
> @@ -356,6 +357,7 @@ static void rxrpc_end_tx_phase(struct rxrpc_call *call, bool reply_begun,
>   
>   	switch (__rxrpc_call_state(call)) {
>   	case RXRPC_CALL_CLIENT_SEND_REQUEST:
> +	case RXRPC_CALL_CLIENT_AWAIT_ACK:
>   	case RXRPC_CALL_CLIENT_AWAIT_REPLY:
>   		if (reply_begun) {
>   			rxrpc_set_call_state(call, RXRPC_CALL_CLIENT_RECV_REPLY);
> @@ -694,6 +696,7 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb)
>   
>   	switch (__rxrpc_call_state(call)) {
>   	case RXRPC_CALL_CLIENT_SEND_REQUEST:
> +	case RXRPC_CALL_CLIENT_AWAIT_ACK:
>   	case RXRPC_CALL_CLIENT_AWAIT_REPLY:
>   		/* Received data implicitly ACKs all of the request
>   		 * packets we sent when we're acting as a client.
> @@ -1154,10 +1157,12 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb)
>   	if (hard_ack + 1 == 0)
>   		return rxrpc_proto_abort(call, 0, rxrpc_eproto_ackr_zero);
>   
> -	/* Ignore ACKs unless we are or have just been transmitting. */
> +	/* Ignore ACKs unless we are transmitting or are waiting for
> +	 * acknowledgement of the packets we've just been transmitting.
> +	 */
>   	switch (__rxrpc_call_state(call)) {
>   	case RXRPC_CALL_CLIENT_SEND_REQUEST:
> -	case RXRPC_CALL_CLIENT_AWAIT_REPLY:
> +	case RXRPC_CALL_CLIENT_AWAIT_ACK:
>   	case RXRPC_CALL_SERVER_SEND_REPLY:
>   	case RXRPC_CALL_SERVER_AWAIT_ACK:
>   		break;
> @@ -1215,7 +1220,17 @@ static void rxrpc_input_ackall(struct rxrpc_call *call, struct sk_buff *skb)
>   {
>   	struct rxrpc_ack_summary summary = { 0 };
>   
> -	if (rxrpc_rotate_tx_window(call, call->tx_top, &summary))
> +	switch (__rxrpc_call_state(call)) {
> +	case RXRPC_CALL_CLIENT_SEND_REQUEST:
> +	case RXRPC_CALL_CLIENT_AWAIT_ACK:
> +	case RXRPC_CALL_SERVER_SEND_REPLY:
> +	case RXRPC_CALL_SERVER_AWAIT_ACK:
> +		break;
> +	default:
> +		return;
> +	}
> +
> +	if (rxrpc_rotate_tx_window(call, call->tx_transmitted, &summary))
>   		rxrpc_end_tx_phase(call, false, rxrpc_eproto_unexpected_ackall);
>   }
>   
> diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
> index c35de4fd75e3..ed2c9a51005a 100644
> --- a/net/rxrpc/sendmsg.c
> +++ b/net/rxrpc/sendmsg.c
> @@ -366,7 +366,8 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
>   	if (state >= RXRPC_CALL_COMPLETE)
>   		goto maybe_error;
>   	ret = -EPROTO;
> -	if (state != RXRPC_CALL_CLIENT_SEND_REQUEST &&
> +	if (state != RXRPC_CALL_CLIENT_PRE_SEND &&
> +	    state != RXRPC_CALL_CLIENT_SEND_REQUEST &&
>   	    state != RXRPC_CALL_SERVER_ACK_REQUEST &&
>   	    state != RXRPC_CALL_SERVER_SEND_REPLY) {
>   		/* Request phase complete for this client call */
>
Thanks for the updated patch.

Reviewed-by: Jeffrey Altman <jaltman@auristor.com>





* Re: [PATCH v2 bpf-next 1/2] bpf: Support BPF_F_EGRESS with bpf_redirect_peer
From: Jiayuan Chen @ 2026-06-25  3:37 UTC (permalink / raw)
  To: Jordan Rife, bpf
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Stanislav Fomichev, Jiayuan Chen, Paul Chaignon
In-Reply-To: <20260618182035.43811-2-jordan@jrife.io>


On 6/19/26 2:20 AM, Jordan Rife wrote:
> We have several use cases where a pod injects traffic into the datapath
> of another so that the traffic appears to have originated from that
> pod. One such use case is a synthetic flow generator which injects
> synthetic traffic into a pod's datapath to enable dynamic probing and
> debugging. Another is a transparent proxy where connections originating
> from one pod are redirected towards another which proxies that
> connection. The new connection is bound to the IP of the original pod
> using IP_TRANSPARENT and its traffic is injected into that pod's
> datapath and handled as if it had originated there. This can be used for
> mTLS, etc.
>
[...]
>
>    [ ID] Interval           Transfer     Bitrate         Retr
>    [  5]   0.00-60.00  sec   272 GBytes  38.9 Gbits/sec    0       sender
>    [  5]   0.00-60.00  sec   272 GBytes  38.9 Gbits/sec            receiver
>
> In this test, using bpf_redirect_peer(BPF_F_EGRESS) for the hop from
> [iperf pod] to [pod b] led to ~18% more throughput compared to
> bpf_redirect(BPF_F_INGRESS).
>
> Signed-off-by: Jordan Rife <jordan@jrife.io>
> ---
>   include/uapi/linux/bpf.h       | 19 +++++++++++--------
>   net/core/filter.c              | 12 +++++++-----
>   tools/include/uapi/linux/bpf.h | 19 +++++++++++--------
>   3 files changed, 29 insertions(+), 21 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 89b36de5fdbb..c91b5a4bda03 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5079,17 +5079,19 @@ union bpf_attr {
>    * 	Description
>    * 		Redirect the packet to another net device of index *ifindex*.
>    * 		This helper is somewhat similar to **bpf_redirect**\ (), except
> - * 		that the redirection happens to the *ifindex*' peer device and
> - * 		the netns switch takes place from ingress to ingress without
> - * 		going through the CPU's backlog queue.
> + * 		that the redirection happens to the *ifindex*' peer device. If
> + * 		*flags* is 0, the netns switch takes place from ingress to
> + * 		ingress without going through the CPU's backlog queue. If the
> + * 		**BPF_F_EGRESS** flag is provided then redirection happens in
> + * 		the egress direction of the peer device.
>    *
>    * 		*skb*\ **->mark** and *skb*\ **->tstamp** are not cleared during
>    * 		the netns switch.
>    *
> - * 		The *flags* argument is reserved and must be 0. The helper is
> - * 		currently only supported for tc BPF program types at the
> - * 		ingress hook and for veth and netkit target device types. The
> - * 		peer device must reside in a different network namespace.
> + * 		If the *flags* argument is 0, the helper is currently only
> + * 		supported for tc BPF program types at the ingress hook and for
> + * 		veth and netkit target device types. The peer device must reside
> + * 		in a different network namespace.
>    * 	Return
>    * 		The helper returns **TC_ACT_REDIRECT** on success or
>    * 		**TC_ACT_SHOT** on error.
> @@ -6336,9 +6338,10 @@ enum {
>   /* Flags for bpf_redirect and bpf_redirect_map helpers */
>   enum {
>   	BPF_F_INGRESS		= (1ULL << 0), /* used for skb path */
> +	BPF_F_EGRESS		= (1ULL << 1), /* used for skb path */
>   	BPF_F_BROADCAST		= (1ULL << 3), /* used for XDP path */
>   	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4), /* used for XDP path */
> -#define BPF_F_REDIRECT_FLAGS (BPF_F_INGRESS | BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS)
> +#define BPF_F_REDIRECT_FLAGS (BPF_F_INGRESS | BPF_F_EGRESS | BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS)
>   };
>   

Thanks, BPF_F_EGRESS is clearer.

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>



* [PATCH v2 net] ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump
From: Pengfei Zhang @ 2026-06-25  4:41 UTC (permalink / raw)
  To: dsahern, idosch
  Cc: davem, edumazet, kuba, pabeni, horms, netdev, linux-kernel,
	chenzhangqi, baohua, Pengfei Zhang, Pengfei Zhang

From: Pengfei Zhang <zhangpengfei16@xiaomi.com>

inet6_dump_fib() saves its progress in cb->args[1] as a positional
index within the current hash chain.  Between batches the RTNL lock
is released, so a concurrent fib6_new_table() can insert a new table
at the chain head, shifting all existing entries.  The saved index
then lands on a different table, causing fib6_dump_table() to set
w->root to the wrong table while w->node still points into the
previous one.  fib6_walk_continue() dereferences w->node->parent
(NULL) and panics:

  BUG: kernel NULL pointer dereference, address: 0000000000000008
  RIP: 0010:fib6_walk_continue+0x6e/0x170
  Call Trace:
   <TASK>
   fib6_dump_table.isra.0+0xc5/0x240
   inet6_dump_fib+0xf6/0x420
   rtnl_dumpit+0x30/0xa0
   netlink_dump+0x15b/0x460
   netlink_recvmsg+0x1d6/0x2a0
   ____sys_recvmsg+0x17a/0x190

Fix by storing tb->tb6_id in cb->args[1] instead of a positional
index.  On resume, skip entries until the id matches; a concurrent
head-insert can never match the saved id, so the walker always
resumes on the correct table.
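
A minimal userspace sketch of the difference between the two resume
schemes, not the kernel code: the array stands in for one hash chain with
index 0 as the head, and the helper names resume_by_position() and
resume_by_id() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old scheme: skip saved_pos entries and resume at whatever sits there.
 * Returns 0 if the position is past the end of the chain.
 */
static uint32_t resume_by_position(const uint32_t *ids, size_t n,
				   size_t saved_pos)
{
	return saved_pos < n ? ids[saved_pos] : 0;
}

/* New scheme: skip entries until the saved id matches.
 * Returns 0 if no entry in the chain carries that id.
 */
static uint32_t resume_by_id(const uint32_t *ids, size_t n,
			     uint32_t saved_id)
{
	for (size_t i = 0; i < n; i++)
		if (ids[i] == saved_id)
			return ids[i];
	return 0;
}
```

With the chain {10, 20, 30} and a dump paused on table 20 (position 1), a
head insert of table 99 makes the positional lookup land on table 10,
while the id lookup still finds table 20.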

Fixes: 1b43af5480c3 ("[IPV6]: Increase number of possible routing tables to 2^32")
Signed-off-by: Pengfei Zhang <zhangfeionline@gmail.com>
---
 net/ipv6/ip6_fib.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index fc95738de..bda492634 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -636,11 +636,11 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	};
 	const struct nlmsghdr *nlh = cb->nlh;
 	struct net *net = sock_net(skb->sk);
-	unsigned int e = 0, s_e;
 	struct hlist_head *head;
 	struct fib6_walker *w;
 	struct fib6_table *tb;
 	unsigned int h, s_h;
+	u32 s_id;
 	int err = 0;
 
 	rcu_read_lock();
@@ -701,23 +701,22 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
 	s_h = cb->args[0];
-	s_e = cb->args[1];
+	s_id = cb->args[1];
 
-	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) {
-		e = 0;
+	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_id = 0) {
 		head = &net->ipv6.fib_table_hash[h];
 		hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
-			if (e < s_e)
-				goto next;
+			if (s_id && tb->tb6_id != s_id)
+				continue;
+			s_id = 0;
+
+			cb->args[1] = tb->tb6_id;
 			err = fib6_dump_table(tb, skb, cb);
 			if (err != 0)
 				goto out;
-next:
-			e++;
 		}
 	}
 out:
-	cb->args[1] = e;
 	cb->args[0] = h;
 
 unlock:
-- 
2.34.1



* [PATCH net v2] octeontx2-af: Block VFs from clobbering special CGX PKIND state
From: Ratheesh Kannoth @ 2026-06-25  4:46 UTC (permalink / raw)
  To: davem, gakula, linux-kernel, netdev, sgoutham
  Cc: andrew+netdev, edumazet, kuba, pabeni, Hariprasad Kelam,
	Ratheesh Kannoth

From: Hariprasad Kelam <hkelam@marvell.com>

PF and VF NIX LFs that share a CGX LMAC reuse the same hardware PKIND
programming. When HiGig2 or EDSA parsing is enabled, a VF NIX LF alloc must
not reset the LMAC RX PKIND or default TX parse config over the PF setup.

Add cgx_get_pkind() and rvu_cgx_is_pkind_config_permitted() so VFs skip
cgx_set_pkind(), rvu_npc_set_pkind(), and NIX_AF_LFX_TX_PARSE_CFG updates
when the LMAC is using NPC_RX_HIGIG_PKIND or NPC_RX_EDSA_PKIND.
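
The intended decision can be sketched in isolation as below. This is an
illustration only, assuming the policy described above: the function name
pkind_config_permitted(), its boolean parameters and the FAKE_* values are
hypothetical placeholders, not the driver's constants or API.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder values: the real NPC_RX_HIGIG_PKIND / NPC_RX_EDSA_PKIND
 * constants live in the driver headers and are not reproduced here.
 */
enum fake_pkind {
	FAKE_DEFAULT_PKIND = 0,
	FAKE_HIGIG_PKIND = 1,
	FAKE_EDSA_PKIND = 2,
};

/* Returns true when the caller may reprogram the shared PKIND state. */
static bool pkind_config_permitted(bool is_vf, bool cgx_mapped,
				   bool read_ok, int rx_pkind)
{
	if (!is_vf)		/* the PF always owns its own setup */
		return true;
	if (!cgx_mapped)	/* nothing shared with a CGX LMAC */
		return true;
	if (!read_ok)		/* cannot tell what is programmed: refuse */
		return false;

	switch (rx_pkind) {
	case FAKE_HIGIG_PKIND:
	case FAKE_EDSA_PKIND:
		return false;	/* special parse kind in use, keep it */
	default:
		return true;
	}
}
```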

Fixes: 94d942c5fb97 ("octeontx2-af: Config pkind for CGX mapped PFs")
Cc: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>

---
v1 -> v2: Addressed Simon's comments
	https://lore.kernel.org/netdev/20260619041002.1773822-1-rkannoth@marvell.com/
---
 .../net/ethernet/marvell/octeontx2/af/cgx.c   | 12 +++++++
 .../net/ethernet/marvell/octeontx2/af/cgx.h   |  1 +
 .../net/ethernet/marvell/octeontx2/af/rvu.h   |  1 +
 .../ethernet/marvell/octeontx2/af/rvu_cgx.c   | 32 +++++++++++++++++++
 .../ethernet/marvell/octeontx2/af/rvu_nix.c   | 28 +++++++++++++---
 5 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
index 2e94d5105016..f5fd6138c352 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
@@ -518,6 +518,18 @@ int cgx_set_pkind(void *cgxd, u8 lmac_id, int pkind)
 	return 0;
 }
 
+int cgx_get_pkind(void *cgxd, u8 lmac_id, int *pkind)
+{
+	struct cgx *cgx = cgxd;
+
+	if (!is_lmac_valid(cgx, lmac_id))
+		return -ENODEV;
+
+	*pkind = cgx_read(cgx, lmac_id, cgx->mac_ops->rxid_map_offset);
+	*pkind = *pkind & 0x3F;
+	return 0;
+}
+
 static u8 cgx_get_lmac_type(void *cgxd, int lmac_id)
 {
 	struct cgx *cgx = cgxd;
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
index 92ccf343dfe0..8411a75dd723 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
@@ -141,6 +141,7 @@ int cgx_get_cgxid(void *cgxd);
 int cgx_get_lmac_cnt(void *cgxd);
 void *cgx_get_pdata(int cgx_id);
 int cgx_set_pkind(void *cgxd, u8 lmac_id, int pkind);
+int cgx_get_pkind(void *cgxd, u8 lmac_id, int *pkind);
 int cgx_lmac_evh_register(struct cgx_event_cb *cb, void *cgxd, int lmac_id);
 int cgx_lmac_evh_unregister(void *cgxd, int lmac_id);
 int cgx_get_tx_stats(void *cgxd, int lmac_id, int idx, u64 *tx_stat);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index 7f3505ae6860..bb671e2150aa 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -1115,6 +1115,7 @@ void npc_read_mcam_entry(struct rvu *rvu, struct npc_mcam *mcam,
 			 u8 *intf, u8 *ena);
 int npc_config_cntr_default_entries(struct rvu *rvu, bool enable);
 bool is_cgx_config_permitted(struct rvu *rvu, u16 pcifunc);
+bool rvu_cgx_is_pkind_config_permitted(struct rvu *rvu, u16 pcifunc);
 bool is_mac_feature_supported(struct rvu *rvu, int pf, int feature);
 u32  rvu_cgx_get_fifolen(struct rvu *rvu);
 void *rvu_first_cgx_pdata(struct rvu *rvu);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
index 4ff3935ed3fe..2be1da3476ac 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
@@ -1355,3 +1355,35 @@ void rvu_mac_reset(struct rvu *rvu, u16 pcifunc)
 	if (mac_ops->mac_reset(cgxd, lmac, !is_vf(pcifunc)))
 		dev_err(rvu->dev, "Failed to reset MAC\n");
 }
+
+/* Do not allow CGX-mapped VFs to overwrite PKIND when special parse kinds
+ * (HiGig, EDSA, etc.) are in use on the shared LMAC.
+ */
+bool rvu_cgx_is_pkind_config_permitted(struct rvu *rvu, u16 pcifunc)
+{
+	int pf, err, rxpkind;
+	u8 cgx_id, lmac_id;
+	void *cgxd;
+
+	pf = rvu_get_pf(rvu->pdev, pcifunc);
+
+	if (!(pcifunc & RVU_PFVF_FUNC_MASK))
+		return true;
+
+	if (!is_pf_cgxmapped(rvu, pf))
+		return true;
+
+	rvu_get_cgx_lmac_id(rvu->pf2cgxlmac_map[pf], &cgx_id, &lmac_id);
+	cgxd = rvu_cgx_pdata(cgx_id, rvu);
+	err = cgx_get_pkind(cgxd, lmac_id, &rxpkind);
+	if (err)
+		return false;
+
+	switch (rxpkind) {
+	case NPC_RX_HIGIG_PKIND:
+	case NPC_RX_EDSA_PKIND:
+		return false;
+	default:
+		return true;
+	}
+}
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
index d8989395e875..40f5b25eafb1 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -338,6 +338,7 @@ static int nix_interface_init(struct rvu *rvu, u16 pcifunc, int type, int nixlf,
 	struct sdp_node_info *sdp_info;
 	int pkind, pf, vf, lbkid, vfid;
 	u8 cgx_id, lmac_id;
+	struct cgx *cgxd;
 	bool from_vf;
 	int err;
 
@@ -363,8 +364,15 @@ static int nix_interface_init(struct rvu *rvu, u16 pcifunc, int type, int nixlf,
 		pfvf->tx_chan_cnt = 1;
 		rsp->tx_link = cgx_id * hw->lmac_per_cgx + lmac_id;
 
-		cgx_set_pkind(rvu_cgx_pdata(cgx_id, rvu), lmac_id, pkind);
-		rvu_npc_set_pkind(rvu, pkind, pfvf);
+		cgxd = rvu_cgx_pdata(cgx_id, rvu);
+
+		mutex_lock(&cgxd->lock);
+		if (rvu_cgx_is_pkind_config_permitted(rvu, pcifunc)) {
+			cgx_set_pkind(rvu_cgx_pdata(cgx_id, rvu), lmac_id,
+				      pkind);
+			rvu_npc_set_pkind(rvu, pkind, pfvf);
+		}
+		mutex_unlock(&cgxd->lock);
 		break;
 	case NIX_INTF_TYPE_LBK:
 		vf = (pcifunc & RVU_PFVF_FUNC_MASK) - 1;
@@ -1508,7 +1516,10 @@ int rvu_mbox_handler_nix_lf_alloc(struct rvu *rvu,
 	struct rvu_block *block;
 	struct rvu_pfvf *pfvf;
 	u64 cfg, ctx_cfg;
+	struct cgx *cgxd;
 	int blkaddr;
+	u8 cgx;
+	int pf;
 
 	if (!req->rq_cnt || !req->sq_cnt || !req->cq_cnt)
 		return NIX_AF_ERR_PARAM;
@@ -1680,8 +1691,17 @@ int rvu_mbox_handler_nix_lf_alloc(struct rvu *rvu,
 	rvu_write64(rvu, blkaddr, NIX_AF_LFX_RX_CFG(nixlf), req->rx_cfg);
 
 	/* Configure pkind for TX parse config */
-	cfg = NPC_TX_DEF_PKIND;
-	rvu_write64(rvu, blkaddr, NIX_AF_LFX_TX_PARSE_CFG(nixlf), cfg);
+	if (is_pf_cgxmapped(rvu, rvu_get_pf(rvu->pdev, pcifunc))) {
+		pf = rvu_get_pf(rvu->pdev, pcifunc);
+		cgxd = rvu_cgx_pdata(cgx, rvu);
+
+		mutex_lock(&cgxd->lock);
+		if (rvu_cgx_is_pkind_config_permitted(rvu, pcifunc)) {
+			cfg = NPC_TX_DEF_PKIND;
+			rvu_write64(rvu, blkaddr, NIX_AF_LFX_TX_PARSE_CFG(nixlf), cfg);
+		}
+		mutex_unlock(&cgxd->lock);
+	}
 
 	if (is_rep_dev(rvu, pcifunc)) {
 		pfvf->tx_chan_base = RVU_SWITCH_LBK_CHAN;
-- 
2.43.0



* [PATCH net v3] igb: only strip Rx timestamp header on the first buffer of a frame
From: Tjerk Kusters via B4 Relay @ 2026-06-25  5:24 UTC (permalink / raw)
  To: Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Richard Cochran,
	Jesper Dangaard Brouer, Kurt Kanzenbach
  Cc: intel-wired-lan, netdev, linux-kernel, stable, Piotr Kwapulinski,
	Aleksandr Loktionov, Tjerk Kusters

From: Tjerk Kusters <tkusters@aweta.nl>

When Rx hardware timestamping is enabled (e.g. ptp4l, which configures
HWTSTAMP_FILTER_ALL), the NIC prepends a 16-byte timestamp header to the
first Rx buffer of every received frame. igb_clean_rx_irq() strips this
header inside its per-buffer loop:

	if (igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
		ts_hdr_len = igb_ptp_rx_pktstamp(rx_ring->q_vector,
						 pktbuf, &timestamp);
		pkt_offset += ts_hdr_len;
		size -= ts_hdr_len;
	}

For a frame that spans more than one Rx buffer (e.g. a jumbo frame), this
block runs once per buffer. The timestamp header only exists at the start
of the first buffer, but igb_ptp_rx_pktstamp() is called for every buffer.

On a continuation buffer the data is packet payload, not a timestamp
header. igb_ptp_rx_pktstamp() already has two guards against acting on a
non-header buffer: it returns 0 if PTP is disabled, and returns 0 if the
reserved dwords (the first 8 bytes) are non-zero. Neither is sufficient
here: PTP is enabled, and a continuation buffer whose payload happens to
begin with 8 zero bytes passes the reserved-dword check. In that case the
payload is mistaken for a valid timestamp header and igb_ptp_rx_pktstamp()
returns IGB_TS_HDR_LEN, so the caller strips 16 bytes of real data from
that buffer. A frame spanning N buffers whose continuation buffers start
with zero bytes therefore loses 16 * (N - 1) bytes from its tail.

This is easily triggered by a GigE Vision camera streaming dark frames
(mostly 0x00 pixel data) over jumbo UDP with PTP active on the receiver:
the all-zero frames arrive truncated while frames with non-zero content
are fine. There is no error indication.

No content-based check can reliably tell a continuation buffer that begins
with zero bytes from a real timestamp header, because both are all zero.
Fix it structurally instead: only attempt the strip on the first buffer of
a frame, which is the only buffer that can contain a timestamp header. In
igb_clean_rx_irq() skb is NULL until the first buffer has been processed,
so guarding the strip with !skb restricts it to the first buffer
regardless of payload content.
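
The 16 * (N - 1) arithmetic can be reproduced with a small userspace
model. This is a sketch under stated assumptions, not driver code:
looks_like_ts_hdr() only mimics the reserved-dword check, the `first`
flag stands in for the !skb test, and all names and the 2048-byte buffer
size are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TS_HDR_LEN 16	/* the 16-byte timestamp header described above */

/* Mimics the reserved-dword check: the first 8 bytes must be zero. */
static bool looks_like_ts_hdr(const unsigned char *buf, size_t len)
{
	static const unsigned char zero[8];

	return len >= TS_HDR_LEN && memcmp(buf, zero, sizeof(zero)) == 0;
}

/* Bytes stripped from one buffer. With guarded == true only the first
 * buffer of a frame is ever considered, as in the fix.
 */
static size_t strip_len(const unsigned char *buf, size_t len,
			bool first, bool guarded)
{
	if (guarded && !first)
		return 0;
	return looks_like_ts_hdr(buf, len) ? TS_HDR_LEN : 0;
}

/* Total bytes stripped from a frame made of nbufs all-zero buffers. */
static size_t total_stripped(size_t nbufs, bool guarded)
{
	static const unsigned char zeros[2048];
	size_t total = 0;

	for (size_t i = 0; i < nbufs; i++)
		total += strip_len(zeros, sizeof(zeros), i == 0, guarded);
	return total;
}
```

For a three-buffer all-zero frame the unguarded model strips 48 bytes and
the guarded one strips 16, so the difference of 32 bytes is the payload
lost, matching 16 * (3 - 1).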

Fixes: 5379260852b0 ("igb: Fix XDP with PTP enabled")
Cc: stable@vger.kernel.org
Reviewed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Tjerk Kusters <tkusters@aweta.nl>
---
Changes in v3:
- update the rx-timestamp comment to note it only applies to the first
  buffer of a frame (Piotr Kwapulinski)
- add Reviewed-by from Aleksandr Loktionov and Piotr Kwapulinski
- no functional change
- Link to v2: https://patch.msgid.link/20260619-igb-rx-ts-fix-v2-1-d3b8d605ca62@aweta.nl

Changes in v2:
 - resend via b4 (v1 was sent with a mail client)
 - use full author name "Tjerk Kusters" (Jacob Keller)
 - add Reviewed-by from Kurt Kanzenbach
 - no functional change

Link to v1: https://lore.kernel.org/all/PAWPR05MB1069106D52F4E17F1EDB99C67B9182@PAWPR05MB10691.eurprd05.prod.outlook.com/
---
 drivers/net/ethernet/intel/igb/igb_main.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index ce91dda00ec0..539bf5389a24 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -9060,8 +9060,11 @@ static int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
 		rx_buffer = igb_get_rx_buffer(rx_ring, size, &rx_buf_pgcnt);
 		pktbuf = page_address(rx_buffer->page) + rx_buffer->page_offset;
 
-		/* pull rx packet timestamp if available and valid */
-		if (igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
+		/* pull rx packet timestamp if available and valid; it is only
+		 * present on the first buffer of a frame
+		 */
+		if (!skb &&
+		    igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
 			int ts_hdr_len;
 
 			ts_hdr_len = igb_ptp_rx_pktstamp(rx_ring->q_vector,

---
base-commit: 2d3090a8aeb596a26935db0955d46c9a5db5c6ce
change-id: 20260619-igb-rx-ts-fix-cd70585ee316

Best regards,
--  
Tjerk Kusters <tkusters@aweta.nl>




* Re: [PATCH net v2] octeontx2-af: Block VFs from clobbering special CGX PKIND state
From: Ratheesh Kannoth @ 2026-06-25  5:25 UTC (permalink / raw)
  To: davem, gakula, linux-kernel, netdev, sgoutham
  Cc: andrew+netdev, edumazet, kuba, pabeni, Hariprasad Kelam
In-Reply-To: <20260625044621.2841831-1-rkannoth@marvell.com>

On 2026-06-25 at 10:16:21, Ratheesh Kannoth (rkannoth@marvell.com) wrote:
> From: Hariprasad Kelam <hkelam@marvell.com>
>
> PF and VF NIX LFs that share a CGX LMAC reuse the same hardware PKIND
> programming. When HiGig2 or EDSA parsing is enabled, a VF NIX LF alloc must
> not reset the LMAC RX PKIND or default TX parse config over the PF setup.
>
> Add cgx_get_pkind() and rvu_cgx_is_pkind_config_permitted() so VFs skip
> cgx_set_pkind(), rvu_npc_set_pkind(), and NIX_AF_LFX_TX_PARSE_CFG updates
> when the LMAC is using NPC_RX_HIGIG_PKIND or NPC_RX_EDSA_PKIND.
>
> Fixes: 94d942c5fb97 ("octeontx2-af: Config pkind for CGX mapped PFs")
> Cc: Geetha sowjanya <gakula@marvell.com>
> Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
>
> ---
> v1 -> v2: Addressed Simon's comments
> 	https://lore.kernel.org/netdev/20260619041002.1773822-1-rkannoth@marvell.com/
> ---

Apologies for the inconvenience; it appears I submitted an incorrect patch.
I will abandon it and post a revised one later. Thanks.

pw-bot: changes-requested


* Re: [PATCH net 1/4] net: turn the rx_mode work into a generic netdev_work facility
From: Kuniyuki Iwashima @ 2026-06-25  5:55 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, nb, aleksandr.loktionov,
	dtatulea
In-Reply-To: <20260624182018.2445732-2-kuba@kernel.org>

On Wed, Jun 24, 2026 at 11:20 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> The rx_mode update runs from a workqueue: drivers have their
> ndo_set_rx_mode_async() callback executed by a single global
> work item under RTNL and ops lock. This is a useful pattern.
>
> Support multiple "events" that need to be serviced and make RX_MODE
> sync the first one. Call the events "core" because later on
> we will let drivers define and schedule their own.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Oh, very nice!

I was drafting almost the same change for dev_set_rx_mode()
in the mcast path and some ipvlan changes.

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>


* [PATCH v2 1/1] xfrm: nat_keepalive: avoid double free on send error
From: Ren Wei @ 2026-06-25  5:55 UTC (permalink / raw)
  To: netdev
  Cc: steffen.klassert, herbert, davem, eyal.birger, yuantan098, bird,
	qianyuluo3, n05ec

From: Qianyu Luo <qianyuluo3@gmail.com>

nat_keepalive_send() frees the keepalive skb whenever the IPv4 or IPv6
send helper reports an error.

That cleanup is only correct before the skb is handed to the output
path. Once ip_build_and_send_pkt() or ip6_xmit() takes ownership, the
networking stack may already have consumed the skb before returning an
error, so freeing it again is unsafe.

Handle the pre-handoff failure cases inside nat_keepalive_send_ipv4()
and nat_keepalive_send_ipv6(), where the caller still owns the skb, and
keep nat_keepalive_send() responsible only for family dispatch and the
unsupported-family cleanup path.
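
The ownership rule can be illustrated with a toy model that counts frees.
It is a sketch only: struct fake_skb, fake_xmit(), old_send() and
new_send() are hypothetical names, and the model assumes the output path
always consumes the buffer, whereas the real stack may or may not have
done so by the time it returns an error.

```c
#include <assert.h>
#include <stdbool.h>

struct fake_skb {
	int frees;	/* how many times the buffer was freed */
};

static void fake_free(struct fake_skb *skb)
{
	skb->frees++;
}

/* Models the output path: it takes ownership and consumes the buffer,
 * even when it goes on to report an error.
 */
static int fake_xmit(struct fake_skb *skb, bool fail)
{
	fake_free(skb);
	return fail ? -1 : 0;
}

/* Old pattern: the caller frees on any error, including errors reported
 * after the buffer was already handed off.
 */
static int old_send(struct fake_skb *skb, bool route_fail, bool xmit_fail)
{
	int err = route_fail ? -1 : fake_xmit(skb, xmit_fail);

	if (err)
		fake_free(skb);
	return err;
}

/* New pattern: free only on the pre-handoff failure, where the caller
 * still owns the buffer.
 */
static int new_send(struct fake_skb *skb, bool route_fail, bool xmit_fail)
{
	if (route_fail) {
		fake_free(skb);
		return -1;
	}
	return fake_xmit(skb, xmit_fail);
}
```

In this model a transmit error yields two frees with old_send() and
exactly one with new_send(); a route lookup failure yields one free in
both.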

Fixes: f531d13bdfe3 ("xfrm: support sending NAT keepalives in ESP in UDP states")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Qianyu Luo <qianyuluo3@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
Changes in v2:
- move kfree_skb() after local_unlock_nested_bh() in the IPv6 dst-lookup
  failure path as suggested in review
- rebase onto latest netdev/net

Link: https://lore.kernel.org/all/46eb334399ce0e25e0897b42f21020541d159300.1781788385.git.qianyuluo3@gmail.com/

 net/xfrm/xfrm_nat_keepalive.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/xfrm/xfrm_nat_keepalive.c b/net/xfrm/xfrm_nat_keepalive.c
index 458931062a04..eb1b6f67739e 100644
--- a/net/xfrm/xfrm_nat_keepalive.c
+++ b/net/xfrm/xfrm_nat_keepalive.c
@@ -55,8 +55,10 @@ static int nat_keepalive_send_ipv4(struct sk_buff *skb,
 			   ka->encap_sport, sock_net_uid(net, NULL));
 
 	rt = ip_route_output_key(net, &fl4);
-	if (IS_ERR(rt))
+	if (IS_ERR(rt)) {
+		kfree_skb(skb);
 		return PTR_ERR(rt);
+	}
 
 	skb_dst_set(skb, &rt->dst);
 
@@ -101,6 +103,7 @@ static int nat_keepalive_send_ipv6(struct sk_buff *skb,
 	dst = ip6_dst_lookup_flow(net, sk, &fl6, NULL);
 	if (IS_ERR(dst)) {
 		local_unlock_nested_bh(&nat_keepalive_sk_ipv6.bh_lock);
+		kfree_skb(skb);
 		return PTR_ERR(dst);
 	}
 
@@ -118,7 +121,6 @@ static void nat_keepalive_send(struct nat_keepalive *ka)
 					sizeof(struct ipv6hdr)) +
 				    sizeof(struct udphdr);
 	const u8 nat_ka_payload = 0xFF;
-	int err = -EAFNOSUPPORT;
 	struct sk_buff *skb;
 	struct udphdr *uh;
 
@@ -140,16 +142,17 @@ static void nat_keepalive_send(struct nat_keepalive *ka)
 
 	switch (ka->family) {
 	case AF_INET:
-		err = nat_keepalive_send_ipv4(skb, ka);
+		nat_keepalive_send_ipv4(skb, ka);
 		break;
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		err = nat_keepalive_send_ipv6(skb, ka, uh);
+		nat_keepalive_send_ipv6(skb, ka, uh);
 		break;
 #endif
-	}
-	if (err)
+	default:
 		kfree_skb(skb);
+		break;
+	}
 }
 
 struct nat_keepalive_work_ctx {
-- 
2.43.7


^ permalink raw reply related

* Re: [PATCH net 2/4] net: add the driver-facing netdev_work scheduling API
From: Kuniyuki Iwashima @ 2026-06-25  5:55 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, nb, aleksandr.loktionov,
	dtatulea
In-Reply-To: <20260624182018.2445732-3-kuba@kernel.org>

On Wed, Jun 24, 2026 at 11:20 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> With an extra event mask we can easily extend the netdev work
> to also service driver-defined events. For advanced drivers
> this is probably not a perfect match, but it makes running
> deferred work easier in simple cases.
>
> Expose the netdev_work facility to drivers. Add helpers
> to schedule work and a dedicated ndo to perform the driver-
> -scheduled actions.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

^ permalink raw reply

* Re: [PATCH net 3/4] vlan: defer real device state propagation to netdev_work
From: Kuniyuki Iwashima @ 2026-06-25  5:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, nb, aleksandr.loktionov,
	dtatulea, syzbot+09da62a8b78959ceb8bb,
	syzbot+cb67c392b0b8f0fd0fc1, syzbot+9bb8bd77f3966641f298
In-Reply-To: <20260624182018.2445732-4-kuba@kernel.org>

On Wed, Jun 24, 2026 at 11:20 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> vlan_device_event() generates nested UP/DOWN, MTU and feature
> change events. It executes an event for the VLAN device directly
> from the notifier - while the locks of the lower device are held.
>
> This causes deadlocks, for example:
>
>   bond    (3) bond_update_speed_duplex(vlan)
>     |           ^                v
>   vlan    (2) UP(vlan)    (4) vlan_ethtool_get_link_ksettings()
>     |           ^                v
>   dummy   (1) UP(dummy)   (5) __ethtool_get_link_ksettings()
>
> The dummy device is ops locked, vlan creates a nested event (2),
> then bond wants to ask vlan for link state (3). bond uses the
> "I'm already holding the instance lock" flavor of API. But in
> this case the lock held refers to vlan itself. We hit vlan's
> link settings trampoline (4) and call __ethtool_get_link_ksettings()
> which tries to lock dummy. Deadlock. There's no clean way for us
> to tell the vlan_ethtool_get_link_ksettings() that the caller
> is already in lower device's critical section.
>
> Defer the propagation to the per-netdev work facility instead:
> the notifier only schedules netdev_work_sched(vlandev, VLAN_WORK_*),
> and ndo_work (vlan_dev_work) applies the change later. Hopefully
> nobody expects the VLAN state changes to be instantaneous.
>
> If someone does expect the changes to be instantaneous we will
> have to do the same thing Stan did for rx_mode and "strategically"
> place sync calls, to make sure such delayed works are executed
> after we drop the ops lock but before we drop rtnl_lock.
>
> Stan suggests that if we need that down the line we may
> consider reshaping the mechanism into "async notifications".
> AFAICT only vlan does this sort of netdev open chaining,
> so as a first try I think that sticking the complexity into
> the vlan code makes sense.
>
> One corner case is that we need to cancel the event if user
> explicitly changes the state before work could run. Consider
> the following operations with vlan0 on top of dummy0:
>
>   ip link set dev dummy0 up    # queues work to up vlan0
>   ip link set dev vlan0 down   # user explicitly downs the vlan
>   ndo_work                     # acts on the stale event
>
> Reported-by: syzbot+09da62a8b78959ceb8bb@syzkaller.appspotmail.com
> Reported-by: syzbot+cb67c392b0b8f0fd0fc1@syzkaller.appspotmail.com
> Reported-by: syzbot+9bb8bd77f3966641f298@syzkaller.appspotmail.com
> Fixes: 9f275c2e9020 ("net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

Thanks!

^ permalink raw reply

* [RFC PATCH] net/iucv: Descend into net/iucv when AFIUCV is enabled
From: Pengpeng Hou @ 2026-06-25  6:13 UTC (permalink / raw)
  To: Alexandra Winter, Thorsten Winkler, David Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Heiko Carstens, linux-s390, netdev, linux-kernel,
	Pengpeng Hou

AFIUCV can be enabled by the QETH_L3/HiperSockets path even when IUCV
itself is not enabled.  However, the top-level net Makefile only descends
into net/iucv/ under CONFIG_IUCV.

That creates a Kconfig/Kbuild carrier mismatch: CONFIG_AFIUCV=m can be
selected, but af_iucv.o is never considered because the containing
directory is skipped.

This RFC uses an always-descend model for net/iucv/.  The subdirectory
Makefile already gates iucv.o and af_iucv.o on their own Kconfig symbols,
so entering the directory does not force either provider object on.

This is intentionally an RFC because s390 maintainers should confirm
whether the QETH_L3-only AF_IUCV configuration is intended to build
af_iucv.o without the base IUCV object.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
 net/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/Makefile b/net/Makefile
--- a/net/Makefile
+++ b/net/Makefile
@@ -45,7 +45,7 @@
 obj-$(CONFIG_MAC80211)		+= mac80211/
 obj-$(CONFIG_TIPC)		+= tipc/
 obj-$(CONFIG_NETLABEL)		+= netlabel/
-obj-$(CONFIG_IUCV)		+= iucv/
+obj-y				+= iucv/
 obj-$(CONFIG_SMC)		+= smc/
 obj-$(CONFIG_RFKILL)		+= rfkill/
 obj-$(CONFIG_NET_9P)		+= 9p/
-- 
2.39.5


^ permalink raw reply

* Re: [PATCH v2 net] ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump
From: Kuniyuki Iwashima @ 2026-06-25  6:15 UTC (permalink / raw)
  To: zhangfeionline
  Cc: baohua, chenzhangqi, davem, dsahern, edumazet, horms, idosch,
	kuba, linux-kernel, netdev, pabeni, zhangpengfei16
In-Reply-To: <20260625044101.939070-1-zhangfeionline@gmail.com>

From: Pengfei Zhang <zhangfeionline@gmail.com>
Date: Thu, 25 Jun 2026 12:41:01 +0800
> From: Pengfei Zhang <zhangpengfei16@xiaomi.com>
> 
> inet6_dump_fib() saves its progress in cb->args[1] as a positional
> index within the current hash chain.  Between batches the RTNL lock
> is released,

nit: RTNL has been removed from the IPv6 FIB, so simply say something like

  Between batches, a concurrent fib6_new_table() can insert ...

> so a concurrent fib6_new_table() can insert a new table
> at the chain head, shifting all existing entries.  The saved index
> then lands on a different table, causing fib6_dump_table() to set
> w->root to the wrong table while w->node still points into the
> previous one.  fib6_walk_continue() dereferences w->node->parent
> (NULL) and panics:
> 
>   BUG: kernel NULL pointer dereference, address: 0000000000000008
>   RIP: 0010:fib6_walk_continue+0x6e/0x170
>   Call Trace:
>    <TASK>
>    fib6_dump_table.isra.0+0xc5/0x240
>    inet6_dump_fib+0xf6/0x420
>    rtnl_dumpit+0x30/0xa0
>    netlink_dump+0x15b/0x460
>    netlink_recvmsg+0x1d6/0x2a0
>    ____sys_recvmsg+0x17a/0x190
> 
> Fix by storing tb->tb6_id in cb->args[1] instead of a positional
> index.  On resume, skip entries until the id matches; a concurrent
> head-insert can never match the saved id, so the walker always
> resumes on the correct table.
> 
> Fixes: 1b43af5480c3 ("[IPV6]: Increase number of possible routing tables to 2^32")
> Signed-off-by: Pengfei Zhang <zhangfeionline@gmail.com>

SOB does not match the Author of the patch (the first From: line).


> ---
>  net/ipv6/ip6_fib.c | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index fc95738de..bda492634 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -636,11 +636,11 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
>  	};
>  	const struct nlmsghdr *nlh = cb->nlh;
>  	struct net *net = sock_net(skb->sk);
> -	unsigned int e = 0, s_e;
>  	struct hlist_head *head;
>  	struct fib6_walker *w;
>  	struct fib6_table *tb;
>  	unsigned int h, s_h;
> +	u32 s_id;

nit: please keep the reverse xmas tree order.
https://docs.kernel.org/7.1/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs


>  	int err = 0;
>  
>  	rcu_read_lock();
> @@ -701,23 +701,22 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
>  	}
>  
>  	s_h = cb->args[0];
> -	s_e = cb->args[1];
> +	s_id = cb->args[1];
>  
> -	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) {
> -		e = 0;
> +	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_id = 0) {
>  		head = &net->ipv6.fib_table_hash[h];
>  		hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
> -			if (e < s_e)
> -				goto next;
> +			if (s_id && tb->tb6_id != s_id)
> +				continue;
> +			s_id = 0;
> +
> +			cb->args[1] = tb->tb6_id;
>  			err = fib6_dump_table(tb, skb, cb);
>  			if (err != 0)
>  				goto out;
> -next:
> -			e++;
>  		}
>  	}
>  out:
> -	cb->args[1] = e;
>  	cb->args[0] = h;
>  
>  unlock:
> -- 
> 2.34.1

^ permalink raw reply

* [PATCH net v3] net: wwan: iosm: bound device offsets in the MUX downlink decoder
From: Maoyi Xie @ 2026-06-25  6:17 UTC (permalink / raw)
  To: Loic Poulain, Sergey Ryazanov, Johannes Berg
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel, stable

mux_dl_adb_decode() walks a chain of aggregated datagram tables using
offsets and lengths taken from the modem. first_table_index,
next_table_index, table_length, datagram_index and datagram_length are
all device-supplied little-endian values. Only first_table_index was
checked, and only for being nonzero. The decoder then formed adth = block +
adth_index and read the table header and the datagram entries with no
bound against the received skb. A modem that reports an index or a
length past the downlink buffer makes the decoder read out of bounds.

The buffer is IPC_MEM_MAX_DL_MUX_LITE_BUF_SIZE and skb->len is at most
that, so skb->len is the real limit, but none of these in-band offsets
were checked against it.

The table chain is also followed with no forward progress check. The loop
takes the next table from adth->next_table_index and stops only when that
reaches zero. A modem can stage two tables that point at each other, so
the loop never ends. It runs in softirq and clones the skb on every pass.

Validate every device offset and length against skb->len before use.
The block header must fit. Each table header, on entry and after every
next_table_index, must lie inside the skb. The datagram table must fit.
Each datagram index and length must stay inside the skb. The header
padding must not exceed the datagram length so the receive length does
not wrap. Require each next_table_index to move forward so the chain
cannot cycle.
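
As a rough illustration of the table-chain checks listed above, here is a
self-contained userspace sketch. The layout is hypothetical (struct toy_table,
TOY_HDR_LEN, toy_put() and toy_walk() are made up for this example and do not
match the real mux_adbh/mux_adth structures), and it uses host byte order
instead of le32/le16 conversions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table header; next_index == 0 ends the chain. */
struct toy_table {
	uint32_t next_index;
	uint16_t table_length;
	uint16_t reserved;
};

#define TOY_HDR_LEN 8u	/* hypothetical block header size */

/* Test helper: write a table header at offset idx. */
static void toy_put(uint8_t *buf, uint32_t idx, uint32_t next, uint16_t len)
{
	struct toy_table t = { .next_index = next, .table_length = len };

	memcpy(buf + idx, &t, sizeof(t));
}

/* Returns the number of tables walked, or -1 if the chain is malformed. */
static int toy_walk(const uint8_t *buf, size_t len, uint32_t first)
{
	uint32_t idx = first, prev = 0;
	int count = 0;

	/* The block header and at least one table header must fit. */
	if (len < TOY_HDR_LEN || len < sizeof(struct toy_table))
		return -1;

	while (idx) {
		struct toy_table t;

		/* Must move forward, clear the header, and stay in bounds. */
		if (idx <= prev || idx < TOY_HDR_LEN ||
		    idx > len - sizeof(t))
			return -1;
		prev = idx;

		memcpy(&t, buf + idx, sizeof(t));	/* no unaligned reads */

		/* The whole table must fit in the remaining bytes. */
		if (t.table_length < sizeof(t) || t.table_length > len - idx)
			return -1;

		count++;
		idx = t.next_index;
	}
	return count;
}
```

Because each index must be strictly greater than the previous one, the loop
runs at most once per byte offset and two tables pointing at each other are
rejected on the second hop.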

This was reproduced under KASAN as a slab out of bounds read on a normal
downlink receive once the iosm net device is up.

Fixes: 1f52d7b62285 ("net: wwan: iosm: Enable M.2 7360 WWAN card support")
Suggested-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
---
Changes in v3:
- Also require next_table_index to move strictly forward, so a modem
  cannot point two tables at each other and spin the decode loop in
  softirq. Raised in review of v2.

Link to v1: https://lore.kernel.org/all/178185979029.4044562.9993615975949055530@maoyixie.com/
Link to v2: https://lore.kernel.org/all/178196118045.462404.11069139160448641355@maoyixie.com/

 drivers/net/wwan/iosm/iosm_ipc_mux_codec.c |   40 +++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/net/wwan/iosm/iosm_ipc_mux_codec.c b/drivers/net/wwan/iosm/iosm_ipc_mux_codec.c
index bff46f7ca59f..0bbd41263cc2 100644
--- a/drivers/net/wwan/iosm/iosm_ipc_mux_codec.c
+++ b/drivers/net/wwan/iosm/iosm_ipc_mux_codec.c
@@ -553,19 +553,21 @@ static int mux_dl_process_dg(struct iosm_mux *ipc_mux, struct mux_adbh *adbh,
 	u32 packet_offset, i, rc, dg_len;
 
 	for (i = 0; i < nr_of_dg; i++, dg++) {
-		if (le32_to_cpu(dg->datagram_index)
-				< sizeof(struct mux_adbh))
+		u32 dg_index = le32_to_cpu(dg->datagram_index);
+
+		dg_len = le16_to_cpu(dg->datagram_length);
+
+		if (dg_index < sizeof(struct mux_adbh))
 			goto dg_error;
 
-		/* Is the packet inside of the ADB */
-		if (le32_to_cpu(dg->datagram_index) >=
-					le32_to_cpu(adbh->block_length)) {
+		/* Is the packet inside of the ADB and the received skb ? */
+		if (dg_index >= le32_to_cpu(adbh->block_length) ||
+		    dg_index >= skb->len ||
+		    dg_len > skb->len - dg_index ||
+		    dl_head_pad_len >= dg_len) {
 			goto dg_error;
 		} else {
-			packet_offset =
-				le32_to_cpu(dg->datagram_index) +
-				dl_head_pad_len;
-			dg_len = le16_to_cpu(dg->datagram_length);
+			packet_offset = dg_index + dl_head_pad_len;
 			/* Pass the packet to the netif layer. */
 			rc = ipc_mux_net_receive(ipc_mux, if_id, ipc_mux->wwan,
 						 packet_offset,
@@ -589,12 +591,16 @@ static void mux_dl_adb_decode(struct iosm_mux *ipc_mux,
 	struct mux_adbh *adbh;
 	struct mux_adth *adth;
 	int nr_of_dg, if_id;
-	u32 adth_index;
+	u32 adth_index, prev_index = 0;
 	u8 *block;
 
 	block = skb->data;
 	adbh = (struct mux_adbh *)block;
 
+	/* The block header itself must fit in the received skb. */
+	if (skb->len < sizeof(struct mux_adbh))
+		goto adb_decode_err;
+
 	/* Process the aggregated datagram tables. */
 	adth_index = le32_to_cpu(adbh->first_table_index);
 
@@ -606,6 +612,16 @@ static void mux_dl_adb_decode(struct iosm_mux *ipc_mux,
 
 	/* Loop through mixed session tables. */
 	while (adth_index) {
+		/* The table header must lie within the received skb, and the
+		 * chain must move forward so a modem cannot make the loop
+		 * cycle between two tables.
+		 */
+		if (adth_index <= prev_index ||
+		    adth_index < sizeof(struct mux_adbh) ||
+		    adth_index > skb->len - sizeof(struct mux_adth))
+			goto adb_decode_err;
+		prev_index = adth_index;
+
 		/* Get the reference to the table header. */
 		adth = (struct mux_adth *)(block + adth_index);
 
@@ -629,6 +645,10 @@ static void mux_dl_adb_decode(struct iosm_mux *ipc_mux,
 		if (le16_to_cpu(adth->table_length) < sizeof(struct mux_adth))
 			goto adb_decode_err;
 
+		/* The whole datagram table must fit in the received skb. */
+		if (le16_to_cpu(adth->table_length) > skb->len - adth_index)
+			goto adb_decode_err;
+
 		/* Calculate the number of datagrams. */
 		nr_of_dg = (le16_to_cpu(adth->table_length) -
 					sizeof(struct mux_adth)) /
-- 
2.34.1

^ permalink raw reply related

* Re: [PATCH net] net: udp_tunnel: fix use-after-free by refcounting udp_tunnel_nic
From: Eric Dumazet @ 2026-06-25  6:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiayuan Chen, David S . Miller, Paolo Abeni, Simon Horman,
	Ido Schimmel, David Ahern, netdev, eric.dumazet, Yue Sun
In-Reply-To: <20260624195521.5972a5a8@kernel.org>

On Wed, Jun 24, 2026 at 7:55 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 25 Jun 2026 10:47:09 +0800 Jiayuan Chen wrote:
> > On 6/25/26 5:57 AM, Jakub Kicinski wrote:
> > > On Wed, 24 Jun 2026 17:10:34 +0000 Eric Dumazet wrote:
> > >> Yue Sun reported a use-after-free and debugobjects warning in
> > >> udp_tunnel_nic_device_sync_work() during concurrent device operations.
> > >>
> > >> The state flags of struct udp_tunnel_nic were originally bitfields
> > >> sharing a byte, modified concurrently without locking (RCU vs worker).
> > > Can you clarify the path where the bits are modified without locks??
> > > My mental model is that this is basically all under rtnl_lock, and
> > > Stan added _another_ lock so that drivers can call "sync" / reply
> > > without needing rtnl lock, but any changes are still under rtnl_lock.
> > >
> > > The gap seems to be that we don't check pending under Stan's new lock,
> > > since commit 1ead7501094c6 ("udp_tunnel: remove rtnl_lock dependency")
> > > did:
> >
> >
> > I think the real problem is that a single work_pending flag can't track
> > the work being queued twice:
> >
> > 1. Thread A calls queue_work() -> work_pending = 1.
> > 2. The worker gets picked up; workqueue clears the PENDING(internal work
> > queue flag) bit before running the work function.
> >     The worker then blocks on rtnl/utn->lock.
> > 3. Thread B calls queue_work() again. Since PENDING was already cleared,
> > it enqueues a second
> >     instance and sets work_pending = 1.
> > 4. A's worker finally gets the lock and does work_pending = 0, runs,
> > returns.
> > 5. Now work_pending == 0 but B's instance is still queued. unregister
> > sees 0, frees utn.
>
> Ah, thanks, now I get it. Claude told me the same thing but in 10,000
> words and I lost the thread before reading 'til the end...
>
> In that case:
>
> diff --git a/net/ipv4/udp_tunnel_nic.c b/net/ipv4/udp_tunnel_nic.c
> index 9944ed923ddf..3b32a0afa979 100644
> --- a/net/ipv4/udp_tunnel_nic.c
> +++ b/net/ipv4/udp_tunnel_nic.c
> @@ -301,7 +301,7 @@ __udp_tunnel_nic_device_sync(struct net_device *dev, struct udp_tunnel_nic *utn)
>  static void
>  udp_tunnel_nic_device_sync(struct net_device *dev, struct udp_tunnel_nic *utn)
>  {
> -       if (!utn->need_sync)
> +       if (!utn->need_sync || utn->work_pending)
>                 return;
>
>         queue_work(udp_tunnel_nic_workqueue, &utn->work);

Yep, this should do it. I will send a V2 with your suggestion.

I will also send a separate patch for the ->missed part, since that bug
came after Stan's commit.
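
For readers following along, the five quoted steps can be replayed in a tiny
userspace model. This is only a sketch: toy_utn, toy_sync() and
toy_worker_finish() are hypothetical names, there is no real concurrency, and
the model only covers the window after the worker has already been dequeued
(where the workqueue's internal PENDING bit no longer blocks a second
queue_work()):

```c
#include <assert.h>

/* Toy stand-in for the relevant udp_tunnel_nic state. */
struct toy_utn {
	int need_sync;
	int work_pending;	/* the driver-side flag checked at unregister */
	int outstanding;	/* work instances that will still touch utn */
};

/* check_pending selects the behavior with the suggested extra test. */
static void toy_sync(struct toy_utn *u, int check_pending)
{
	if (!u->need_sync || (check_pending && u->work_pending))
		return;

	u->outstanding++;	/* models a successful queue_work() */
	u->work_pending = 1;
}

/* One worker instance takes the lock, clears the flag and completes. */
static void toy_worker_finish(struct toy_utn *u)
{
	u->work_pending = 0;
	u->outstanding--;
}
```

Calling toy_sync() twice and toy_worker_finish() once leaves outstanding at 1
with work_pending at 0 when check_pending is 0, which is the state in which
unregister would free the structure too early. With check_pending set, the
second call returns early and both fields end at 0.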

^ permalink raw reply

* [PATCH net v3] net: ti: icssg-prueth: fix XDP_TX from the AF_XDP zero-copy RX path
From: David Carlier @ 2026-06-25  6:31 UTC (permalink / raw)
  To: danishanwar, rogerq, andrew+netdev, netdev
  Cc: davem, edumazet, kuba, pabeni, horms, m-malladi, hawk,
	john.fastabend, sdf, ast, daniel, bpf, linux-arm-kernel,
	linux-kernel, stable, David Carlier

On XDP_TX from the zero-copy RX path, emac_run_xdp() converts the xsk
buffer via xdp_convert_zc_to_xdp_frame(), which clones the data into a
fresh MEM_TYPE_PAGE_ORDER0 page that is not DMA mapped. Transmitting it
as PRUETH_TX_BUFF_TYPE_XDP_TX derives the DMA address with
page_pool_get_dma_addr(), reading an uninitialized page->dma_addr, so
the device DMAs from a bogus address (corrupt TX, or an IOMMU fault).

Pick the TX buffer type from the frame's memory type: keep
PRUETH_TX_BUFF_TYPE_XDP_TX for page_pool frames and use
PRUETH_TX_BUFF_TYPE_XDP_NDO for the cloned zero-copy frame, which is then
DMA mapped through the NDO path and unmapped on completion.

While at it, fix the page_pool XDP_TX completion path. A
PRUETH_TX_BUFF_TYPE_XDP_TX frame carries a page_pool-owned DMA mapping
(established against rx_chn->dma_dev), yet prueth_xmit_free()
unconditionally calls dma_unmap_single() on it with tx_chn->dma_dev,
tearing down a mapping the driver does not own; xdp_return_frame()
already recycles the page back to the pool. Tag such frames with a
dedicated PRUETH_SWDATA_XDPF_TX type so the completion path skips the
unmap, the same way PRUETH_SWDATA_XSK buffers are handled.
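
The selection logic above can be summarized in a short standalone sketch. The
enums and helpers below are hypothetical stand-ins (toy_* names are not the
driver's identifiers); they only mirror the decision chain of memory type to
TX buffer type to completion handling:

```c
#include <assert.h>

enum toy_mem_type { TOY_MEM_PAGE_ORDER0, TOY_MEM_PAGE_POOL };
enum toy_tx_type { TOY_TX_XDP_TX, TOY_TX_XDP_NDO };
enum toy_swdata { TOY_SWDATA_XDPF, TOY_SWDATA_XDPF_TX };

/* Only page_pool-backed frames already carry a usable DMA address. */
static enum toy_tx_type toy_pick_tx_type(enum toy_mem_type mem)
{
	return mem == TOY_MEM_PAGE_POOL ? TOY_TX_XDP_TX : TOY_TX_XDP_NDO;
}

/* Tag the descriptor so the completion path knows who owns the mapping. */
static enum toy_swdata toy_pick_swdata(enum toy_tx_type type)
{
	return type == TOY_TX_XDP_TX ? TOY_SWDATA_XDPF_TX : TOY_SWDATA_XDPF;
}

/* Only mappings created on the transmit path are unmapped on completion. */
static int toy_needs_unmap(enum toy_swdata swdata)
{
	return swdata == TOY_SWDATA_XDPF;
}
```

In this model a cloned zero-copy frame ends up mapped and later unmapped by
the transmit side, while a page_pool frame is never unmapped there.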

Fixes: 7a64bb388df3 ("net: ti: icssg-prueth: Add AF_XDP zero copy for RX")
Fixes: 62aa3246f462 ("net: ti: icssg-prueth: Add XDP support")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Meghana Malladi <m-malladi@ti.com>
---
v3:
 - address Meghana Malladi review nits: split the prueth_xmit_free()
   guard to stay under 80 columns, parenthesize the swdata->type
   ternary (and the matching tx_buff_type one for consistency).
 - no functional change; carry Reviewed-by.
v2: https://lore.kernel.org/netdev/20260623112225.303930-1-devnexen@gmail.com
v1: https://lore.kernel.org/netdev/20260620213756.87499-1-devnexen@gmail.com
 drivers/net/ethernet/ti/icssg/icssg_common.c | 21 +++++++++++++++++---
 drivers/net/ethernet/ti/icssg/icssg_prueth.h |  1 +
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ti/icssg/icssg_common.c b/drivers/net/ethernet/ti/icssg/icssg_common.c
index 82ddef9c17d5..64ae3704481e 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_common.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_common.c
@@ -185,7 +185,8 @@ void prueth_xmit_free(struct prueth_tx_chn *tx_chn,
 	first_desc = desc;
 	next_desc = first_desc;
 	swdata = cppi5_hdesc_get_swdata(first_desc);
-	if (swdata->type == PRUETH_SWDATA_XSK)
+	if (swdata->type == PRUETH_SWDATA_XSK ||
+	    swdata->type == PRUETH_SWDATA_XDPF_TX)
 		goto free_pool;
 
 	cppi5_hdesc_get_obuf(first_desc, &buf_dma, &buf_dma_len);
@@ -259,6 +260,7 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
 			napi_consume_skb(skb, budget);
 			break;
 		case PRUETH_SWDATA_XDPF:
+		case PRUETH_SWDATA_XDPF_TX:
 			xdpf = swdata->data.xdpf;
 			dev_sw_netstats_tx_add(ndev, 1, xdpf->len);
 			total_bytes += xdpf->len;
@@ -769,7 +771,8 @@ u32 emac_xmit_xdp_frame(struct prueth_emac *emac,
 	k3_udma_glue_tx_dma_to_cppi5_addr(tx_chn->tx_chn, &buf_dma);
 	cppi5_hdesc_attach_buf(first_desc, buf_dma, xdpf->len, buf_dma, xdpf->len);
 	swdata = cppi5_hdesc_get_swdata(first_desc);
-	swdata->type = PRUETH_SWDATA_XDPF;
+	swdata->type = (buff_type == PRUETH_TX_BUFF_TYPE_XDP_TX ?
+		PRUETH_SWDATA_XDPF_TX : PRUETH_SWDATA_XDPF);
 	swdata->data.xdpf = xdpf;
 
 	/* Report BQL before sending the packet */
@@ -804,6 +807,7 @@ EXPORT_SYMBOL_GPL(emac_xmit_xdp_frame);
  */
 static u32 emac_run_xdp(struct prueth_emac *emac, struct xdp_buff *xdp, u32 *len)
 {
+	enum prueth_tx_buff_type tx_buff_type;
 	struct net_device *ndev = emac->ndev;
 	struct netdev_queue *netif_txq;
 	int cpu = smp_processor_id();
@@ -826,11 +830,21 @@ static u32 emac_run_xdp(struct prueth_emac *emac, struct xdp_buff *xdp, u32 *len
 			goto drop;
 		}
 
+		/* In AF_XDP zero-copy mode xdp_convert_buff_to_frame()
+		 * clones the xsk buffer into a fresh MEM_TYPE_PAGE_ORDER0
+		 * page that is not DMA mapped. Such a frame must be mapped
+		 * via the NDO path; only a page pool-backed frame already
+		 * carries a usable page_pool DMA address.
+		 */
+		tx_buff_type = (xdpf->mem_type == MEM_TYPE_PAGE_POOL ?
+				PRUETH_TX_BUFF_TYPE_XDP_TX :
+				PRUETH_TX_BUFF_TYPE_XDP_NDO);
+
 		q_idx = cpu % emac->tx_ch_num;
 		netif_txq = netdev_get_tx_queue(ndev, q_idx);
 		__netif_tx_lock(netif_txq, cpu);
 		result = emac_xmit_xdp_frame(emac, xdpf, q_idx,
-					     PRUETH_TX_BUFF_TYPE_XDP_TX);
+					     tx_buff_type);
 		__netif_tx_unlock(netif_txq);
 		if (result == ICSSG_XDP_CONSUMED) {
 			ndev->stats.tx_dropped++;
@@ -1395,6 +1409,7 @@ void prueth_tx_cleanup(void *data, dma_addr_t desc_dma)
 		dev_kfree_skb_any(skb);
 		break;
 	case PRUETH_SWDATA_XDPF:
+	case PRUETH_SWDATA_XDPF_TX:
 		xdpf = swdata->data.xdpf;
 		xdp_return_frame(xdpf);
 		break;
diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
index df93d15c5b78..00bb760d68a9 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
@@ -153,6 +153,7 @@ enum prueth_swdata_type {
 	PRUETH_SWDATA_CMD,
 	PRUETH_SWDATA_XDPF,
 	PRUETH_SWDATA_XSK,
+	PRUETH_SWDATA_XDPF_TX,
 };
 
 enum prueth_tx_buff_type {
-- 
2.53.0


^ permalink raw reply related

* [PATCH] octeontx2-af: Fix pci_dev reference leak in cgx_print_dmac_flt
From: Wentao Liang @ 2026-06-25  6:39 UTC (permalink / raw)
  To: Sunil Goutham, Linu Cherian, Geetha sowjanya, hariprasad,
	Subbaraya Sundeep
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel, Wentao Liang, stable

In cgx_print_dmac_flt(), pci_get_device() is called to look up the AF
PCI device, but its return value is passed directly to pci_get_drvdata()
without saving the pointer. This means pci_dev_put() can never be called
for the obtained device, causing a reference count leak.

Fix it by saving the return value of pci_get_device() in a local variable
and releasing it via pci_dev_put() after the drvdata is extracted.
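
The get/put imbalance can be shown with a minimal userspace model. The types
and functions here are hypothetical (toy_dev and the toy_* helpers are not
the PCI core API); they only mimic a lookup that returns an object with an
extra reference held:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a reference-counted device (not struct pci_dev). */
struct toy_dev {
	int refcount;
	void *drvdata;
};

/* Lookup hands back the device with one extra reference held. */
static struct toy_dev *toy_get_device(struct toy_dev *d)
{
	if (d)
		d->refcount++;
	return d;
}

static void toy_dev_put(struct toy_dev *d)
{
	if (d)
		d->refcount--;
}

static void *toy_get_drvdata(struct toy_dev *d)
{
	return d->drvdata;
}

/* Leaky: the looked-up pointer is never saved, so it cannot be put. */
static void *toy_lookup_leaky(struct toy_dev *d)
{
	return toy_get_drvdata(toy_get_device(d));
}

/* Fixed: save the pointer, read drvdata, then drop the reference. */
static void *toy_lookup_fixed(struct toy_dev *d)
{
	struct toy_dev *found = toy_get_device(d);
	void *data;

	if (!found)
		return NULL;
	data = toy_get_drvdata(found);
	toy_dev_put(found);
	return data;
}
```

Each call to toy_lookup_leaky() raises the count by one permanently, whereas
toy_lookup_fixed() leaves it unchanged and also handles a failed lookup.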

Cc: stable@vger.kernel.org
Fixes: dbc52debf95f ("octeontx2-af: Debugfs support for DMAC filters")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
 .../net/ethernet/marvell/octeontx2/af/rvu_debugfs.c   | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
index fa461489acdd..90dc13df9ff9 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
@@ -2949,7 +2949,7 @@ RVU_DEBUG_SEQ_FOPS(cgx_stat, cgx_stat_display, NULL);
 
 static int cgx_print_dmac_flt(struct seq_file *s, int lmac_id)
 {
-	struct pci_dev *pdev = NULL;
+	struct pci_dev *af_pdev, *pdev = NULL;
 	void *cgxd = s->private;
 	char *bcast, *mcast;
 	u16 index, domain;
@@ -2958,8 +2958,13 @@ static int cgx_print_dmac_flt(struct seq_file *s, int lmac_id)
 	u64 cfg, mac;
 	int pf;
 
-	rvu = pci_get_drvdata(pci_get_device(PCI_VENDOR_ID_CAVIUM,
-					     PCI_DEVID_OCTEONTX2_RVU_AF, NULL));
+	af_pdev = pci_get_device(PCI_VENDOR_ID_CAVIUM,
+				 PCI_DEVID_OCTEONTX2_RVU_AF, NULL);
+	if (!af_pdev)
+		return -ENODEV;
+
+	rvu = pci_get_drvdata(af_pdev);
+	pci_dev_put(af_pdev);
 	if (!rvu)
 		return -ENODEV;
 
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related

* [PATCH net] net: airoha: fix max receive size configuration
From: Lorenzo Bianconi @ 2026-06-25  6:49 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Lorenzo Bianconi
  Cc: linux-arm-kernel, linux-mediatek, netdev, Madhur Agrawal

Set the GDM maximum receive size to AIROHA_MAX_RX_SIZE unconditionally
during hardware initialization instead of updating it according to the
configured MTU. This avoids dropping incoming frames that exceed the
current MTU but could still be processed by the networking stack, which
is able to fragment the reply on the TX side (e.g. ICMP echo requests).
Move the per-port MTU configuration to the PPE egress path where it
belongs, and set the TX frame size via airoha_ppe_set_xmit_frame_size(),
which dynamically tracks the maximum MTU across running interfaces
sharing the same PPE instance.
Fix the PPE MTU register addressing to pack two port entries per
register word and add WAN_MTU0 configuration for non-LAN GDM devices.

Fixes: 54d989d58d2a ("net: airoha: Move min/max packet len configuration in airoha_dev_open()")
Tested-by: Madhur Agrawal <madhur.agrawal@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/airoha/airoha_eth.c  | 68 ++++++++++---------------------
 drivers/net/ethernet/airoha/airoha_eth.h  |  2 +
 drivers/net/ethernet/airoha/airoha_ppe.c  | 39 +++++++++++++-----
 drivers/net/ethernet/airoha/airoha_regs.h |  9 ++--
 4 files changed, 58 insertions(+), 60 deletions(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 932b3a3df2e5..3f451c2d4c24 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -178,10 +178,15 @@ static void airoha_fe_maccr_init(struct airoha_eth *eth)
 {
 	int p;
 
-	for (p = 1; p <= ARRAY_SIZE(eth->ports); p++)
+	for (p = 1; p <= ARRAY_SIZE(eth->ports); p++) {
 		airoha_fe_set(eth, REG_GDM_FWD_CFG(p),
 			      GDM_TCP_CKSUM_MASK | GDM_UDP_CKSUM_MASK |
 			      GDM_IP4_CKSUM_MASK | GDM_DROP_CRC_ERR_MASK);
+		airoha_fe_rmw(eth, REG_GDM_LEN_CFG(p),
+			      GDM_SHORT_LEN_MASK | GDM_LONG_LEN_MASK,
+			      FIELD_PREP(GDM_SHORT_LEN_MASK, 60) |
+			      FIELD_PREP(GDM_LONG_LEN_MASK, AIROHA_MAX_RX_SIZE));
+	}
 
 	airoha_fe_rmw(eth, REG_CDM_VLAN_CTRL(1), CDM_VLAN_MASK,
 		      FIELD_PREP(CDM_VLAN_MASK, 0x8100));
@@ -1831,13 +1836,24 @@ static void airoha_update_hw_stats(struct airoha_gdm_dev *dev)
 	spin_unlock(&port->stats_lock);
 }
 
+static void airoha_dev_set_xmit_frame_size(struct net_device *netdev)
+{
+	struct airoha_gdm_dev *dev = netdev_priv(netdev);
+
+	airoha_ppe_set_xmit_frame_size(dev);
+	if (!airoha_is_lan_gdm_dev(dev))
+		airoha_fe_rmw(dev->eth, REG_WAN_MTU0, WAN_MTU0_MASK,
+			      FIELD_PREP(WAN_MTU0_MASK,
+					 VLAN_ETH_HLEN + netdev->mtu));
+}
+
 static int airoha_dev_open(struct net_device *netdev)
 {
-	int err, len = ETH_HLEN + netdev->mtu + ETH_FCS_LEN;
 	struct airoha_gdm_dev *dev = netdev_priv(netdev);
 	struct airoha_gdm_port *port = dev->port;
-	u32 cur_len, pse_port = FE_PSE_PORT_PPE1;
 	struct airoha_qdma *qdma = dev->qdma;
+	u32 pse_port = FE_PSE_PORT_PPE1;
+	int err;
 
 	netif_tx_start_all_queues(netdev);
 	err = airoha_set_vip_for_gdm_port(dev, true);
@@ -1851,19 +1867,7 @@ static int airoha_dev_open(struct net_device *netdev)
 		airoha_fe_clear(qdma->eth, REG_GDM_INGRESS_CFG(port->id),
 				GDM_STAG_EN_MASK);
 
-	cur_len = airoha_fe_get(qdma->eth, REG_GDM_LEN_CFG(port->id),
-				GDM_LONG_LEN_MASK);
-	if (!port->users || len > cur_len) {
-		/* Opening a sibling net_device with a larger MTU updates the
-		 * MTU of already running devices. This is required to allow
-		 * multiple net_devices with different MTUs to share the same
-		 * GDM port.
-		 */
-		airoha_fe_rmw(qdma->eth, REG_GDM_LEN_CFG(port->id),
-			      GDM_SHORT_LEN_MASK | GDM_LONG_LEN_MASK,
-			      FIELD_PREP(GDM_SHORT_LEN_MASK, 60) |
-			      FIELD_PREP(GDM_LONG_LEN_MASK, len));
-	}
+	airoha_dev_set_xmit_frame_size(netdev);
 	port->users++;
 
 	if (!airoha_is_lan_gdm_dev(dev) &&
@@ -1875,30 +1879,6 @@ static int airoha_dev_open(struct net_device *netdev)
 	return 0;
 }
 
-static void airoha_set_port_mtu(struct airoha_eth *eth,
-				struct airoha_gdm_port *port)
-{
-	u32 len = 0;
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(port->devs); i++) {
-		struct airoha_gdm_dev *dev = port->devs[i];
-		struct net_device *netdev;
-
-		if (!dev)
-			continue;
-
-		netdev = netdev_from_priv(dev);
-		if (netif_running(netdev))
-			len = max_t(u32, len, netdev->mtu);
-	}
-	len += ETH_HLEN + ETH_FCS_LEN;
-
-	airoha_fe_rmw(eth, REG_GDM_LEN_CFG(port->id),
-		      GDM_LONG_LEN_MASK,
-		      FIELD_PREP(GDM_LONG_LEN_MASK, len));
-}
-
 static int airoha_dev_stop(struct net_device *netdev)
 {
 	struct airoha_gdm_dev *dev = netdev_priv(netdev);
@@ -1909,7 +1889,7 @@ static int airoha_dev_stop(struct net_device *netdev)
 	airoha_set_vip_for_gdm_port(dev, false);
 
 	if (--port->users)
-		airoha_set_port_mtu(dev->eth, port);
+		airoha_ppe_set_xmit_frame_size(dev);
 	else
 		airoha_set_gdm_port_fwd_cfg(qdma->eth,
 					    REG_GDM_FWD_CFG(port->id),
@@ -1962,10 +1942,6 @@ static int airoha_enable_gdm2_loopback(struct airoha_gdm_dev *dev)
 		      FIELD_PREP(LPBK_CHAN_MASK, chan) |
 		      LBK_GAP_MODE_MASK | LBK_LEN_MODE_MASK |
 		      LBK_CHAN_MODE_MASK | LPBK_EN_MASK);
-	airoha_fe_rmw(eth, REG_GDM_LEN_CFG(AIROHA_GDM2_IDX),
-		      GDM_SHORT_LEN_MASK | GDM_LONG_LEN_MASK,
-		      FIELD_PREP(GDM_SHORT_LEN_MASK, 60) |
-		      FIELD_PREP(GDM_LONG_LEN_MASK, AIROHA_MAX_MTU));
 	/* Forward the traffic to the proper GDM port */
 	pse_port = port->id == AIROHA_GDM3_IDX ? FE_PSE_PORT_GDM3
 					       : FE_PSE_PORT_GDM4;
@@ -2098,7 +2074,7 @@ static int airoha_dev_change_mtu(struct net_device *netdev, int mtu)
 
 	WRITE_ONCE(netdev->mtu, mtu);
 	if (port->users)
-		airoha_set_port_mtu(dev->eth, port);
+		airoha_dev_set_xmit_frame_size(netdev);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h
index d7ff8c5200e2..0c3fb6e5d7f1 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.h
+++ b/drivers/net/ethernet/airoha/airoha_eth.h
@@ -23,6 +23,7 @@
 #define AIROHA_MAX_DSA_PORTS		7
 #define AIROHA_MAX_NUM_RSTS		3
 #define AIROHA_MAX_MTU			9220
+#define AIROHA_MAX_RX_SIZE		16128
 #define AIROHA_MAX_PACKET_SIZE		2048
 #define AIROHA_NUM_QOS_CHANNELS		4
 #define AIROHA_NUM_QOS_QUEUES		8
@@ -676,6 +677,7 @@ int airoha_get_fe_port(struct airoha_gdm_dev *dev);
 bool airoha_is_valid_gdm_dev(struct airoha_eth *eth,
 			     struct airoha_gdm_dev *dev);
 
+void airoha_ppe_set_xmit_frame_size(struct airoha_gdm_dev *dev);
 void airoha_ppe_set_cpu_port(struct airoha_gdm_dev *dev, u8 ppe_id, u8 fport);
 bool airoha_ppe_is_enabled(struct airoha_eth *eth, int index);
 void airoha_ppe_check_skb(struct airoha_ppe_dev *dev, struct sk_buff *skb,
diff --git a/drivers/net/ethernet/airoha/airoha_ppe.c b/drivers/net/ethernet/airoha/airoha_ppe.c
index 42f4b0f21d17..e7c78293002a 100644
--- a/drivers/net/ethernet/airoha/airoha_ppe.c
+++ b/drivers/net/ethernet/airoha/airoha_ppe.c
@@ -97,6 +97,33 @@ void airoha_ppe_set_cpu_port(struct airoha_gdm_dev *dev, u8 ppe_id, u8 fport)
 		      __field_prep(DFT_CPORT_MASK(fport), fe_cpu_port));
 }
 
+void airoha_ppe_set_xmit_frame_size(struct airoha_gdm_dev *dev)
+{
+	struct airoha_gdm_port *port = dev->port;
+	struct airoha_eth *eth = dev->eth;
+	int i, ppe_id, index;
+	u32 len = 0;
+
+	for (i = 0; i < ARRAY_SIZE(port->devs); i++) {
+		struct airoha_gdm_dev *d = port->devs[i];
+		struct net_device *netdev;
+
+		if (!d)
+			continue;
+
+		netdev = netdev_from_priv(d);
+		if (netif_running(netdev))
+			len = max_t(u32, len, netdev->mtu);
+	}
+	len += VLAN_ETH_HLEN;
+
+	ppe_id = !airoha_is_lan_gdm_dev(dev) && airoha_ppe_is_enabled(eth, 1);
+	index = port->id == AIROHA_GDM4_IDX ? 7 : port->id;
+	airoha_fe_rmw(eth, REG_PPE_MTU(ppe_id, index),
+		      FP_EGRESS_MTU_MASK(index),
+		      __field_prep(FP_EGRESS_MTU_MASK(index), len));
+}
+
 static void airoha_ppe_hw_init(struct airoha_ppe *ppe)
 {
 	u32 sram_ppe_num_data_entries = PPE_SRAM_NUM_ENTRIES, sram_num_entries;
@@ -115,8 +142,6 @@ static void airoha_ppe_hw_init(struct airoha_ppe *ppe)
 		PPE_RAM_NUM_ENTRIES_SHIFT(sram_ppe_num_data_entries);
 
 	for (i = 0; i < eth->soc->num_ppe; i++) {
-		int p;
-
 		airoha_fe_wr(eth, REG_PPE_TB_BASE(i),
 			     ppe->foe_dma + sram_tb_size);
 
@@ -166,15 +191,6 @@ static void airoha_ppe_hw_init(struct airoha_ppe *ppe)
 		airoha_fe_wr(eth, REG_PPE_HASH_SEED(i), PPE_HASH_SEED);
 		airoha_fe_clear(eth, REG_PPE_PPE_FLOW_CFG(i),
 				PPE_FLOW_CFG_IP6_6RD_MASK);
-
-		for (p = 0; p < ARRAY_SIZE(eth->ports); p++)
-			airoha_fe_rmw(eth, REG_PPE_MTU(i, p),
-				      FP0_EGRESS_MTU_MASK |
-				      FP1_EGRESS_MTU_MASK,
-				      FIELD_PREP(FP0_EGRESS_MTU_MASK,
-						 AIROHA_MAX_MTU) |
-				      FIELD_PREP(FP1_EGRESS_MTU_MASK,
-						 AIROHA_MAX_MTU));
 	}
 
 	for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
@@ -196,6 +212,7 @@ static void airoha_ppe_hw_init(struct airoha_ppe *ppe)
 				 airoha_ppe_is_enabled(eth, 1);
 			fport = airoha_get_fe_port(dev);
 			airoha_ppe_set_cpu_port(dev, ppe_id, fport);
+			airoha_ppe_set_xmit_frame_size(dev);
 		}
 	}
 }
diff --git a/drivers/net/ethernet/airoha/airoha_regs.h b/drivers/net/ethernet/airoha/airoha_regs.h
index 436f3c8779c1..6fed63d013b4 100644
--- a/drivers/net/ethernet/airoha/airoha_regs.h
+++ b/drivers/net/ethernet/airoha/airoha_regs.h
@@ -327,9 +327,8 @@
 #define PPE_SRAM_TABLE_EN_MASK			BIT(0)
 
 #define REG_PPE_MTU_BASE(_n)			(((_n) ? PPE2_BASE : PPE1_BASE) + 0x304)
-#define REG_PPE_MTU(_m, _n)			(REG_PPE_MTU_BASE(_m) + ((_n) << 2))
-#define FP1_EGRESS_MTU_MASK			GENMASK(29, 16)
-#define FP0_EGRESS_MTU_MASK			GENMASK(13, 0)
+#define REG_PPE_MTU(_m, _n)			(REG_PPE_MTU_BASE(_m) + (((_n) / 2) << 2))
+#define FP_EGRESS_MTU_MASK(_n)			GENMASK(13 + (((_n) % 2) << 4), ((_n) % 2) << 4)
 
 #define REG_PPE_RAM_CTRL(_n)			(((_n) ? PPE2_BASE : PPE1_BASE) + 0x31c)
 #define PPE_SRAM_CTRL_ACK_MASK			BIT(31)
@@ -377,6 +376,10 @@
 #define REG_SRC_PORT_FC_MAP6		0x2298
 #define FC_ID_OF_SRC_PORT_MASK(_n)	GENMASK(4 + ((_n) << 3), ((_n) << 3))
 
+#define REG_WAN_MTU0			0x2300
+#define WAN_MTU1_MASK			GENMASK(29, 16)
+#define WAN_MTU0_MASK			GENMASK(13, 0)
+
 #define REG_CDM5_RX_OQ1_DROP_CNT	0x29d4
 
 /* QDMA */

---
base-commit: fd1269e454089abda0e4f9e5e25ecd02a90ab009
change-id: 20260618-airoha-fix-rx-max-len-57654b661646

Best regards,
-- 
Lorenzo Bianconi <lorenzo@kernel.org>
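
For readers following the register math: the reworked REG_PPE_MTU() and
FP_EGRESS_MTU_MASK() macros in the diff above pack two 14-bit egress MTU
fields into each 32-bit word. The following is a minimal userspace sketch
of that indexing, not driver code. GENMASK_U32(), fp_mtu_offset(),
fp_mtu_mask() and fp_mtu_field() are hypothetical stand-ins for
illustration; the PPE base address is deliberately left out.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK() on 32-bit values. */
#define GENMASK_U32(h, l)	(((~0u) << (l)) & (~0u >> (31 - (h))))

/* Byte offset from REG_PPE_MTU_BASE: two ports share one 32-bit word. */
static uint32_t fp_mtu_offset(unsigned int n)
{
	return (n / 2) << 2;
}

/* Even ports use bits 13:0, odd ports use bits 29:16. */
static uint32_t fp_mtu_mask(unsigned int n)
{
	unsigned int shift = (n % 2) << 4;

	return GENMASK_U32(13 + shift, shift);
}

/* Place len into the field for port n, as __field_prep() would. */
static uint32_t fp_mtu_field(unsigned int n, uint32_t len)
{
	unsigned int shift = (n % 2) << 4;

	return (len << shift) & fp_mtu_mask(n);
}
```

With index 7 (the value the patch uses for GDM4) and a 1500-byte MTU plus
VLAN_ETH_HLEN (18), the sketch selects the word at offset 12 and writes
1518 into bits 29:16.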



* [PATCH v2 net 0/3] net: udp_tunnel: fix races and use-after-free
From: Eric Dumazet @ 2026-06-25  6:59 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Yue Sun, Stanislav Fomichev, netdev, eric.dumazet,
	Eric Dumazet

Yue Sun reported a use-after-free and debugobjects warning in
udp_tunnel_nic_device_sync_work() when concurrently creating and
destroying netdevsim and geneve devices.

This series resolves the UAF and the underlying data races that
would otherwise undermine the fix.

The core issue is a workqueue re-queue race combined with data races
introduced by the lock-splitting in commit 1ead7501094c ("udp_tunnel:
remove rtnl_lock dependency"). That commit allowed the device reset
path (reset_ntf) to run without holding the RTNL lock (using only
utn->lock), while the port addition paths (add_port) still run under
RTNL without acquiring utn->lock.

This series fixes these issues in three steps:

1. Patch 1 (Jakub's fix) addresses the UAF by preventing double-queueing
   of the sync work. If work_pending is already set, we return early
   in device_sync(), blocking a second work item from entering the
   queue while the first is blocked on RTNL.

2. Patch 2 converts the state flags (need_sync, need_replay, work_pending)
   from bitfields to atomic bitops. Because these flags share a single
   byte, concurrent read-modify-write (RMW) updates from the RTNL-locked
   path and the RTNL-less reset path can corrupt the byte. Such corruption
   could clear work_pending, defeating the UAF fix.

3. Patch 3 fixes a similar data race on the 'missed' bitmap. Writes
   (__set_bit) happen under RTNL, while reads (should_replay) happen
   under utn->lock without RTNL. We convert this to use atomic set_bit(),
   READ_ONCE() for the fast-path read, and WRITE_ONCE() for clearing.
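
The access pattern described in step 3 can be sketched in userspace with
C11 atomics. This is only an analogy, under the assumption that a relaxed
atomic OR stands in for the kernel's set_bit() and relaxed load/store
stand in for READ_ONCE()/WRITE_ONCE(). The helper names missed_set(),
missed_any() and missed_clear() are hypothetical and do not appear in the
series.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the 'missed' bitmap. */
static atomic_ulong missed;

/* Writer side: record that table t overflowed (analogue of set_bit()). */
static void missed_set(unsigned int t)
{
	atomic_fetch_or_explicit(&missed, 1UL << t, memory_order_relaxed);
}

/* Reader side: one untorn load of the bitmap (analogue of READ_ONCE()). */
static bool missed_any(void)
{
	return atomic_load_explicit(&missed, memory_order_relaxed) != 0;
}

/* Clear the whole bitmap with one store (analogue of WRITE_ONCE()). */
static void missed_clear(void)
{
	atomic_store_explicit(&missed, 0, memory_order_relaxed);
}
```

The point of the sketch is that the writer never performs a plain
load/modify/store on the word, so a reader under a different lock cannot
observe or cause a lost bit.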

Reported-by: Yue Sun <samsun1006219@gmail.com>

Eric Dumazet (3):
  net: udp_tunnel: prevent double queueing in udp_tunnel_nic_device_sync
  net: udp_tunnel: convert state flags to atomic bitops
  net: udp_tunnel: use atomic bitops for missed bitmap

 net/ipv4/udp_tunnel_nic.c | 51 +++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 23 deletions(-)

-- 
2.55.0.rc0.799.gd6f94ed593-goog



* [PATCH v2 net 1/3] net: udp_tunnel: prevent double queueing in udp_tunnel_nic_device_sync
From: Eric Dumazet @ 2026-06-25  6:59 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Yue Sun, Stanislav Fomichev, netdev, eric.dumazet,
	Eric Dumazet
In-Reply-To: <20260625065938.654652-1-edumazet@google.com>

Yue Sun reported a use-after-free and debugobjects warning in
udp_tunnel_nic_device_sync_work() during concurrent device operations.

The workqueue core clears the internal pending bit before invoking the
worker. At that point, a concurrent thread can queue the work again.
When the already running worker eventually clears the work_pending flag
to 0, it mistakenly clears the flag for the newly queued instance.
udp_tunnel_nic_unregister() then observes work_pending as 0 and frees
the structure while the second work item is still active in the queue,
leading to UAF.

Fix this by returning early in udp_tunnel_nic_device_sync() if
work_pending is already set, preventing redundant work queueing.
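
The interleaving can be replayed as a deterministic single-threaded model.
This is a sketch under simplifying assumptions, not kernel code: struct
utn_model, model_device_sync() and simulate() are hypothetical names, and
wq_pending merely stands in for the workqueue core's internal pending bit.

```c
#include <stdbool.h>

struct utn_model {
	bool need_sync;
	bool work_pending;	/* driver-level flag checked by unregister */
	bool wq_pending;	/* stand-in for the workqueue core's pending bit */
};

/* Model of udp_tunnel_nic_device_sync(); guard selects the patched check. */
static void model_device_sync(struct utn_model *m, bool guard)
{
	if (!m->need_sync || (guard && m->work_pending))
		return;

	m->wq_pending = true;		/* queue_work() */
	m->work_pending = true;
}

/* Returns true if unregister would free while a work item is still queued. */
static bool simulate(bool guard)
{
	struct utn_model m = { .need_sync = true };

	model_device_sync(&m, guard);	/* first queueing */
	m.wq_pending = false;		/* core dequeues, worker blocks on RTNL */
	model_device_sync(&m, guard);	/* concurrent caller tries to queue again */
	m.work_pending = false;		/* first worker finally runs */
	m.need_sync = false;

	return !m.work_pending && m.wq_pending;
}
```

In this model simulate(false) reports the hazard (work_pending reads 0
while an item is still queued), and simulate(true) does not, because the
second caller returns before queueing.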

Fixes: cc4e3835eff4 ("udp_tunnel: add central NIC RX port offload infrastructure")
Reported-by: Yue Sun <samsun1006219@gmail.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp_tunnel_nic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/udp_tunnel_nic.c b/net/ipv4/udp_tunnel_nic.c
index 9944ed923ddfd10f9adf6ad788c0740daeaf2adb..3b32a0afa9798d3c416d9ae570e6d529f70e6697 100644
--- a/net/ipv4/udp_tunnel_nic.c
+++ b/net/ipv4/udp_tunnel_nic.c
@@ -301,7 +301,7 @@ __udp_tunnel_nic_device_sync(struct net_device *dev, struct udp_tunnel_nic *utn)
 static void
 udp_tunnel_nic_device_sync(struct net_device *dev, struct udp_tunnel_nic *utn)
 {
-	if (!utn->need_sync)
+	if (!utn->need_sync || utn->work_pending)
 		return;
 
 	queue_work(udp_tunnel_nic_workqueue, &utn->work);
-- 
2.55.0.rc0.799.gd6f94ed593-goog



* [PATCH v2 net 2/3] net: udp_tunnel: convert state flags to atomic bitops
From: Eric Dumazet @ 2026-06-25  6:59 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Yue Sun, Stanislav Fomichev, netdev, eric.dumazet,
	Eric Dumazet
In-Reply-To: <20260625065938.654652-1-edumazet@google.com>

The state flags of struct udp_tunnel_nic (need_sync, need_replay,
work_pending) are currently bitfields sharing a single byte.

These flags can be modified concurrently from different contexts:
- RTNL-locked paths (like add_port/del_port) write to need_sync and
  work_pending.
- The RTNL-less reset path (reset_ntf, used by netdevsim) writes to
  need_sync and need_replay under utn->lock.

Since they share a byte, each write is compiled into a non-atomic
read-modify-write (RMW) of the whole byte, so concurrent writes can
overwrite each other's updates. For example, a write to need_replay in
reset_ntf can overwrite and clear work_pending, defeating the
double-queueing prevention and causing UAF.

Fix this by converting these state flags to atomic bitops, ensuring
safe concurrent writes across RTNL-locked and RTNL-less paths.
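
The lost update can be illustrated with a deterministic replay of the two
writers. This is a userspace sketch that uses C11 atomic_fetch_or() as an
analogue of the kernel's set_bit(); the function names and the *_BIT
constants are hypothetical and only mirror the bit positions chosen in
this patch.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NEED_REPLAY_BIT		1u
#define WORK_PENDING_BIT	2u

/* Both writers load the shared byte first, then each stores its update. */
static uint8_t plain_rmw_interleaved(void)
{
	uint8_t flags = 0;
	uint8_t a = flags;	/* writer A (RTNL path) loads */
	uint8_t b = flags;	/* writer B (reset path) loads the same value */

	flags = a | (1u << WORK_PENDING_BIT);	/* A stores work_pending */
	flags = b | (1u << NEED_REPLAY_BIT);	/* B stores, discarding A's bit */
	return flags;
}

/* Same two updates, each performed as one indivisible atomic OR. */
static uint8_t atomic_rmw_interleaved(void)
{
	atomic_uchar flags = 0;

	atomic_fetch_or(&flags, 1u << WORK_PENDING_BIT);
	atomic_fetch_or(&flags, 1u << NEED_REPLAY_BIT);
	return atomic_load(&flags);
}
```

In the plain variant only need_replay survives (0x02); with atomic ORs
both bits are retained (0x06), whatever order the two updates land in.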

Fixes: 1ead7501094c ("udp_tunnel: remove rtnl_lock dependency")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp_tunnel_nic.c | 43 ++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/udp_tunnel_nic.c b/net/ipv4/udp_tunnel_nic.c
index 3b32a0afa9798d3c416d9ae570e6d529f70e6697..840be5d79fc0ac3142049dcb9f1105a5844da9ae 100644
--- a/net/ipv4/udp_tunnel_nic.c
+++ b/net/ipv4/udp_tunnel_nic.c
@@ -30,9 +30,7 @@ struct udp_tunnel_nic_table_entry {
  * @work:	async work for talking to hardware from process context
  * @dev:	netdev pointer
  * @lock:	protects all fields
- * @need_sync:	at least one port start changed
- * @need_replay: space was freed, we need a replay of all ports
- * @work_pending: @work is currently scheduled
+ * @flags:	sync, replay, pending flags
  * @n_tables:	number of tables under @entries
  * @missed:	bitmap of tables which overflown
  * @entries:	table of tables of ports currently offloaded
@@ -44,9 +42,10 @@ struct udp_tunnel_nic {
 
 	struct mutex lock;
 
-	u8 need_sync:1;
-	u8 need_replay:1;
-	u8 work_pending:1;
+	unsigned long flags;
+#define UDP_TUNNEL_NIC_NEED_SYNC	0
+#define UDP_TUNNEL_NIC_NEED_REPLAY	1
+#define UDP_TUNNEL_NIC_WORK_PENDING	2
 
 	unsigned int n_tables;
 	unsigned long missed;
@@ -116,7 +115,7 @@ udp_tunnel_nic_entry_queue(struct udp_tunnel_nic *utn,
 			   unsigned int flag)
 {
 	entry->flags |= flag;
-	utn->need_sync = 1;
+	set_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags);
 }
 
 static void
@@ -283,7 +282,7 @@ udp_tunnel_nic_device_sync_by_table(struct net_device *dev,
 static void
 __udp_tunnel_nic_device_sync(struct net_device *dev, struct udp_tunnel_nic *utn)
 {
-	if (!utn->need_sync)
+	if (!test_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags))
 		return;
 
 	if (dev->udp_tunnel_nic_info->sync_table)
@@ -291,21 +290,27 @@ __udp_tunnel_nic_device_sync(struct net_device *dev, struct udp_tunnel_nic *utn)
 	else
 		udp_tunnel_nic_device_sync_by_port(dev, utn);
 
-	utn->need_sync = 0;
+	clear_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags);
 	/* Can't replay directly here, in case we come from the tunnel driver's
 	 * notification - trying to replay may deadlock inside tunnel driver.
 	 */
-	utn->need_replay = udp_tunnel_nic_should_replay(dev, utn);
+	if (udp_tunnel_nic_should_replay(dev, utn))
+		set_bit(UDP_TUNNEL_NIC_NEED_REPLAY, &utn->flags);
+	else
+		clear_bit(UDP_TUNNEL_NIC_NEED_REPLAY, &utn->flags);
 }
 
 static void
 udp_tunnel_nic_device_sync(struct net_device *dev, struct udp_tunnel_nic *utn)
 {
-	if (!utn->need_sync || utn->work_pending)
+	if (!test_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags))
+		return;
+
+	if (test_bit(UDP_TUNNEL_NIC_WORK_PENDING, &utn->flags))
 		return;
 
 	queue_work(udp_tunnel_nic_workqueue, &utn->work);
-	utn->work_pending = 1;
+	set_bit(UDP_TUNNEL_NIC_WORK_PENDING, &utn->flags);
 }
 
 static bool
@@ -552,7 +557,7 @@ static void __udp_tunnel_nic_reset_ntf(struct net_device *dev)
 
 	mutex_lock(&utn->lock);
 
-	utn->need_sync = false;
+	clear_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags);
 	for (i = 0; i < utn->n_tables; i++)
 		for (j = 0; j < info->tables[i].n_entries; j++) {
 			struct udp_tunnel_nic_table_entry *entry;
@@ -696,8 +701,8 @@ udp_tunnel_nic_flush(struct net_device *dev, struct udp_tunnel_nic *utn)
 	for (i = 0; i < utn->n_tables; i++)
 		memset(utn->entries[i], 0, array_size(info->tables[i].n_entries,
 						      sizeof(**utn->entries)));
-	WARN_ON(utn->need_sync);
-	utn->need_replay = 0;
+	WARN_ON(test_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags));
+	clear_bit(UDP_TUNNEL_NIC_NEED_REPLAY, &utn->flags);
 }
 
 static void
@@ -714,7 +719,7 @@ udp_tunnel_nic_replay(struct net_device *dev, struct udp_tunnel_nic *utn)
 		for (j = 0; j < info->tables[i].n_entries; j++)
 			udp_tunnel_nic_entry_freeze_used(&utn->entries[i][j]);
 	utn->missed = 0;
-	utn->need_replay = 0;
+	clear_bit(UDP_TUNNEL_NIC_NEED_REPLAY, &utn->flags);
 
 	if (!info->shared) {
 		udp_tunnel_get_rx_info(dev);
@@ -736,10 +741,10 @@ static void udp_tunnel_nic_device_sync_work(struct work_struct *work)
 	rtnl_lock();
 	mutex_lock(&utn->lock);
 
-	utn->work_pending = 0;
+	clear_bit(UDP_TUNNEL_NIC_WORK_PENDING, &utn->flags);
 	__udp_tunnel_nic_device_sync(utn->dev, utn);
 
-	if (utn->need_replay)
+	if (test_bit(UDP_TUNNEL_NIC_NEED_REPLAY, &utn->flags))
 		udp_tunnel_nic_replay(utn->dev, utn);
 
 	mutex_unlock(&utn->lock);
@@ -904,7 +909,7 @@ udp_tunnel_nic_unregister(struct net_device *dev, struct udp_tunnel_nic *utn)
 	/* Wait for the work to be done using the state, netdev core will
 	 * retry unregister until we give up our reference on this device.
 	 */
-	if (utn->work_pending)
+	if (test_bit(UDP_TUNNEL_NIC_WORK_PENDING, &utn->flags))
 		return;
 
 	udp_tunnel_nic_free(utn);
-- 
2.55.0.rc0.799.gd6f94ed593-goog

