Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Breno Leitao @ 2026-06-18 14:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Peter Zijlstra, Petr Mladek, Sebastian Andrzej Siewior,
	John Ogness, Sergey Senozhatsky, Vlad Poenaru, Thomas Gleixner,
	netdev, David S . Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260617132127.645534d1@kernel.org>

On Wed, Jun 17, 2026 at 01:21:27PM -0700, Jakub Kicinski wrote:
> On Wed, 17 Jun 2026 07:56:50 -0700 Breno Leitao wrote:
> > As far as I can tell, there isn't a network driver today whose transmit
> > path is completely lockless, so, even if we make netpoll lockless.
> > 
> > It's unlikely any NIC will ever achieve this, given that NIC TX
> > fundamentally relies on a shared DMA ring and doorbell register, which
> > inherently cannot be made lockless.
> 
> The lock which protects the queue is maintained by the stack,
> and we trylock it. Maybe I lost the thread but if you're saying
> that writes to netconsole are impossible from arbitrary context,
> that is _not_ true, AFAIU. We can queue a packet and kick off 
> the transfer on well-behaved drivers.
> 
> Main problem is the opportunistic freeing up of the queue space.
> If we could avoid that in atomic context I think we'd be good.

Thanks for the clarification, this is quite valuable.

Let me verify my understanding: if we switched to __raise_softirq_irqoff()
in dev_kfree_skb_irq_reason(), the issue would be resolved since we'd
avoid waking ksoftirqd and therefore wouldn't touch the runqueue lock in this
code path.

However, while that would eliminate the nested lock problem, it could
increase memory pressure by delaying SKB garbage collection, which may
not be acceptable.

Naive question: What if we deferred SKB cleanup only during netpoll operations?

Such as tracking in_netpoll per cpu:

		struct softnet_data {
			....
	+ 		bool                    in_netpoll;
		}

and then choosing between __raise_softirq_irqoff() and raise_softirq_irqoff()?

	@@ -3456,7 +3456,13 @@ void dev_kfree_skb_irq_reason(struct sk_buff *skb, enum skb_drop_reason reason)
		local_irq_save(flags);
		skb->next = __this_cpu_read(softnet_data.completion_queue);
		__this_cpu_write(softnet_data.completion_queue, skb);
	-       raise_softirq_irqoff(NET_TX_SOFTIRQ);
	+       if (__this_cpu_read(softnet_data.in_netpoll))
	+               __raise_softirq_irqoff(NET_TX_SOFTIRQ);
	+       else
	+               raise_softirq_irqoff(NET_TX_SOFTIRQ);
		local_irq_restore(flags);
	}


Is it too hacky!?

Thanks,
--breno

^ permalink raw reply

* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Sebastian Andrzej Siewior @ 2026-06-18 15:04 UTC (permalink / raw)
  To: Zhou, Yun
  Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, netdev, linux-kernel
In-Reply-To: <06d0158e-bf4c-4ad1-8ad3-c8176003ab11@windriver.com>

On 2026-06-18 21:56:12 [+0800], Zhou, Yun wrote:
> 
> > But if the thread is idle then you have one enable too many, don't you?
> > Well you have the NAPI callback which does disable on the local CPU and
> > this resume which enables it on every CPU. So this does not look right.
> > 
> 
> The enable in resume is intentionally unconditional and idempotent
> (writing MPIC_INT_CLEAR_MASK on an already unmasked IRQ is a no-op).
> 
> > The interesting question is what happens to the enable_percpu_irq() from
> > the mvneta_poll(). Is it lost? And if so, how/ why?
> > 
> 
> The enable_percpu_irq() from mvneta_poll is not "lost" — it never
> gets a chance to execute. The sequence is:
> 
> 1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
> 2. PM freezes kthreads (on PREEMPT_RT, softirq runs in kthread)

But napi_schedule() runs in ksoftird, doesn't it? The per-CPU IRQs are
not threaded. That does not look optimal.

> 3. NAPI poll cannot run → enable_percpu_irq() is never called
> 4. mvneta_stop_dev → napi_disable(): cancels the scheduled poll
>     but does NOT execute the completion path (no enable_percpu_irq)

napi_schedule() sets NAPIF_STATE_SCHED.
napi_disable() sets NAPI_STATE_DISABLE and waits until NAPIF_STATE_SCHED
is cleared. So if NAPIF_STATE_SCHED was set then enable_percpu_irq()
will be invoked unless it leaves somewhere early.
However, if DISABLED was already set then it disables the IRQ source but
does not schedule NAPI. This is probably what happens.

> 5. Resume → napi_enable(): resets NAPI state but MPIC stays masked
> 
> The unconditional enable in resume covers this case. When NAPI was
> idle at suspend time, the extra enable is harmless.

There is no desc::depth counting here, that got me confused. But that
per-CPU irq is not optimal. Is this a SoC limitation or a design choice?
Having a NAPI instance with IRQ per queue and those configured and
spread among CPUs should cause less trouble and is what others do.
In fact is the only net driver using per-CPU interrupts.

> BR,
> Yun

Sebastian

^ permalink raw reply

* Re: [PATCH net] octeontx2-af: fix CGX debugfs RVU AF PCI reference leaks
From: Simon Horman @ 2026-06-18 15:07 UTC (permalink / raw)
  To: Ratheesh Kannoth
  Cc: davem, hkelam, lcherian, linux-kernel, netdev, pabeni, sgoutham,
	andrew+netdev, edumazet, kuba, Yuho Choi
In-Reply-To: <20260617104525.1321395-1-rkannoth@marvell.com>

On Wed, Jun 17, 2026 at 04:15:25PM +0530, Ratheesh Kannoth wrote:
> CGX per-lmac debugfs seq readers obtained struct rvu via
> pci_get_drvdata(pci_get_device(..., PCI_DEVID_OCTEONTX2_RVU_AF, ...)),
> which leaks a PCI device reference on every read. Store rvu and the CGX
> handle in debugfs inode private data when creating stats, mac_filter,
> and fwdata files (one context per CGX), and use debugfs aux numbers for
> fwdata so lmac_id matches the other CGX debugfs entries.
> 
> Fixes: f967488d095e ("octeontx2-af: Add per CGX port level NIX Rx/Tx counters")
> Fixes: dbc52debf95f ("octeontx2-af: Debugfs support for DMAC filters")
> Fixes: 49f02e6877d1 ("Octeontx2-af: Debugfs support for firmware data")
> Cc: Linu Cherian <lcherian@marvell.com>
> Reported-by: Yuho Choi <dbgh9129@gmail.com>
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>

The nit below not withstanding this looks good to me.

Reviewed-by: Simon Horman <horms@kernel.org>

> ---
>  .../marvell/octeontx2/af/rvu_debugfs.c        | 77 ++++++++++---------
>  1 file changed, 42 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c

...

> @@ -2831,18 +2839,14 @@ static void rvu_dbg_npa_init(struct rvu *rvu)
>  
>  static int cgx_print_stats(struct seq_file *s, int lmac_id)
>  {
> +	struct rvu_cgx_lmac_dbgfs_ctx *dctx = s->private;
>  	struct cgx_link_user_info linfo;
>  	struct mac_ops *mac_ops;
> -	void *cgxd = s->private;
> +	void *cgxd = dctx->cgxd;
> +	struct rvu *rvu = dctx->rvu;

nit: It would be nice to preserve reverse xmas tree order - longest line to
     shortest - for local variable declarations. Likewise elsewhere in this
     patch.

>  	u64 ucast, mcast, bcast;
>  	int stat = 0, err = 0;
>  	u64 tx_stat, rx_stat;

...

^ permalink raw reply

* [PATCH iwl-net 1/2] ice: skip per-VLAN promisc rules when default VSI Rx rule is set
From: Petr Oros @ 2026-06-18 15:09 UTC (permalink / raw)
  To: netdev
  Cc: Petr Oros, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alice Michael, Jacob Keller, Ivan Vecera, Michal Swiatkowski,
	Grzegorz Nitka, intel-wired-lan, linux-kernel
In-Reply-To: <cover.1781786935.git.poros@redhat.com>

When an ice port is part of a vlan-filtering bridge with a wide VLAN
trunk and the netdev is in IFF_PROMISC (typical for bond slaves
attached to a bridge), the driver installs per-VLAN
ICE_SW_LKUP_PROMISC_VLAN entries (recipe 9) in addition to the broad
ICE_SW_LKUP_DFLT VSI Rx rule (recipe 5). Each per-VLAN rule consumes
one Flow Lookup Unit (FLU) entry from a fixed hardware pool of "up to
32K FLU entries" per device, documented in the E810 datasheet
(613875-009 section 7.8.10, Table 7-18, page 1015).

With three active PFs sharing one switch context and a bridge trunk of
vid 2-4094, the configuration would require roughly

  3 PFs * 4093 VLANs * 3 rules per VLAN per PF ~= 36,800 rules

which exceeds the 32K FLU budget. Firmware then responds to further
Add Switch Rules requests with AQ retval 0x10 (LIBIE_AQ_RC_ENOSPC) and
the user-visible failure surfaces as

  ice 0000:5c:00.1: Failed to set VSI 14 as the default forwarding
                    VSI, error -5
  ice 0000:5c:00.1 ens1f1: Error -5 setting default VSI 14 Rx rule

After a switch context has been driven into overrun, subsequent
retries can come back as AQ retval 0x2 (LIBIE_AQ_RC_ENOENT), which has
misled triage attempts toward a perceived recipe binding defect
rather than a capacity issue.

When the DFLT VSI Rx rule is in place it catches every packet on the
lport regardless of VLAN tag, so the per-VLAN PROMISC_VLAN expansion
is redundant. The recipe 4 VLAN prune entries are still installed
per VLAN and continue to track the allowed VID set, but the
IFF_PROMISC sync path disables their enforcement on the VSI via
vlan_ops->dis_rx_filtering() before ice_set_promisc() runs.
ena_rx_filtering() is restored when IFF_PROMISC is cleared.

Skip the per-VLAN expansion at the two call sites that drive it:
ice_set_promisc() falls through to ice_fltr_set_vsi_promisc() and
ice_vlan_rx_add_vid() omits the per-VLAN ICE_MCAST_VLAN_PROMISC_BITS
add. Plain IFF_ALLMULTI without an installed DFLT VSI rule is
unchanged and still installs per-VLAN multicast promisc rules.

Both checks use ice_is_vsi_dflt_vsi() which inspects the recipe
filter list for an installed DFLT rule on this VSI, not
netdev->flags & IFF_PROMISC. The HW-state predicate avoids two
regression vectors that a user-intent predicate would introduce:

1. ice_lag_is_switchdev_running() short-circuits ice_set_dflt_vsi()
   to return 0 without installing the DFLT rule for a PF in
   switchdev LAG mode. An IFF_PROMISC-only check would also
   suppress the per-VLAN fallback, leaving the PF with no rule.

2. When ice_set_dflt_vsi() returns a non-EEXIST error (FLU
   exhausted, switch context divergence), the driver clears
   IFF_PROMISC from vsi->current_netdev_flags but the netdev's own
   flags retain IFF_PROMISC. The user-intent predicate would still
   suppress the per-VLAN fallback even though DFLT failed to
   install.

The predicate is install-time only. The IFF_PROMISC off path closes
the lifecycle gap in ice_vsi_exit_dflt_promisc(): for an IFF_ALLMULTI
VSI with VLANs it reinstates the per-VID rules before clearing the
default rule, so multicast coverage never lapses. If that AQ call
fails the default rule is left in place, ice_vsi_exit_dflt_promisc()
returns the error, and the sync_fltr pass bails with
vsi->current_netdev_flags |= IFF_PROMISC; the current/netdev flag
mismatch re-fires the IFF_PROMISC off path on the next sync. Clearing
the default rule first would instead expose a window where neither
the default rule nor the per-VID rules carry multicast.

If ice_clear_dflt_vsi() fails after the per-VID rules were reinstated
they are deliberately not rolled back. Clearing the default rule is a
removal that frees an FLU entry rather than allocating one, so it
cannot fail for lack of space; a failure is a transient AdminQ error.
The per-VID rules are the steady state for an IFF_ALLMULTI VLAN VSI,
so the only redundant entry left behind is the single un-removed
default rule, not the per-VID set. The retry re-enters this path,
ice_fltr_set_vlan_vsi_promisc() returns -EEXIST for the rules that
already exist so nothing is reallocated, and the default rule is
removed on the next attempt. Rolling the per-VID rules back here would
instead churn thousands of removes and re-adds on every retry.

After the default rule is gone the vid=0 PROMISC rule that paired
with it is redundant and is dropped, but only to reclaim a filter
entry, so a failure there is logged and does not abort the
transition.

ice_set_vsi_promisc() and ice_clear_vsi_promisc() dispatch the
recipe based on whether ICE_PROMISC_VLAN_RX/TX bits are present in
the mask: with the bits set, recipe ICE_SW_LKUP_PROMISC_VLAN is
used; otherwise ICE_SW_LKUP_PROMISC. The else branch in
ice_set_promisc() installs the vid=0 rule in ICE_SW_LKUP_PROMISC.
Because ice_clear_promisc() with VLANs present adds the VLAN bits
and would search ICE_SW_LKUP_PROMISC_VLAN, the recipe mismatch
would leave the vid=0 ICE_SW_LKUP_PROMISC rule orphaned when VLANs
are configured. This is a single stale rule, not a per-cycle leak:
re-adding it on the next promisc on returns -EEXIST rather than
allocating a new entry. The set-time recipe is not recorded, so
ice_clear_promisc() clears both recipes; clearing a rule that is not
present succeeds, both clears run unconditionally, and the first
error is returned.

The two VLAN-0 recipe transition blocks in ice_vlan_rx_add_vid()
and ice_vlan_rx_kill_vid() that promote / demote the vid=0 rule
between ICE_SW_LKUP_PROMISC and ICE_SW_LKUP_PROMISC_VLAN are
likewise guarded by !ice_is_vsi_dflt_vsi(). With DFLT in place the
vid=0 rule already covers every VID and a recipe swap would only
install a redundant rule.

Lab reproduction on an E810-C with the same firmware family (4.80,
NVM 1.3805.0, DDP 1.3.43.0) using four PFs in vlan-filtering bridges
with vid 2-4094 and the slaves brought to IFF_PROMISC before the
bridge VLAN bulk add:

  before fix:  ~12,279 AQ Add Switch Rules per PF, ENOSPC and ENOENT
               responses in dmesg, DFLT VSI Rx rule install fails on
               the affected PF
  after fix:   ~4,093 AQ Add Switch Rules per PF, no AQ errors, DFLT
               VSI Rx rule installs on every PF

The 66.7% reduction in installed switch rules per PF matches the
expected per-VLAN saving: a single DFLT rule replaces the per-VID
PROMISC_VLAN expansion.

Functional regression test with vid 2-100 trunk between two ice
ports through the lab switch (40/40 PASS, 0 AQ errors, 0 ENOSPC
at 4093-VID customer scale):

  vid 50 unicast, vid 100 unicast, vid 50 broadcast ARP,
    vid 100 multicast IPv6 ND
  vid 200/500/1500/4000 isolation (out-of-trunk) and untagged not
    leaked: 0 packets reach any bridge endpoint
  IGMP/MLD snooping, Jumbo MTU 9000, reserved-multicast STP BPDU
  IFF_PROMISC + IFF_ALLMULTI transition (off while allmulti stays)
  Regression reproducer for commit 1273f89578f2 ("ice: Fix broken
    IFF_ALLMULTI handling"): allmulti on -> add vid -> allmulti off
    -> allmulti on plus the orphan-rule Scenario 2; both converge
    with no stale rules
  100-VID, 1000-VID, 4093-VID stress cycles (5/3/2 iterations each)
  switchdev mode toggle preserves IFF_PROMISC pruning state across
    the session (vid 999 multicast received before and after the
    legacy -> switchdev -> legacy cycle)
  SR-IOV: VFs unaffected because ice_set_promisc() early-returns
    for non-PF VSI and VF representors do not register
    ndo_vlan_rx_add_vid

Fixes: 1273f89578f2 ("ice: Fix broken IFF_ALLMULTI handling")
Signed-off-by: Petr Oros <poros@redhat.com>
---
 drivers/net/ethernet/intel/ice/ice_main.c | 90 ++++++++++++++++++-----
 1 file changed, 70 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 6d24056c247cf4..af8df81fc45623 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -274,7 +274,8 @@ static int ice_set_promisc(struct ice_vsi *vsi, u8 promisc_m)
 	if (vsi->type != ICE_VSI_PF)
 		return 0;

-	if (ice_vsi_has_non_zero_vlans(vsi)) {
+	/* skip per-VID expansion; the DFLT Rx rule already covers every VID */
+	if (ice_vsi_has_non_zero_vlans(vsi) && !ice_is_vsi_dflt_vsi(vsi)) {
 		promisc_m |= (ICE_PROMISC_VLAN_RX | ICE_PROMISC_VLAN_TX);
 		status = ice_fltr_set_vlan_vsi_promisc(&vsi->back->hw, vsi,
 						       promisc_m);
@@ -304,9 +305,19 @@ static int ice_clear_promisc(struct ice_vsi *vsi, u8 promisc_m)
 		return 0;

 	if (ice_vsi_has_non_zero_vlans(vsi)) {
-		promisc_m |= (ICE_PROMISC_VLAN_RX | ICE_PROMISC_VLAN_TX);
+		int vid0_status;
+
+		/* set time used either recipe (per-VID PROMISC_VLAN, or vid=0
+		 * PROMISC via the ice_set_promisc() else branch), so clear
+		 * both; clearing an absent rule succeeds
+		 */
 		status = ice_fltr_clear_vlan_vsi_promisc(&vsi->back->hw, vsi,
-							 promisc_m);
+				promisc_m | ICE_PROMISC_VLAN_RX |
+				ICE_PROMISC_VLAN_TX);
+		vid0_status = ice_fltr_clear_vsi_promisc(&vsi->back->hw,
+							 vsi->idx, promisc_m, 0);
+		if (!status)
+			status = vid0_status;
 	} else {
 		status = ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
 						    promisc_m, 0);
@@ -317,6 +328,49 @@ static int ice_clear_promisc(struct ice_vsi *vsi, u8 promisc_m)
 	return status;
 }

+/**
+ * ice_vsi_exit_dflt_promisc - drop the default VSI Rx rule on promisc off
+ * @vsi: the VSI leaving promiscuous mode
+ *
+ * For an IFF_ALLMULTI VSI with VLANs the per-VID multicast rules are
+ * reinstated before the default rule is cleared so coverage never lapses;
+ * the then redundant vid=0 rule is dropped best-effort. The callees log
+ * their own failures, so error returns are not re-logged here.
+ *
+ * Return: 0 on success, negative on error with the default rule left in place.
+ */
+static int ice_vsi_exit_dflt_promisc(struct ice_vsi *vsi)
+{
+	struct ice_vsi_vlan_ops *vlan_ops = ice_get_compat_vsi_vlan_ops(vsi);
+	struct net_device *netdev = vsi->netdev;
+	struct ice_hw *hw = &vsi->back->hw;
+	bool restore_mc;
+	int err;
+
+	restore_mc = (vsi->current_netdev_flags & IFF_ALLMULTI) &&
+		     ice_vsi_has_non_zero_vlans(vsi);
+
+	if (restore_mc) {
+		err = ice_fltr_set_vlan_vsi_promisc(hw, vsi,
+						    ICE_MCAST_VLAN_PROMISC_BITS);
+		if (err && err != -EEXIST)
+			return err;
+	}
+
+	err = ice_clear_dflt_vsi(vsi);
+	if (err)
+		return err;
+
+	if (netdev->features & NETIF_F_HW_VLAN_CTAG_FILTER)
+		vlan_ops->ena_rx_filtering(vsi);
+
+	if (restore_mc)
+		ice_fltr_clear_vsi_promisc(hw, vsi->idx, ICE_MCAST_PROMISC_BITS,
+					   0);
+
+	return 0;
+}
+
 /**
  * ice_vsi_sync_fltr - Update the VSI filter list to the HW
  * @vsi: ptr to the VSI
@@ -442,17 +496,12 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
 		} else {
 			/* Clear Rx filter to remove traffic from wire */
 			if (ice_is_vsi_dflt_vsi(vsi)) {
-				err = ice_clear_dflt_vsi(vsi);
+				err = ice_vsi_exit_dflt_promisc(vsi);
 				if (err) {
-					netdev_err(netdev, "Error %d clearing default VSI %i Rx rule\n",
-						   err, vsi->vsi_num);
 					vsi->current_netdev_flags |=
 						IFF_PROMISC;
 					goto out_promisc;
 				}
-				if (vsi->netdev->features &
-				    NETIF_F_HW_VLAN_CTAG_FILTER)
-					vlan_ops->ena_rx_filtering(vsi);
 			}

 			/* disable allmulti here, but only if allmulti is not
@@ -3676,10 +3725,9 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
 	while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
 		usleep_range(1000, 2000);

-	/* Add multicast promisc rule for the VLAN ID to be added if
-	 * all-multicast is currently enabled.
-	 */
-	if (vsi->current_netdev_flags & IFF_ALLMULTI) {
+	/* skip the per-VID rule when the DFLT Rx rule already covers this VID */
+	if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+	    !ice_is_vsi_dflt_vsi(vsi)) {
 		ret = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
 					       ICE_MCAST_VLAN_PROMISC_BITS,
 					       vid);
@@ -3697,11 +3745,12 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
 	if (ret)
 		goto finish;

-	/* If all-multicast is currently enabled and this VLAN ID is only one
-	 * besides VLAN-0 we have to update look-up type of multicast promisc
-	 * rule for VLAN-0 from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN.
+	/* On the first non-zero VLAN, promote the VLAN-0 multicast promisc
+	 * rule from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN. Skip when
+	 * the DFLT Rx rule is installed; it already covers every VID.
 	 */
 	if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+	    !ice_is_vsi_dflt_vsi(vsi) &&
 	    ice_vsi_num_non_zero_vlans(vsi) == 1) {
 		ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
 					   ICE_MCAST_PROMISC_BITS, 0);
@@ -3764,11 +3813,12 @@ int ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
 					   ICE_MCAST_VLAN_PROMISC_BITS, vid);

 	if (!ice_vsi_has_non_zero_vlans(vsi)) {
-		/* Update look-up type of multicast promisc rule for VLAN 0
-		 * from ICE_SW_LKUP_PROMISC_VLAN to ICE_SW_LKUP_PROMISC when
-		 * all-multicast is enabled and VLAN 0 is the only VLAN rule.
+		/* Last non-zero VLAN gone: demote the VLAN-0 multicast promisc
+		 * rule back to ICE_SW_LKUP_PROMISC. Skip when the DFLT Rx rule
+		 * is installed; no recipe swap is needed.
 		 */
-		if (vsi->current_netdev_flags & IFF_ALLMULTI) {
+		if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+		    !ice_is_vsi_dflt_vsi(vsi)) {
 			ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
 						   ICE_MCAST_VLAN_PROMISC_BITS,
 						   0);
-- 
2.53.0

^ permalink raw reply related

* [PATCH iwl-net 2/2] ice: preserve uplink DFLT Rx rule on switchdev release
From: Petr Oros @ 2026-06-18 15:09 UTC (permalink / raw)
  To: netdev
  Cc: Petr Oros, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alice Michael, Jacob Keller, Ivan Vecera, Michal Swiatkowski,
	Grzegorz Nitka, intel-wired-lan, linux-kernel
In-Reply-To: <cover.1781786935.git.poros@redhat.com>

ice_eswitch_setup_env() calls ice_set_dflt_vsi() to install the
ICE_SW_LKUP_DFLT Rx rule on the uplink VSI. The helper returns 0 even
when the rule is already in place, so the call is a no-op if
ice_vsi_sync_fltr() had previously installed the DFLT rule in response
to IFF_PROMISC on the uplink netdev. ice_remove_vsi_fltr() called
earlier in ice_eswitch_setup_env() does not affect this rule because
ice_remove_vsi_lkup_fltr() lacks a case for ICE_SW_LKUP_DFLT and falls
into its default branch which only logs. Switchdev mode then adds an
ICE_FLTR_TX leg via ice_cfg_dflt_vsi() on the same VSI handle.

ice_eswitch_release_env() unconditionally removed both the Rx and Tx
DFLT rules. When the Rx DFLT was installed by ice_vsi_sync_fltr()
before the switchdev session started, this clobbered promisc state the
operator had asked for: the DFLT Rx rule disappeared while IFF_PROMISC
was still set on the netdev, and the IFF_PROMISC sync path was not
retriggered, so the uplink ended the session without the catch-all
rule the netdev flags requested.

Skip the Rx DFLT removal when the uplink is still promiscuous, both in
ice_eswitch_release_env() and in the err_def_tx unwind of
ice_eswitch_setup_env(). The Tx leg installed by switchdev is always
removed since switchdev owns it.

The ena_rx_filtering() call earlier in ice_eswitch_release_env() is
left unconditional: it calls ice_cfg_vlan_pruning(), which returns
without enabling pruning while the netdev is in IFF_PROMISC, so it
cannot re-enable VLAN pruning under the preserved DFLT rule and drop
tagged traffic. Pruning is re-enabled later, when the IFF_PROMISC sync
path runs after promisc is actually cleared.

Use vsi->current_netdev_flags rather than the live netdev->flags for
this test. netdev->flags is written under RTNL by dev_change_flags(),
while ice_eswitch_release_env() runs under devl_lock, so reading it
here would be a TOCTOU against a concurrent promisc change. The
IFF_PROMISC bit of current_netdev_flags is written only under
ICE_CFG_BUSY by ice_vsi_sync_fltr(), and ice_set_rx_mode() gates that
sync off for the uplink while ice_is_switchdev_running() is true. The
bit is therefore frozen for the whole session and stable when
release_env reads it.

Because the sync is gated off during the session, a promisc change the
operator makes while switchdev runs never reaches ice_vsi_sync_fltr():
current_netdev_flags keeps the value captured before the session while
netdev->flags carries the new one. Once switchdev is torn down and
pf->eswitch.is_running is cleared, schedule a filter sync from
ice_eswitch_disable_switchdev() so the suppressed change is replayed
and the DFLT Rx rule is reconciled with the current netdev flags. This
also closes the window where release_env kept the rule based on the
frozen flag but the operator had since cleared IFF_PROMISC.

Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment")
Signed-off-by: Petr Oros <poros@redhat.com>
---
 drivers/net/ethernet/intel/ice/ice_eswitch.c | 32 +++++++++++++++++---
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
index 2e4f0969035f77..b6073fc2375019 100644
--- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
+++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
@@ -66,8 +66,10 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
 	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
 			 ICE_FLTR_TX);
 err_def_tx:
-	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
-			 ICE_FLTR_RX);
+	/* keep the Rx DFLT rule if still promiscuous (see release_env) */
+	if (!(uplink_vsi->current_netdev_flags & IFF_PROMISC))
+		ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
+				 false, ICE_FLTR_RX);
 err_def_rx:
 	ice_vsi_del_vlan_zero(uplink_vsi);
 err_vlan_zero:
@@ -275,11 +277,23 @@ static void ice_eswitch_release_env(struct ice_pf *pf)
 	vlan_ops = ice_get_compat_vsi_vlan_ops(uplink_vsi);

 	ice_vsi_update_local_lb(uplink_vsi, false);
+	/* No-op while IFF_PROMISC is set: ice_cfg_vlan_pruning() self-gates on
+	 * it, so this cannot re-enable VLAN pruning under a preserved DFLT rule.
+	 */
 	vlan_ops->ena_rx_filtering(uplink_vsi);
 	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
 			 ICE_FLTR_TX);
-	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
-			 ICE_FLTR_RX);
+
+	/* Keep the Rx DFLT rule if the uplink is still promiscuous; it must
+	 * outlive the session. current_netdev_flags is used because its
+	 * IFF_PROMISC bit only changes under ice_vsi_sync_fltr(), gated off
+	 * during switchdev, so the read cannot race the RTNL netdev->flags.
+	 * Any change made during the session is replayed on teardown.
+	 */
+	if (!(uplink_vsi->current_netdev_flags & IFF_PROMISC))
+		ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
+				 false, ICE_FLTR_RX);
+
 	ice_fltr_add_mac_and_broadcast(uplink_vsi,
 				       uplink_vsi->port_info->mac.perm_addr,
 				       ICE_FWD_TO_VSI);
@@ -327,10 +341,20 @@ static int ice_eswitch_enable_switchdev(struct ice_pf *pf)
  */
 static void ice_eswitch_disable_switchdev(struct ice_pf *pf)
 {
+	struct ice_vsi *uplink_vsi = pf->eswitch.uplink_vsi;
+
 	ice_eswitch_br_offloads_deinit(pf);
 	ice_eswitch_release_env(pf);

 	pf->eswitch.is_running = false;
+
+	/* ice_set_rx_mode() was gated off during the session; replay a filter
+	 * sync so any suppressed promisc change reconciles the DFLT Rx rule.
+	 */
+	set_bit(ICE_VSI_UMAC_FLTR_CHANGED, uplink_vsi->state);
+	set_bit(ICE_VSI_MMAC_FLTR_CHANGED, uplink_vsi->state);
+	set_bit(ICE_FLAG_FLTR_SYNC, pf->flags);
+	ice_service_task_schedule(pf);
 }

 /**
-- 
2.53.0

^ permalink raw reply related

* [PATCH iwl-net 0/2] ice: fix DFLT Rx rule handling for promisc and switchdev
From: Petr Oros @ 2026-06-18 15:09 UTC (permalink / raw)
  To: netdev
  Cc: Petr Oros, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alice Michael, Jacob Keller, Ivan Vecera, Michal Swiatkowski,
	Grzegorz Nitka, intel-wired-lan, linux-kernel

Two fixes for the uplink default VSI Rx rule (DFLT) on E810 when the
netdev is in IFF_PROMISC.

Patch 1 drops the redundant per-VLAN promisc expansion that exhausts
the FLU pool on a wide VLAN trunk across several PFs.

Patch 2 keeps the DFLT Rx rule across a switchdev teardown instead of
clobbering the promisc state the operator asked for.

Lab tested on E810-C: functional, VLAN isolation, IFF_ALLMULTI
regression, stress/flap and switchdev-toggle suites pass with no AQ
errors, and the FLU pool stays under its ceiling with all four PFs
loaded.

Petr Oros (2):
  ice: skip per-VLAN promisc rules when default VSI Rx rule is set
  ice: preserve uplink DFLT Rx rule on switchdev release

 drivers/net/ethernet/intel/ice/ice_eswitch.c | 32 ++++++-
 drivers/net/ethernet/intel/ice/ice_main.c    | 90 +++++++++++++++-----
 2 files changed, 98 insertions(+), 24 deletions(-)

-- 
2.53.0

^ permalink raw reply

* lan7801 looses VLAN Filter Table
From: Sven Schuchmann @ 2026-06-18 15:18 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,
I have a problem with a lan7801 chip in Kernel 6.18. I configure VLAN-ID (2) and an IP address.
But if I disconnect and connect the network-cable several times at some point no packets are 
received anymore. Without using VLAN this does not happen.

I tracked this down that sometimes the VLAN Filter table seems
to get cleared. I hooked into the lan78xx.c driver to dump the vlan table:

static void lan78xx_get_stats(struct net_device *netdev,
			      struct ethtool_stats *stats, u64 *data)
{
	struct lan78xx_net *dev = netdev_priv(netdev);
	struct lan78xx_priv *pdata = (struct lan78xx_priv *)(dev->data[0]);

	lan78xx_update_stats(dev);

	for (int i = 0; i < 3; i++) {
		u32 buf;
		lan78xx_dataport_read(dev, DP_SEL_RSEL_VLAN_DA_, i, 1, &buf);
		if (pdata->vlan_table[i] != buf) 
			netdev_err(dev->net, "VLAN TABLE %d: 0x%08x 0x%08x", i, pdata->vlan_table[i], buf);
		else
			netdev_info(dev->net, "VLAN TABLE %d: 0x%08x 0x%08x", i, pdata->vlan_table[i], buf);
	}

So I can "read out" the table if I do "ethtool -S" and see it in the kernel log.
Normally the output looks like this:
VLAN TABLE 0: 0x00000005 0x00000005
So the table looks as expected. The Local Filter table from pdata is the same as in the chip itself.

But after some cable disconnects and connects I see this:
VLAN TABLE 0: 0x00000005 0x00000000
So the table got cleared or deleted and no paketes on VLAN-ID 2 go through.
I even can do this after I read out the table in lan78xx_get_stats():

	lan78xx_dataport_write(dev, DP_SEL_RSEL_VLAN_DA_, 0,
				DP_SEL_VHF_VLAN_LEN, pdata->vlan_table);

...and with this I can "fix" the table again from the ethtool and it starts working again.

Has someone seen something like this or can point me to a direction where
I could reinit this table (I already tried at the end of lan78xx_mac_link_up() without success...)

Thanks!

Regards, Sven

^ permalink raw reply

* Re: [PATCH iwl v3] ice: retry reading NVM if admin queue returns EBUSY
From: Simon Horman @ 2026-06-18 15:20 UTC (permalink / raw)
  To: robert.malz
  Cc: Simon Horman, anthony.l.nguyen, przemyslaw.kitszel,
	intel-wired-lan, netdev
In-Reply-To: <20260617120753.1785565-1-robert.malz@canonical.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ice: retry reading NVM if admin queue returns EBUSY

When the admin queue command to read NVM returns EBUSY, the driver
currently treats it as a fatal error and aborts the entire read operation.
This patch adds retry logic to handle this condition.

> diff --git a/drivers/net/ethernet/intel/ice/ice_nvm.c b/drivers/net/ethernet/intel/ice/ice_nvm.c
> --- a/drivers/net/ethernet/intel/ice/ice_nvm.c
> +++ b/drivers/net/ethernet/intel/ice/ice_nvm.c

[ ... ]

> @@ -96,11 +97,25 @@ ice_read_flat_nvm(struct ice_hw *hw, u32 offset, u32 *length, u8 *data,
>  					 offset, read_size,
>  					 data + bytes_read, last_cmd,
>  					 read_shadow_ram, NULL);
> -		if (status)
> -			break;
> -
> -		bytes_read += read_size;
> -		offset += read_size;
> +		if (status) {
> +			if (hw->adminq.sq_last_status != LIBIE_AQ_RC_EBUSY ||
> +			    retry_cnt > ICE_SQ_SEND_MAX_EXECUTE)

[Severity: Low]
Does this check allow an extra retry execution? Since retry_cnt starts at 0, 
using a strictly greater-than check allows the loop to retry 4 times
(5 total executions) instead of the 3 executions intended by
ICE_SQ_SEND_MAX_EXECUTE.

> +				break;
> +			ice_debug(hw, ICE_DBG_NVM,
> +				  "NVM read EBUSY error, retry %d\n",
> +				  retry_cnt + 1);
> +			last_cmd = false;
> +			ice_release_nvm(hw);
> +			msleep(ICE_SQ_SEND_DELAY_TIME_MS);

[Severity: High]
Does dropping the NVM lock here violate the caller's assumption of mutual
exclusion?

Callers of ice_read_flat_nvm() typically acquire the NVM lock beforehand to
ensure an atomic read of multiple sectors without interleaved writes. By
explicitly dropping the lock during the msleep() window, another thread, VF,
or PF could acquire the NVM lock and modify the NVM before the read resumes,
potentially resulting in reading torn or inconsistent NVM data.

> +			status = ice_acquire_nvm(hw, ICE_RES_READ);
> +			if (status)
> +				break;

[Severity: High]
Can this lead to an unbalanced NVM resource lock release and potential data
corruption?

If ice_acquire_nvm() fails here, the loop breaks and returns an error with the
lock currently unheld. However, callers like ice_read_flash_module() are
unaware of this internal lock state change and will unconditionally call
ice_release_nvm() upon failure:

drivers/net/ethernet/intel/ice/ice_nvm.c:ice_read_flash_module() {
	status = ice_acquire_nvm(hw, ICE_RES_READ);
	if (status)
		return status;

	status = ice_read_flat_nvm(hw, start + offset, &length, data, false);

	ice_release_nvm(hw);
    ...

Because firmware tracks resource locks per-PF rather than per-OS-thread, this
errant second release could forcefully strip the lock from another thread on
the same PF that successfully acquired it, exposing the NVM to concurrent
modification.

> +			retry_cnt++;
> +		} else {
> +			bytes_read += read_size;
> +			offset += read_size;
> +			retry_cnt = 0;
> +		}
>  	} while (!last_cmd);
>  
>  	*length = bytes_read;

^ permalink raw reply

* Re: [PATCH net 1/5] net: hns3: unify copper port ksettings configuration path
From: Simon Horman @ 2026-06-18 15:33 UTC (permalink / raw)
  To: shaojijie
  Cc: Simon Horman, davem, edumazet, kuba, pabeni, andrew+netdev,
	shenjian15, liuyonglong, chenhao418, huangdonghua3, yangshuaisong,
	netdev, linux-kernel
In-Reply-To: <20260617112721.75186-2-shaojijie@huawei.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: hns3: unify copper port ksettings configuration path

Refactor hns3_set_link_ksettings() and hclge_set_phy_link_ksettings()
to unify the configuration path for copper ports.

> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
[ ... ]
> @@ -863,6 +859,8 @@ static int hns3_set_link_ksettings(struct net_device *netdev,
>  	struct hnae3_handle *handle = hns3_get_handle(netdev);
>  	struct hnae3_ae_dev *ae_dev = hns3_get_ae_dev(handle);
>  	const struct hnae3_ae_ops *ops = hns3_get_ops(handle);
> +	u8 module_type = HNAE3_MODULE_TYPE_UNKNOWN;
> +	u8 media_type = HNAE3_MEDIA_TYPE_UNKNOWN;
>  	int ret;
>  
>  	/* Chip don't support this mode. */
> @@ -878,22 +876,20 @@ static int hns3_set_link_ksettings(struct net_device *netdev,
>  		  cmd->base.autoneg, cmd->base.speed, cmd->base.duplex,
>  		  cmd->lanes);
>  
> -	/* Only support ksettings_set for netdev with phy attached for now */
> -	if (netdev->phydev) {
> -		if (cmd->base.speed == SPEED_1000 &&
> -		    cmd->base.autoneg == AUTONEG_DISABLE)
> -			return -EINVAL;

[Severity: Medium]
Does removing this validation allow invalid configurations for native copper
PHYs?

The previous code explicitly rejected SPEED_1000 with AUTONEG_DISABLE because
IEEE 802.3 1000BASE-T strictly requires autonegotiation. By delegating
directly to phy_ethtool_ksettings_set(), which permits this for 1000BASE-X,
the driver might now silently accept an invalid configuration.

> -		return phy_ethtool_ksettings_set(netdev->phydev, cmd);
> -	} else if (test_bit(HNAE3_DEV_SUPPORT_PHY_IMP_B, ae_dev->caps) &&
> -		   ops->set_phy_link_ksettings) {
> +	if (!ops->get_media_type)
> +		return -EOPNOTSUPP;
> +	ops->get_media_type(handle, &media_type, &module_type);
>  
> -		return ops->set_phy_link_ksettings(handle, cmd);
> -	}
> +	if (media_type == HNAE3_MEDIA_TYPE_COPPER) {
> +		if (!ops->set_phy_link_ksettings)
> +			return -EOPNOTSUPP;
> +		return ops->set_phy_link_ksettings(handle, cmd);
> +	}
[ ... ]
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
[ ... ]
> @@ -3400,10 +3400,28 @@ hclge_set_phy_link_ksettings(struct hnae3_handle *handle,
>  		return ret;
>  	}
>  
> +	linkmode_copy(hdev->hw.mac.advertising, cmd->link_modes.advertising);
> +	return 0;
> +}
> +
> +static int
> +hclge_set_phy_link_ksettings(struct hnae3_handle *handle,
> +			     const struct ethtool_link_ksettings *cmd)
> +{
> +	struct hclge_vport *vport = hclge_get_vport(handle);
> +	struct hclge_dev *hdev = vport->back;
> +	int ret;
> +
> +	if (hnae3_dev_phy_imp_supported(hdev))
> +		ret = hclge_ethtool_ksettings_set(handle, cmd);
> +	else
> +		ret = phy_ethtool_ksettings_set(handle->netdev->phydev, cmd);

[Severity: Critical]
Can handle->netdev->phydev be NULL here, leading to a kernel panic when called
from userspace?

A copper port can legitimately have a NULL phydev if the port's PHY address
is configured as PHY_INEXISTENT in the firmware. 

Because the updated hns3_set_link_ksettings() now routes all 
HNAE3_MEDIA_TYPE_COPPER configurations here without the previous 
if (netdev->phydev) guard, phy_ethtool_ksettings_set() will unconditionally
dereference phydev->mdio.addr.

> +	if (ret)
> +		return ret;
> +
>  	hdev->hw.mac.req_autoneg = cmd->base.autoneg;
>  	hdev->hw.mac.req_speed = cmd->base.speed;
>  	hdev->hw.mac.req_duplex = cmd->base.duplex;

^ permalink raw reply

* Re: [PATCH net 2/5] net: hns3: refactor MAC autoneg and speed configuration
From: Simon Horman @ 2026-06-18 15:34 UTC (permalink / raw)
  To: shaojijie
  Cc: Simon Horman, davem, edumazet, kuba, pabeni, andrew+netdev,
	shenjian15, liuyonglong, chenhao418, huangdonghua3, yangshuaisong,
	netdev, linux-kernel
In-Reply-To: <20260617112721.75186-3-shaojijie@huawei.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
net: hns3: refactor MAC autoneg and speed configuration

This refactor extracts the MAC autoneg and speed/duplex/lane configuration
out of hclge_mac_init() into a new helper hclge_set_autoneg_speed_dup(),
and invokes it from both hclge_init_ae_dev() and hclge_reset_ae_dev().

> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> index 5a00797d9252..2c74675b149f 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> @@ -2957,20 +2957,6 @@ static int hclge_mac_init(struct hclge_dev *hdev)
>  	if (!test_bit(HCLGE_STATE_RST_HANDLING, &hdev->state))
>  		hdev->hw.mac.duplex = HCLGE_MAC_FULL;
>  
> -	if (hdev->hw.mac.support_autoneg) {
> -		ret = hclge_set_autoneg_en(hdev, hdev->hw.mac.autoneg);
> -		if (ret)
> -			return ret;
> -	}
> -
> -	if (!hdev->hw.mac.autoneg) {
> -		ret = hclge_cfg_mac_speed_dup_hw(hdev, hdev->hw.mac.req_speed,
> -						 hdev->hw.mac.req_duplex,
> -						 hdev->hw.mac.lane_num);
> -		if (ret)
> -			return ret;
> -	}
> -
>  	mac->link = 0;

[ ... ]

> @@ -11901,6 +11908,13 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
>  	if (ret)
>  		goto err_ptp_uninit;
>  
> +	ret = hclge_set_autoneg_speed_dup(hdev);
> +	if (ret) {
> +		dev_err(&pdev->dev,
> +			"failed to set autoneg speed duplex, ret = %d\n", ret);
> +		goto err_ptp_uninit;
> +	}
> +

[High]

Has the relocation of the autoneg/speed configuration in hclge_init_ae_dev()
been considered against the surrounding init order?

Previously the equivalent calls ran inside hclge_mac_init() before
hclge_update_port_info(). After this patch the helper runs after
hclge_update_port_info(), and hclge_update_port_info() ->
hclge_get_sfp_info() unconditionally overwrites hdev->hw.mac.autoneg,
hdev->hw.mac.support_autoneg and hdev->hw.mac.lane_num from firmware
whenever resp->speed_ability is non-zero.

The new helper then feeds the firmware-supplied lane_num back through:

    if (!hdev->hw.mac.autoneg) {
        ret = hclge_cfg_mac_speed_dup_hw(hdev, hdev->hw.mac.req_speed,
                                         hdev->hw.mac.req_duplex,
                                         hdev->hw.mac.lane_num);

On a non-copper port where firmware reports forced mode (autoneg=0) and a
stale lane_num inherited from a previous link lifecycle (for example
lane_num=4 from a prior 100G state), the (req_speed, lane_num) pair given
to hclge_cfg_mac_speed_dup_hw() can be inconsistent (for example 25G with
4 lanes), and firmware rejects it with -EINVAL, which propagates back to
hclge_init_ae_dev() and unwinds via err_ptp_uninit, failing the entire
probe.

The follow-up commit a7e52c16ecbf "net: hns3: fix init failure caused by
lane_num contamination" later in this series fixes this by introducing a
separate req_lane_num=0 and passing that to hclge_cfg_mac_speed_dup_hw()
instead of mac.lane_num. Should the fix be folded into this patch so the
series does not introduce a probe-failure regression at this point and
then repair it?

[Medium]

After this change, the last write to MAC speed/duplex on the init path is
hclge_cfg_mac_speed_dup_hw() inside the new helper, which programs
hardware but does not update the cached hdev->hw.mac.speed and
hdev->hw.mac.duplex. Previously the last writer was
hclge_cfg_mac_speed_dup() inside hclge_update_port_info(), which updates
both hardware and the cached fields.

When req_speed differs from the firmware-reported mac.speed (which can
happen if the active SFP speed differs from cfg.default_speed parsed in
hclge_configure()), is hardware left programmed with req_speed while
mac.speed retains the firmware-reported value, leaving the cache out of
sync with the actual hardware configuration until the periodic service
task refreshes mac.speed?

> @@ -12231,6 +12245,13 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev)
>  		return ret;
>  	}
>  
> +	ret = hclge_set_autoneg_speed_dup(hdev);
> +	if (ret) {
> +		dev_err(&pdev->dev,
> +			"failed to set autoneg speed duplex, ret = %d\n", ret);
> +		return ret;
> +	}
> +
>  	ret = hclge_tp_port_init(hdev);
>  	if (ret) {
>  		dev_err(&pdev->dev, "failed to init tp port, ret = %d\n",

[Low]

Could the commit message be tightened to match the resulting code?

The message says the helper "will also be invoked during the hardware
reset recovery path to re-apply link settings without repeating
unnecessary buffer or MTU initializations." However hclge_reset_ae_dev()
still calls hclge_mac_init() immediately before hclge_set_autoneg_speed_dup(),
and hclge_mac_init() still performs hclge_set_mac_mtu(),
hclge_set_default_loopback() and hclge_buffer_alloc(), so the avoidance
of MTU/buffer re-initialization described in the changelog is not
realized here.

The commit message is also framed as a pure readability/cyclomatic-
complexity refactor, but the autoneg/speed configuration is not just
extracted: in hclge_init_ae_dev() it is moved to a materially later
point (after hclge_update_port_info(), TSO, GRO, VLAN, TM, RSS, FD and
PTP setup), versus the original code where it was the very first
hardware-touching step inside hclge_mac_init(). The reset path keeps the
original relative ordering (helper runs immediately after
hclge_mac_init()), so the init and reset paths are now asymmetric in
placement, and the rationale for the late placement on the init path is
not stated.

^ permalink raw reply

* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Maxime Chevallier @ 2026-06-18 15:34 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Zhou, Yun
  Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-kernel
In-Reply-To: <20260618150440.cLDwgyDM@linutronix.de>

Hi,

On 6/18/26 17:04, Sebastian Andrzej Siewior wrote:
> On 2026-06-18 21:56:12 [+0800], Zhou, Yun wrote:
>>
>>> But if the thread is idle then you have one enable too many, don't you?
>>> Well you have the NAPI callback which does disable on the local CPU and
>>> this resume which enables it on every CPU. So this does not look right.
>>>
>>
>> The enable in resume is intentionally unconditional and idempotent
>> (writing MPIC_INT_CLEAR_MASK on an already unmasked IRQ is a no-op).
>>
>>> The interesting question is what happens to the enable_percpu_irq() from
>>> the mvneta_poll(). Is it lost? And if so, how/ why?
>>>
>>
>> The enable_percpu_irq() from mvneta_poll is not "lost" — it never
>> gets a chance to execute. The sequence is:
>>
>> 1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
>> 2. PM freezes kthreads (on PREEMPT_RT, softirq runs in kthread)
> 
> But napi_schedule() runs in ksoftird, doesn't it? The per-CPU IRQs are
> not threaded. That does not look optimal.
> 
>> 3. NAPI poll cannot run → enable_percpu_irq() is never called
>> 4. mvneta_stop_dev → napi_disable(): cancels the scheduled poll
>>     but does NOT execute the completion path (no enable_percpu_irq)
> 
> napi_schedule() sets NAPIF_STATE_SCHED.
> napi_disable() sets NAPI_STATE_DISABLE and waits until NAPIF_STATE_SCHED
> is cleared. So if NAPIF_STATE_SCHED was set then enable_percpu_irq()
> will be invoked unless it leaves somewhere early.
> However, if DISABLED was already set then it disables the IRQ source but
> does not schedule NAPI. This is probably what happens.
> 
>> 5. Resume → napi_enable(): resets NAPI state but MPIC stays masked
>>
>> The unconditional enable in resume covers this case. When NAPI was
>> idle at suspend time, the extra enable is harmless.
> 
> There is no desc::depth counting here, that got me confused. But that
> per-CPU irq is not optimal. Is this a SoC limitation or a design choice?

I _think_ that on mvneta, the mapping is done by assigning queues to CPUs
directly, and then you have per-cpu register banks to handle the interrupts :

https://elixir.bootlin.com/linux/v7.1/source/drivers/net/ethernet/marvell/mvneta.c#L138

This seems to be confirmed by some comments here [1] :

static void mvneta_percpu_mask_interrupt(void *arg)
{
	struct mvneta_port *pp = arg;

	/* All the queue are masked, but actually only the ones
	 * mapped to this CPU will be masked
	 */
	mvreg_write(pp, MVNETA_INTR_NEW_MASK, 0);
	mvreg_write(pp, MVNETA_INTR_OLD_MASK, 0);
	mvreg_write(pp, MVNETA_INTR_MISC_MASK, 0);
}

These registers will control different queue's interrupt behaviour depending
on which CPU executes that code...

[1] : https://elixir.bootlin.com/linux/v7.1/source/drivers/net/ethernet/marvell/mvneta.c#L1441

This may be why the design ended-up that way. I'm not saying this is ideal
though :)

The percpu interrupt mechanism isn't used on armada 3700 (hence all the special
conditions) because IIRC the interrupt routing is flawed for the network part, and
all interrupts end-up on CPU0 anyways...

Maxime


^ permalink raw reply

* Re: [PATCH net 4/5] net: hns3: differentiate autoneg default values between copper and fiber
From: Simon Horman @ 2026-06-18 15:34 UTC (permalink / raw)
  To: shaojijie
  Cc: Simon Horman, davem, edumazet, kuba, pabeni, andrew+netdev,
	shenjian15, liuyonglong, chenhao418, huangdonghua3, yangshuaisong,
	netdev, linux-kernel
In-Reply-To: <20260617112721.75186-5-shaojijie@huawei.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: hns3: differentiate autoneg default values between copper and fiber

This commit addresses a link loss issue during driver initialization on
optical ports connected to forced-mode remote switches. It modifies
hclge_init_ae_dev to set autonegotiation based on media type. Copper ports
retain the AUTONEG_ENABLE default, while optical ports inherit the firmware
pre-configured autoneg setting.

> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> @@ -11916,6 +11916,9 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
>  	if (ret)
>  		goto err_ptp_uninit;
>  
> +	if (hdev->hw.mac.media_type != HNAE3_MEDIA_TYPE_COPPER)
> +		hdev->hw.mac.req_autoneg = hdev->hw.mac.autoneg;
> +

[Severity: High]
Does this apply the wrong speed when autonegotiation is disabled on fiber
ports?

During driver initialization, hclge_update_port_info() dynamically probes
the SFP module and updates mac->speed, but does not update mac->req_speed,
which retains its statically configured default capability.

When hdev->hw.mac.autoneg is 0 (forced-mode), req_autoneg becomes 0, and
then hclge_set_autoneg_speed_dup() uses req_speed:

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:hclge_set_autoneg_speed_dup() {
...
    if (!hdev->hw.mac.req_autoneg) {
        ret = hclge_cfg_mac_speed_dup_hw(hdev, hdev->hw.mac.req_speed,
...
}

Could this override the dynamically probed SFP speed with the static default
speed?

[Severity: High]
Does this unconditionally disable autonegotiation for V1 optical hardware?

Looking at hclge_update_port_info(), it only queries SFP information via
hclge_get_sfp_info() for V2 hardware. For V1 hardware, it falls back to
hclge_get_sfp_speed(), which leaves mac->autoneg uninitialized (0):

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:hclge_update_port_info() {
...
    if (hdev->ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V2) {
        speed = mac->speed;
        ret = hclge_get_sfp_info(hdev, mac);
    } else {
        speed = HCLGE_MAC_SPEED_UNKNOWN;
        ret = hclge_get_sfp_speed(hdev, &speed);
    }
...
}

If mac->autoneg is left as 0, wouldn't req_autoneg unconditionally be set
to 0 here, overriding the speed and breaking connectivity for V1 devices?

>  	ret = hclge_set_autoneg_speed_dup(hdev);
>  	if (ret) {
>  		dev_err(&pdev->dev,

^ permalink raw reply

* Re: [PATCH net] net: airoha: fix BQL underflow and UAF in shared QDMA TX ring
From: Lorenzo Bianconi @ 2026-06-18 15:42 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Wayen Yan, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260618-airoha-bql-fixes-v1-1-ffd2c2089518@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 7943 bytes --]

> When multiple netdevs share a QDMA TX ring and one device is stopped,
> netdev_tx_reset_subqueue() zeroes that device's BQL counters while its
> pending skbs remain in the shared HW TX ring. When NAPI later completes
> those skbs via netdev_tx_completed_queue(), the already-zeroed
> dql->num_queued counter underflows.
> Moreover, in the airoha_remove() path, netdevs are unregistered
> sequentially while skbs from previously unregistered netdevs may still
> reference freed net_device memory via skb->dev, causing a use-after-free
> during BQL accounting.
> Fix both issues:
> - Remove netdev_tx_reset_subqueue() from airoha_dev_stop() so pending
>   skbs are completed naturally by NAPI with proper BQL accounting.
> - Add netdev_tx_completed_queue() in airoha_qdma_cleanup_tx_queue()
>   to properly account for skbs freed during queue teardown.
> - Introduce airoha_qdma_tx_disable() to stop TX on all registered
>   netdevs for a given QDMA instance under RTNL lock.
> - Move DMA engine start/stop into probe/remove and
>   airoha_qdma_cleanup(), ensuring TX queues are cleaned up while all
>   netdevs are still registered and skb->dev is valid.
> 
> Fixes: 6df0488dc9dd ("net: airoha: fix BQL accounting in airoha_qdma_cleanup_tx_queue()")

This Fixes tag is wrong, the proper one is the one below:
Fixes: a9c2ca61fec7 ("net: airoha: Support multiple net_devices for a single FE GDM port")

Regards,
Lorenzo

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  drivers/net/ethernet/airoha/airoha_eth.c | 95 ++++++++++++++++++++++++--------
>  1 file changed, 72 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 64dde6464f3f..4d6a061cd779 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -1004,6 +1004,7 @@ static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
>  
>  		e = &q->entry[index];
>  		skb = e->skb;
> +		e->skb = NULL;
>  
>  		dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
>  				 DMA_TO_DEVICE);
> @@ -1147,6 +1148,42 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
>  	return 0;
>  }
>  
> +static void airoha_qdma_tx_disable(struct airoha_qdma *qdma)
> +{
> +	struct airoha_eth *eth = qdma->eth;
> +	int i;
> +
> +	/* Protect netdev->reg_state and netif_tx_disable() calls. */
> +	rtnl_lock();
> +
> +	for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
> +		struct airoha_gdm_port *port = eth->ports[i];
> +		int j;
> +
> +		if (!port)
> +			continue;
> +
> +		for (j = 0; j < ARRAY_SIZE(port->devs); j++) {
> +			struct airoha_gdm_dev *dev = port->devs[j];
> +			struct net_device *netdev;
> +
> +			if (!dev)
> +				continue;
> +
> +			if (dev->qdma != qdma)
> +				continue;
> +
> +			netdev = netdev_from_priv(dev);
> +			if (netdev->reg_state != NETREG_REGISTERED)
> +				continue;
> +
> +			netif_tx_disable(netdev);
> +		}
> +	}
> +
> +	rtnl_unlock();
> +}
> +
>  static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
>  {
>  	struct airoha_qdma *qdma = q->qdma;
> @@ -1158,13 +1195,20 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
>  	for (i = 0; i < q->ndesc; i++) {
>  		struct airoha_queue_entry *e = &q->entry[i];
>  		struct airoha_qdma_desc *desc = &q->desc[i];
> +		struct sk_buff *skb = e->skb;
>  
>  		if (!e->dma_addr)
>  			continue;
>  
>  		dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
>  				 DMA_TO_DEVICE);
> -		dev_kfree_skb_any(e->skb);
> +		if (skb) {
> +			struct netdev_queue *txq;
> +
> +			txq = skb_get_tx_queue(skb->dev, skb);
> +			netdev_tx_completed_queue(txq, 1, skb->len);
> +			dev_kfree_skb_any(skb);
> +		}
>  		e->dma_addr = 0;
>  		e->skb = NULL;
>  		list_add_tail(&e->list, &q->tx_list);
> @@ -1527,6 +1571,23 @@ static void airoha_qdma_cleanup(struct airoha_qdma *qdma)
>  {
>  	int i;
>  
> +	if (test_bit(DEV_STATE_INITIALIZED, &qdma->eth->state)) {
> +		u32 status;
> +
> +		airoha_qdma_tx_disable(qdma);
> +
> +		airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
> +				  GLOBAL_CFG_TX_DMA_EN_MASK |
> +				  GLOBAL_CFG_RX_DMA_EN_MASK);
> +		if (read_poll_timeout(airoha_qdma_rr, status,
> +				      !(status & (GLOBAL_CFG_TX_DMA_BUSY_MASK |
> +						  GLOBAL_CFG_RX_DMA_BUSY_MASK)),
> +				      USEC_PER_MSEC, 50 * USEC_PER_MSEC, true,
> +				      qdma, REG_QDMA_GLOBAL_CFG))
> +			dev_warn(qdma->eth->dev,
> +				 "QDMA DMA engine busy timeout\n");
> +	}
> +
>  	for (i = 0; i < ARRAY_SIZE(qdma->q_rx); i++) {
>  		if (!qdma->q_rx[i].ndesc)
>  			continue;
> @@ -1837,9 +1898,6 @@ static int airoha_dev_open(struct net_device *netdev)
>  	}
>  	port->users++;
>  
> -	airoha_qdma_set(qdma, REG_QDMA_GLOBAL_CFG,
> -			GLOBAL_CFG_TX_DMA_EN_MASK |
> -			GLOBAL_CFG_RX_DMA_EN_MASK);
>  	qdma->users++;
>  
>  	if (!airoha_is_lan_gdm_dev(dev) &&
> @@ -1880,12 +1938,9 @@ static int airoha_dev_stop(struct net_device *netdev)
>  	struct airoha_gdm_dev *dev = netdev_priv(netdev);
>  	struct airoha_gdm_port *port = dev->port;
>  	struct airoha_qdma *qdma = dev->qdma;
> -	int i;
>  
>  	netif_tx_disable(netdev);
>  	airoha_set_vip_for_gdm_port(dev, false);
> -	for (i = 0; i < netdev->num_tx_queues; i++)
> -		netdev_tx_reset_subqueue(netdev, i);
>  
>  	if (--port->users)
>  		airoha_set_port_mtu(dev->eth, port);
> @@ -1893,19 +1948,7 @@ static int airoha_dev_stop(struct net_device *netdev)
>  		airoha_set_gdm_port_fwd_cfg(qdma->eth,
>  					    REG_GDM_FWD_CFG(port->id),
>  					    FE_PSE_PORT_DROP);
> -
> -	if (!--qdma->users) {
> -		airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
> -				  GLOBAL_CFG_TX_DMA_EN_MASK |
> -				  GLOBAL_CFG_RX_DMA_EN_MASK);
> -
> -		for (i = 0; i < ARRAY_SIZE(qdma->q_tx); i++) {
> -			if (!qdma->q_tx[i].ndesc)
> -				continue;
> -
> -			airoha_qdma_cleanup_tx_queue(&qdma->q_tx[i]);
> -		}
> -	}
> +	qdma->users--;
>  
>  	return 0;
>  }
> @@ -3413,8 +3456,12 @@ static int airoha_probe(struct platform_device *pdev)
>  	if (err)
>  		goto error_netdev_free;
>  
> -	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
> +	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++) {
>  		airoha_qdma_start_napi(&eth->qdma[i]);
> +		airoha_qdma_set(&eth->qdma[i], REG_QDMA_GLOBAL_CFG,
> +				GLOBAL_CFG_TX_DMA_EN_MASK |
> +				GLOBAL_CFG_RX_DMA_EN_MASK);
> +	}
>  
>  	for_each_child_of_node(pdev->dev.of_node, np) {
>  		if (!of_device_is_compatible(np, "airoha,eth-mac"))
> @@ -3440,6 +3487,8 @@ static int airoha_probe(struct platform_device *pdev)
>  	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
>  		airoha_qdma_stop_napi(&eth->qdma[i]);
>  
> +	airoha_hw_cleanup(eth);
> +
>  	for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
>  		struct airoha_gdm_port *port = eth->ports[i];
>  		int j;
> @@ -3461,7 +3510,6 @@ static int airoha_probe(struct platform_device *pdev)
>  		}
>  		airoha_metadata_dst_free(port);
>  	}
> -	airoha_hw_cleanup(eth);
>  error_netdev_free:
>  	free_netdev(eth->napi_dev);
>  	platform_set_drvdata(pdev, NULL);
> @@ -3477,6 +3525,8 @@ static void airoha_remove(struct platform_device *pdev)
>  	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
>  		airoha_qdma_stop_napi(&eth->qdma[i]);
>  
> +	airoha_hw_cleanup(eth);
> +
>  	for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
>  		struct airoha_gdm_port *port = eth->ports[i];
>  		int j;
> @@ -3497,7 +3547,6 @@ static void airoha_remove(struct platform_device *pdev)
>  		}
>  		airoha_metadata_dst_free(port);
>  	}
> -	airoha_hw_cleanup(eth);
>  
>  	free_netdev(eth->napi_dev);
>  	platform_set_drvdata(pdev, NULL);
> 
> ---
> base-commit: 7d8297e26b4e20b5d1c3c3fe51fe81a1c7fbc823
> change-id: 20260618-airoha-bql-fixes-f57b2d108573
> 
> Best regards,
> -- 
> Lorenzo Bianconi <lorenzo@kernel.org>
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Zhou, Yun @ 2026-06-18 15:45 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, netdev, linux-kernel
In-Reply-To: <20260618150440.cLDwgyDM@linutronix.de>



On 6/18/2026 11:04 PM, Sebastian Andrzej Siewior wrote:
> CAUTION: This email comes from a non Wind River email account!
> Do not click links or open attachments unless you recognize the sender and know the content is safe.
> 
> On 2026-06-18 21:56:12 [+0800], Zhou, Yun wrote:
>>
>> The enable_percpu_irq() from mvneta_poll is not "lost" — it never
>> gets a chance to execute. The sequence is:
>>
>> 1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
>> 2. PM freezes kthreads (on PREEMPT_RT, softirq runs in kthread)
> 
> But napi_schedule() runs in ksoftird, doesn't it? The per-CPU IRQs are
> not threaded. That does not look optimal.
> 
Correct, percpu IRQs are not threaded. net_rx_action (which calls 
mvneta_poll and ultimately enable_percpu_irq) runs in softirq context 
via ksoftirqd.

>> 3. NAPI poll cannot run → enable_percpu_irq() is never called
>> 4. mvneta_stop_dev → napi_disable(): cancels the scheduled poll
>>      but does NOT execute the completion path (no enable_percpu_irq)
> 
> napi_schedule() sets NAPIF_STATE_SCHED.
> napi_disable() sets NAPI_STATE_DISABLE and waits until NAPIF_STATE_SCHED
> is cleared. So if NAPIF_STATE_SCHED was set then enable_percpu_irq()
> will be invoked unless it leaves somewhere early.
> However, if DISABLED was already set then it disables the IRQ source but
> does not schedule NAPI. This is probably what happens.

Your analysis makes perfect sense, and that scenario is indeed much more
likely. It looks like I'll need to update the commit message accordingly.

>> 5. Resume → napi_enable(): resets NAPI state but MPIC stays masked
>>
>> The unconditional enable in resume covers this case. When NAPI was
>> idle at suspend time, the extra enable is harmless.
> 
> There is no desc::depth counting here, that got me confused. But that
> per-CPU irq is not optimal. Is this a SoC limitation or a design choice?
> Having a NAPI instance with IRQ per queue and those configured and
> spread among CPUs should cause less trouble and is what others do.
> In fact is the only net driver using per-CPU interrupts.
> 
It is a SoC limitation. Armada XP's MPIC provides a single shared
interrupt for the ethernet controller with per-CPU masking for
interrupt steering — there are no per-queue MSI vectors. The percpu
IRQ model was the only way to distribute interrupt handling across
CPUs given this hardware constraint.

Newer Marvell SoCs (armada3700+) moved away from this model and use
standard IRQs, which is why the armada3700 path does not have this
problem.

^ permalink raw reply

* Fw: [Bug 221665] New: tcp_ecn: doc says default is 2, but should be 5
From: Stephen Hemminger @ 2026-06-18 15:52 UTC (permalink / raw)
  To: netdev



Begin forwarded message:

Date: Thu, 18 Jun 2026 13:05:29 +0000
From: bugzilla-daemon@kernel.org
To: stephen@networkplumber.org
Subject: [Bug 221665] New: tcp_ecn: doc says default is 2, but should be 5


https://bugzilla.kernel.org/show_bug.cgi?id=221665

            Bug ID: 221665
           Summary: tcp_ecn: doc says default is 2, but should be 5
           Product: Networking
           Version: 2.5
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: IPV4
          Assignee: stephen@networkplumber.org
          Reporter: pedretti.fabio@gmail.com
                CC: corbet@lwn.net, pabeni@redhat.com
        Regression: No

Here: https://docs.kernel.org/networking/ip-sysctl.html#:~:text=tcp_ecn

tcp_ecn - INTEGER
...
Default: 2


However it looks like the default was recently changed:

"In previous kernels, the value of tcp_ecn is, by default, two, [...] The new
default value is five, [...]"

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ae3e8e6ceedfb3cf74ca18169c942e073586a39

Please address this.
Thanks.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Sebastian Andrzej Siewior @ 2026-06-18 15:53 UTC (permalink / raw)
  To: Zhou, Yun
  Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, netdev, linux-kernel
In-Reply-To: <3a4ba3da-47ee-4da7-b0da-af500ff1a369@windriver.com>

On 2026-06-18 23:45:51 [+0800], Zhou, Yun wrote:
> > Having a NAPI instance with IRQ per queue and those configured and
> > spread among CPUs should cause less trouble and is what others do.
> > In fact is the only net driver using per-CPU interrupts.
> > 
> It is a SoC limitation. Armada XP's MPIC provides a single shared
> interrupt for the ethernet controller with per-CPU masking for
> interrupt steering — there are no per-queue MSI vectors. The percpu
> IRQ model was the only way to distribute interrupt handling across
> CPUs given this hardware constraint.

Is this a hardware constraint or more a software design choice? From the
other comment it read like it could be changed.
There is nothing wrong to provide 4 interrupts for a device from the
device-tree and then allocate and request all four. This requires that
SMP affinity is supported properly in order to spread it across CPU. You
would also be able to reduce the amount of queues/ interrupts via
ethtool if you would like to isolate a CPU for NOHZ reasons.

Sebastian

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH iwl-net 1/2] ice: skip per-VLAN promisc rules when default VSI Rx rule is set
From: Loktionov, Aleksandr @ 2026-06-18 16:01 UTC (permalink / raw)
  To: Oros, Petr, netdev@vger.kernel.org
  Cc: Vecera, Ivan, Alice Michael, Kitszel, Przemyslaw, Eric Dumazet,
	linux-kernel@vger.kernel.org, Andrew Lunn, Nguyen, Anthony L,
	Michal Swiatkowski, Keller, Jacob E, Jakub Kicinski, Paolo Abeni,
	David S. Miller, intel-wired-lan@lists.osuosl.org
In-Reply-To: <89efbea9831175e6f57e9fe8557f7a0e48e050b7.1781786935.git.poros@redhat.com>



> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of Petr Oros
> Sent: Thursday, June 18, 2026 5:09 PM
> To: netdev@vger.kernel.org
> Cc: Vecera, Ivan <ivecera@redhat.com>; Alice Michael
> <alice.michael@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Eric Dumazet <edumazet@google.com>;
> linux-kernel@vger.kernel.org; Andrew Lunn <andrew+netdev@lunn.ch>;
> Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Michal Swiatkowski
> <michal.swiatkowski@linux.intel.com>; Keller, Jacob E
> <jacob.e.keller@intel.com>; Jakub Kicinski <kuba@kernel.org>; Paolo
> Abeni <pabeni@redhat.com>; David S. Miller <davem@davemloft.net>;
> intel-wired-lan@lists.osuosl.org
> Subject: [Intel-wired-lan] [PATCH iwl-net 1/2] ice: skip per-VLAN
> promisc rules when default VSI Rx rule is set
> 
> When an ice port is part of a vlan-filtering bridge with a wide VLAN
> trunk and the netdev is in IFF_PROMISC (typical for bond slaves
> attached to a bridge), the driver installs per-VLAN
> ICE_SW_LKUP_PROMISC_VLAN entries (recipe 9) in addition to the broad
> ICE_SW_LKUP_DFLT VSI Rx rule (recipe 5). Each per-VLAN rule consumes
> one Flow Lookup Unit (FLU) entry from a fixed hardware pool of "up to
> 32K FLU entries" per device, documented in the E810 datasheet
> (613875-009 section 7.8.10, Table 7-18, page 1015).
> 
> With three active PFs sharing one switch context and a bridge trunk of
> vid 2-4094, the configuration would require roughly
> 
>   3 PFs * 4093 VLANs * 3 rules per VLAN per PF ~= 36,800 rules
> 
> which exceeds the 32K FLU budget. Firmware then responds to further
> Add Switch Rules requests with AQ retval 0x10 (LIBIE_AQ_RC_ENOSPC) and
> the user-visible failure surfaces as
> 
>   ice 0000:5c:00.1: Failed to set VSI 14 as the default forwarding
>                     VSI, error -5
>   ice 0000:5c:00.1 ens1f1: Error -5 setting default VSI 14 Rx rule
> 
> After a switch context has been driven into overrun, subsequent
> retries can come back as AQ retval 0x2 (LIBIE_AQ_RC_ENOENT), which has
> misled triage attempts toward a perceived recipe binding defect rather
> than a capacity issue.
> 
> When the DFLT VSI Rx rule is in place it catches every packet on the
> lport regardless of VLAN tag, so the per-VLAN PROMISC_VLAN expansion
> is redundant. The recipe 4 VLAN prune entries are still installed per
> VLAN and continue to track the allowed VID set, but the IFF_PROMISC
> sync path disables their enforcement on the VSI via
> vlan_ops->dis_rx_filtering() before ice_set_promisc() runs.
> ena_rx_filtering() is restored when IFF_PROMISC is cleared.
> 
> Skip the per-VLAN expansion at the two call sites that drive it:
> ice_set_promisc() falls through to ice_fltr_set_vsi_promisc() and
> ice_vlan_rx_add_vid() omits the per-VLAN ICE_MCAST_VLAN_PROMISC_BITS
> add. Plain IFF_ALLMULTI without an installed DFLT VSI rule is
> unchanged and still installs per-VLAN multicast promisc rules.
> 
> Both checks use ice_is_vsi_dflt_vsi() which inspects the recipe filter
> list for an installed DFLT rule on this VSI, not
> netdev->flags & IFF_PROMISC. The HW-state predicate avoids two
> regression vectors that a user-intent predicate would introduce:
> 
> 1. ice_lag_is_switchdev_running() short-circuits ice_set_dflt_vsi()
>    to return 0 without installing the DFLT rule for a PF in
>    switchdev LAG mode. An IFF_PROMISC-only check would also
>    suppress the per-VLAN fallback, leaving the PF with no rule.
> 
> 2. When ice_set_dflt_vsi() returns a non-EEXIST error (FLU
>    exhausted, switch context divergence), the driver clears
>    IFF_PROMISC from vsi->current_netdev_flags but the netdev's own
>    flags retain IFF_PROMISC. The user-intent predicate would still
>    suppress the per-VLAN fallback even though DFLT failed to
>    install.
> 
> The predicate is install-time only. The IFF_PROMISC off path closes
> the lifecycle gap in ice_vsi_exit_dflt_promisc(): for an IFF_ALLMULTI
> VSI with VLANs it reinstates the per-VID rules before clearing the
> default rule, so multicast coverage never lapses. If that AQ call
> fails the default rule is left in place, ice_vsi_exit_dflt_promisc()
> returns the error, and the sync_fltr pass bails with
> vsi->current_netdev_flags |= IFF_PROMISC; the current/netdev flag
> mismatch re-fires the IFF_PROMISC off path on the next sync. Clearing
> the default rule first would instead expose a window where neither the
> default rule nor the per-VID rules carry multicast.
> 
> If ice_clear_dflt_vsi() fails after the per-VID rules were reinstated
> they are deliberately not rolled back. Clearing the default rule is a
> removal that frees an FLU entry rather than allocating one, so it
> cannot fail for lack of space; a failure is a transient AdminQ error.
> The per-VID rules are the steady state for an IFF_ALLMULTI VLAN VSI,
> so the only redundant entry left behind is the single un-removed
> default rule, not the per-VID set. The retry re-enters this path,
> ice_fltr_set_vlan_vsi_promisc() returns -EEXIST for the rules that
> already exist so nothing is reallocated, and the default rule is
> removed on the next attempt. Rolling the per-VID rules back here would
> instead churn thousands of removes and re-adds on every retry.
> 
> After the default rule is gone the vid=0 PROMISC rule that paired with
> it is redundant and is dropped, but only to reclaim a filter entry, so
> a failure there is logged and does not abort the transition.
> 
> ice_set_vsi_promisc() and ice_clear_vsi_promisc() dispatch the recipe
> based on whether ICE_PROMISC_VLAN_RX/TX bits are present in the mask:
> with the bits set, recipe ICE_SW_LKUP_PROMISC_VLAN is used; otherwise
> ICE_SW_LKUP_PROMISC. The else branch in
> ice_set_promisc() installs the vid=0 rule in ICE_SW_LKUP_PROMISC.
> Because ice_clear_promisc() with VLANs present adds the VLAN bits and
> would search ICE_SW_LKUP_PROMISC_VLAN, the recipe mismatch would leave
> the vid=0 ICE_SW_LKUP_PROMISC rule orphaned when VLANs are configured.
> This is a single stale rule, not a per-cycle leak:
> re-adding it on the next promisc on returns -EEXIST rather than
> allocating a new entry. The set-time recipe is not recorded, so
> ice_clear_promisc() clears both recipes; clearing a rule that is not
> present succeeds, both clears run unconditionally, and the first error
> is returned.
> 
> The two VLAN-0 recipe transition blocks in ice_vlan_rx_add_vid() and
> ice_vlan_rx_kill_vid() that promote / demote the vid=0 rule between
> ICE_SW_LKUP_PROMISC and ICE_SW_LKUP_PROMISC_VLAN are likewise guarded
> by !ice_is_vsi_dflt_vsi(). With DFLT in place the
> vid=0 rule already covers every VID and a recipe swap would only
> install a redundant rule.
> 
> Lab reproduction on an E810-C with the same firmware family (4.80, NVM
> 1.3805.0, DDP 1.3.43.0) using four PFs in vlan-filtering bridges with
> vid 2-4094 and the slaves brought to IFF_PROMISC before the bridge
> VLAN bulk add:
> 
>   before fix:  ~12,279 AQ Add Switch Rules per PF, ENOSPC and ENOENT
>                responses in dmesg, DFLT VSI Rx rule install fails on
>                the affected PF
>   after fix:   ~4,093 AQ Add Switch Rules per PF, no AQ errors, DFLT
>                VSI Rx rule installs on every PF
> 
> The 66.7% reduction in installed switch rules per PF matches the
> expected per-VLAN saving: a single DFLT rule replaces the per-VID
> PROMISC_VLAN expansion.
> 
> Functional regression test with vid 2-100 trunk between two ice ports
> through the lab switch (40/40 PASS, 0 AQ errors, 0 ENOSPC at 4093-VID
> customer scale):
> 
>   vid 50 unicast, vid 100 unicast, vid 50 broadcast ARP,
>     vid 100 multicast IPv6 ND
>   vid 200/500/1500/4000 isolation (out-of-trunk) and untagged not
>     leaked: 0 packets reach any bridge endpoint
>   IGMP/MLD snooping, Jumbo MTU 9000, reserved-multicast STP BPDU
>   IFF_PROMISC + IFF_ALLMULTI transition (off while allmulti stays)
>   Regression reproducer for commit 1273f89578f2 ("ice: Fix broken
>     IFF_ALLMULTI handling"): allmulti on -> add vid -> allmulti off
>     -> allmulti on plus the orphan-rule Scenario 2; both converge
>     with no stale rules
>   100-VID, 1000-VID, 4093-VID stress cycles (5/3/2 iterations each)
>   switchdev mode toggle preserves IFF_PROMISC pruning state across
>     the session (vid 999 multicast received before and after the
>     legacy -> switchdev -> legacy cycle)
>   SR-IOV: VFs unaffected because ice_set_promisc() early-returns
>     for non-PF VSI and VF representors do not register
>     ndo_vlan_rx_add_vid
> 
> Fixes: 1273f89578f2 ("ice: Fix broken IFF_ALLMULTI handling")
> Signed-off-by: Petr Oros <poros@redhat.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_main.c | 90 ++++++++++++++++++----
> -
>  1 file changed, 70 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c
> b/drivers/net/ethernet/intel/ice/ice_main.c
> index 6d24056c247cf4..af8df81fc45623 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -274,7 +274,8 @@ static int ice_set_promisc(struct ice_vsi *vsi, u8
> promisc_m)
>  	if (vsi->type != ICE_VSI_PF)
>  		return 0;
> 
> -	if (ice_vsi_has_non_zero_vlans(vsi)) {
> +	/* skip per-VID expansion; the DFLT Rx rule already covers
> every VID */
> +	if (ice_vsi_has_non_zero_vlans(vsi) &&
> !ice_is_vsi_dflt_vsi(vsi)) {
>  		promisc_m |= (ICE_PROMISC_VLAN_RX |
> ICE_PROMISC_VLAN_TX);
>  		status = ice_fltr_set_vlan_vsi_promisc(&vsi->back->hw,
> vsi,
>  						       promisc_m);
> @@ -304,9 +305,19 @@ static int ice_clear_promisc(struct ice_vsi *vsi,
> u8 promisc_m)
>  		return 0;
> 
>  	if (ice_vsi_has_non_zero_vlans(vsi)) {

...

>  			ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi-
> >idx,
> 
> ICE_MCAST_VLAN_PROMISC_BITS,
>  						   0);
> --
> 2.53.0

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH iwl-net 2/2] ice: preserve uplink DFLT Rx rule on switchdev release
From: Loktionov, Aleksandr @ 2026-06-18 16:02 UTC (permalink / raw)
  To: Oros, Petr, netdev@vger.kernel.org
  Cc: Vecera, Ivan, Alice Michael, Kitszel, Przemyslaw, Eric Dumazet,
	linux-kernel@vger.kernel.org, Andrew Lunn, Nguyen, Anthony L,
	Michal Swiatkowski, Keller, Jacob E, Jakub Kicinski, Paolo Abeni,
	David S. Miller, intel-wired-lan@lists.osuosl.org
In-Reply-To: <deef5756e534ef06c12d910c5305d3fd205d30a0.1781786935.git.poros@redhat.com>



> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of Petr Oros
> Sent: Thursday, June 18, 2026 5:09 PM
> To: netdev@vger.kernel.org
> Cc: Vecera, Ivan <ivecera@redhat.com>; Alice Michael
> <alice.michael@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Eric Dumazet <edumazet@google.com>;
> linux-kernel@vger.kernel.org; Andrew Lunn <andrew+netdev@lunn.ch>;
> Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Michal Swiatkowski
> <michal.swiatkowski@linux.intel.com>; Keller, Jacob E
> <jacob.e.keller@intel.com>; Jakub Kicinski <kuba@kernel.org>; Paolo
> Abeni <pabeni@redhat.com>; David S. Miller <davem@davemloft.net>;
> intel-wired-lan@lists.osuosl.org
> Subject: [Intel-wired-lan] [PATCH iwl-net 2/2] ice: preserve uplink
> DFLT Rx rule on switchdev release
> 
> ice_eswitch_setup_env() calls ice_set_dflt_vsi() to install the
> ICE_SW_LKUP_DFLT Rx rule on the uplink VSI. The helper returns 0 even
> when the rule is already in place, so the call is a no-op if
> ice_vsi_sync_fltr() had previously installed the DFLT rule in response
> to IFF_PROMISC on the uplink netdev. ice_remove_vsi_fltr() called
> earlier in ice_eswitch_setup_env() does not affect this rule because
> ice_remove_vsi_lkup_fltr() lacks a case for ICE_SW_LKUP_DFLT and falls
> into its default branch which only logs. Switchdev mode then adds an
> ICE_FLTR_TX leg via ice_cfg_dflt_vsi() on the same VSI handle.
> 
> ice_eswitch_release_env() unconditionally removed both the Rx and Tx
> DFLT rules. When the Rx DFLT was installed by ice_vsi_sync_fltr()
> before the switchdev session started, this clobbered promisc state the
> operator had asked for: the DFLT Rx rule disappeared while IFF_PROMISC
> was still set on the netdev, and the IFF_PROMISC sync path was not
> retriggered, so the uplink ended the session without the catch-all
> rule the netdev flags requested.
> 
> Skip the Rx DFLT removal when the uplink is still promiscuous, both in
> ice_eswitch_release_env() and in the err_def_tx unwind of
> ice_eswitch_setup_env(). The Tx leg installed by switchdev is always
> removed since switchdev owns it.
> 
> The ena_rx_filtering() call earlier in ice_eswitch_release_env() is
> left unconditional: it calls ice_cfg_vlan_pruning(), which returns
> without enabling pruning while the netdev is in IFF_PROMISC, so it
> cannot re-enable VLAN pruning under the preserved DFLT rule and drop
> tagged traffic. Pruning is re-enabled later, when the IFF_PROMISC sync
> path runs after promisc is actually cleared.
> 
> Use vsi->current_netdev_flags rather than the live netdev->flags for
> this test. netdev->flags is written under RTNL by dev_change_flags(),
> while ice_eswitch_release_env() runs under devl_lock, so reading it
> here would be a TOCTOU against a concurrent promisc change. The
> IFF_PROMISC bit of current_netdev_flags is written only under
> ICE_CFG_BUSY by ice_vsi_sync_fltr(), and ice_set_rx_mode() gates that
> sync off for the uplink while ice_is_switchdev_running() is true. The
> bit is therefore frozen for the whole session and stable when
> release_env reads it.
> 
> Because the sync is gated off during the session, a promisc change the
> operator makes while switchdev runs never reaches ice_vsi_sync_fltr():
> current_netdev_flags keeps the value captured before the session while
> netdev->flags carries the new one. Once switchdev is torn down and
> pf->eswitch.is_running is cleared, schedule a filter sync from
> ice_eswitch_disable_switchdev() so the suppressed change is replayed
> and the DFLT Rx rule is reconciled with the current netdev flags. This
> also closes the window where release_env kept the rule based on the
> frozen flag but the operator had since cleared IFF_PROMISC.
> 
> Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment")
> Signed-off-by: Petr Oros <poros@redhat.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_eswitch.c | 32 +++++++++++++++++--
> -
>  1 file changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> index 2e4f0969035f77..b6073fc2375019 100644
> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> @@ -66,8 +66,10 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
>  	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
>  			 ICE_FLTR_TX);
>  err_def_tx:
> -	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
> -			 ICE_FLTR_RX);
> +	/* keep the Rx DFLT rule if still promiscuous (see release_env)
> */
> +	if (!(uplink_vsi->current_netdev_flags & IFF_PROMISC))
> +		ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
> +				 false, ICE_FLTR_RX);
>  err_def_rx:
>  	ice_vsi_del_vlan_zero(uplink_vsi);
>  err_vlan_zero:
> @@ -275,11 +277,23 @@ static void ice_eswitch_release_env(struct
> ice_pf *pf)
>  	vlan_ops = ice_get_compat_vsi_vlan_ops(uplink_vsi);
> 
>  	ice_vsi_update_local_lb(uplink_vsi, false);
> +	/* No-op while IFF_PROMISC is set: ice_cfg_vlan_pruning() self-
> gates on
> +	 * it, so this cannot re-enable VLAN pruning under a preserved
> DFLT rule.
> +	 */
>  	vlan_ops->ena_rx_filtering(uplink_vsi);
>  	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
>  			 ICE_FLTR_TX);
> -	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
> -			 ICE_FLTR_RX);
> +
> +	/* Keep the Rx DFLT rule if the uplink is still promiscuous; it
> must
> +	 * outlive the session. current_netdev_flags is used because
> its
> +	 * IFF_PROMISC bit only changes under ice_vsi_sync_fltr(),
> gated off
> +	 * during switchdev, so the read cannot race the RTNL netdev-
> >flags.
> +	 * Any change made during the session is replayed on teardown.
> +	 */
> +	if (!(uplink_vsi->current_netdev_flags & IFF_PROMISC))
> +		ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
> +				 false, ICE_FLTR_RX);
> +
>  	ice_fltr_add_mac_and_broadcast(uplink_vsi,
>  				       uplink_vsi->port_info-
> >mac.perm_addr,
>  				       ICE_FWD_TO_VSI);
> @@ -327,10 +341,20 @@ static int ice_eswitch_enable_switchdev(struct
> ice_pf *pf)
>   */
>  static void ice_eswitch_disable_switchdev(struct ice_pf *pf)  {
> +	struct ice_vsi *uplink_vsi = pf->eswitch.uplink_vsi;
> +
>  	ice_eswitch_br_offloads_deinit(pf);
>  	ice_eswitch_release_env(pf);
> 
>  	pf->eswitch.is_running = false;
> +
> +	/* ice_set_rx_mode() was gated off during the session; replay a
> filter
> +	 * sync so any suppressed promisc change reconciles the DFLT Rx
> rule.
> +	 */
> +	set_bit(ICE_VSI_UMAC_FLTR_CHANGED, uplink_vsi->state);
> +	set_bit(ICE_VSI_MMAC_FLTR_CHANGED, uplink_vsi->state);
> +	set_bit(ICE_FLAG_FLTR_SYNC, pf->flags);
> +	ice_service_task_schedule(pf);
>  }
> 
>  /**
> --
> 2.53.0

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH net v2] ice: eswitch: fix use-after-free of metadata_dst in repr release
From: Loktionov, Aleksandr @ 2026-06-18 16:02 UTC (permalink / raw)
  To: Doruk Tan Ozturk, Nguyen, Anthony L, Kitszel, Przemyslaw,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com
  Cc: michal.swiatkowski@linux.intel.com, Drewek, Wojciech,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	horms@kernel.org
In-Reply-To: <20260618145003.47471-1-doruk@0sec.ai>



> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of Doruk Tan Ozturk
> Sent: Thursday, June 18, 2026 4:50 PM
> To: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel,
> Przemyslaw <przemyslaw.kitszel@intel.com>; andrew+netdev@lunn.ch;
> davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com
> Cc: michal.swiatkowski@linux.intel.com; Drewek, Wojciech
> <wojciech.drewek@intel.com>; intel-wired-lan@lists.osuosl.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> stable@vger.kernel.org; horms@kernel.org
> Subject: [Intel-wired-lan] [PATCH net v2] ice: eswitch: fix use-after-
> free of metadata_dst in repr release
> 
> ice_eswitch_release_repr() frees the port representor metadata_dst via
> metadata_dst_free(), which directly kfree()s the object and ignores
> the dst_entry refcount. The eswitch slow-path TX routine
> ice_eswitch_port_start_xmit() takes a reference on this dst with
> dst_hold() and attaches it to the skb via skb_dst_set(). If such an
> skb is still in flight (e.g. queued in a qdisc) when the representor
> is torn down, the metadata_dst is freed while the skb still points at
> it. When the skb is later freed, dst_release() operates on already-
> freed memory.
> 
> Replace metadata_dst_free() with dst_release() so the metadata_dst is
> freed only after the last reference is dropped. The dst subsystem
> frees metadata_dst objects from dst_destroy() once the refcount
> reaches zero (DST_METADATA is set by metadata_dst_alloc()).
> 
> Same class of bug and fix as commit c32b26aaa2f9 ("netfilter:
> nft_tunnel: fix use-after-free on object destroy").
> 
> Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment")
> Cc: stable@vger.kernel.org
> Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
> Reviewed-by: Simon Horman <horms@kernel.org>
> ---
> v2:
>  - Correct the Fixes: tag to 1a1c40df2e80 ("ice: set and release
>    switchdev environment"); the previously cited fff292b47ac1 only
> moved
>    the affected code rather than introducing the unbalanced free, and
> the
>    bug dates back to when switchdev support was added (Simon Horman).
>  - Add Simon Horman's Reviewed-by. No functional change.
> 
>  drivers/net/ethernet/intel/ice/ice_eswitch.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> index 2e4f0969035f..41b30a7ca4a9 100644
> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> @@ -95,7 +95,7 @@ ice_eswitch_release_repr(struct ice_pf *pf, struct
> ice_repr *repr)
>  		return;
> 
>  	ice_vsi_update_security(vsi, ice_vsi_ctx_set_antispoof);
> -	metadata_dst_free(repr->dst);
> +	dst_release(&repr->dst->dst);
>  	repr->dst = NULL;
>  	ice_fltr_add_mac_and_broadcast(vsi, repr->parent_mac,
>  				       ICE_FWD_TO_VSI);
> --
> 2.43.0

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

^ permalink raw reply

* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Dust Li @ 2026-06-18 16:03 UTC (permalink / raw)
  To: hexlabsecurity, Wenjia Zhang, D. Wythe, Sidraya Jayagond
  Cc: Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
	Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
	Paolo Abeni, linux-kernel, linux-rdma, Jakub Kicinski, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-2-551fa514257e@proton.me>

On 2026-06-14 03:23:31, Bryam Vargas via B4 Relay wrote:
>From: Bryam Vargas <hexlabsecurity@proton.me>
>
>conn->bytes_to_rcv is accumulated in the receive tasklet from the peer's
>producer cursor:
>
>	diff_prod = smc_curs_diff(rmb_desc->len, &prod_old, &prod_new);
>	atomic_add(diff_prod, &conn->bytes_to_rcv);
>
>smc_curs_diff()'s differing-wrap branch returns (size - old.count) +
>new.count, which exceeds rmb_desc->len for a forged producer cursor and
>accumulates across CDC messages, so bytes_to_rcv can grow past the RMB
>(and across many messages can overflow the signed counter negative).
>smc_rx_recvmsg() reads it as the number of readable bytes and performs a
>wrap-around copy whose second chunk (copylen - first_chunk, read from
>ring offset 0) is never re-bounded to rmb_desc->len, reading past the
>RMB into adjacent kernel memory and disclosing it to the peer.

Hi Bryam,

Once we validate the CDC message at the input boundary (as in the
previous patch), bytes_to_rcv can never exceed rmb_desc->len, so
this check becomes unreachable. So I don't think this patch is needed.

Best regards,
Dust


>
>Bound the readable count to rmb_desc->len where it is consumed, treating
>a negative (sign-overflowed) value as out of range too, so the copy
>length can never exceed the ring.  This enforces the documented
>0 <= bytes_to_rcv <= rmb_desc->len invariant at the consumer, where it
>is race-free against the producer update that runs in the receive
>tasklet.
>
>Fixes: 952310ccf2d8 ("smc: receive data from RMBE")
>Cc: stable@vger.kernel.org
>Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>---
> net/smc/smc_rx.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
>diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
>index c1d9b923938d..f461cf10b085 100644
>--- a/net/smc/smc_rx.c
>+++ b/net/smc/smc_rx.c
>@@ -442,6 +442,18 @@ int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg,
> 		/* initialize variables for 1st iteration of subsequent loop */
> 		/* could be just 1 byte, even after waiting on data above */
> 		readable = smc_rx_data_available(conn, peeked_bytes);
>+		/* bytes_to_rcv is accumulated from the peer's wire-controlled
>+		 * producer cursor; a forged cursor can drive it past the RMB,
>+		 * or overflow the signed accumulator to a negative value across
>+		 * many CDC messages (which a plain "> len" check would miss
>+		 * before the size_t cast below turns it huge).  Bound it to the
>+		 * RMB in either case so the wrap-around copy cannot run past
>+		 * rmb_desc->len.  This enforces the documented
>+		 * 0 <= bytes_to_rcv <= rmb_desc->len invariant at the consumer,
>+		 * race-free against the producer update in the receive tasklet.
>+		 */
>+		if (readable < 0 || readable > conn->rmb_desc->len)
>+			readable = conn->rmb_desc->len;
> 		splbytes = atomic_read(&conn->splice_pending);
> 		if (!readable || (msg && splbytes)) {
> 			if (splbytes)
>
>-- 
>2.43.0
>

^ permalink raw reply

* Re: [PATCH v3 3/3] net/smc: bound the send length to the send buffer in smc_tx_sendmsg()
From: Dust Li @ 2026-06-18 16:08 UTC (permalink / raw)
  To: hexlabsecurity, Wenjia Zhang, D. Wythe, Sidraya Jayagond
  Cc: Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
	Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
	Paolo Abeni, linux-kernel, linux-rdma, Jakub Kicinski, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-3-551fa514257e@proton.me>

On 2026-06-14 03:23:32, Bryam Vargas via B4 Relay wrote:
>From: Bryam Vargas <hexlabsecurity@proton.me>
>
>On the SMC-D DMB-merge (nocopy) path, smc_cdc_msg_recv_action() advances
>conn->sndbuf_space from the peer's consumer cursor:
>
>	diff_tx = smc_curs_diff(sndbuf_desc->len, &tx_curs_fin,
>				&local_rx_ctrl.cons);
>	atomic_add(diff_tx, &conn->sndbuf_space);
>
>The consumer cursor is wire-controlled and unvalidated, and
>smc_curs_diff()'s differing-wrap branch can return more than
>sndbuf_desc->len, so a forged cursor drives sndbuf_space past the send
>buffer (and across many CDC messages can overflow the signed counter
>negative).  smc_tx_sendmsg() reads it as the available write space and
>performs a wrap-around copy whose second chunk (copylen - first_chunk,
>written at ring offset 0) is never re-bounded to sndbuf_desc->len,
>writing user data past the send buffer -- a heap out-of-bounds write of
>attacker-influenced length and content.

Hi Bryam,

I think this is the same as patch #2.

Best regards,
Dust


>
>Bound the write space to sndbuf_desc->len where it is consumed, treating
>a negative (sign-overflowed) value as out of range too, so the copy
>length can never exceed the ring.  This enforces the documented
>0 <= sndbuf_space <= sndbuf_desc->len invariant at the producer,
>race-free against the CDC tasklet that advances sndbuf_space.
>
>Fixes: cc0ab806fc52 ("net/smc: adapt cursor update when sndbuf and peer DMB are merged")
>Cc: stable@vger.kernel.org
>Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>---
> net/smc/smc_tx.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
>diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
>index 3144b4b1fe29..5916f02060fb 100644
>--- a/net/smc/smc_tx.c
>+++ b/net/smc/smc_tx.c
>@@ -233,6 +233,19 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
> 		/* initialize variables for 1st iteration of subsequent loop */
> 		/* could be just 1 byte, even after smc_tx_wait above */
> 		writespace = atomic_read(&conn->sndbuf_space);
>+		/* sndbuf_space is advanced from the peer's wire-controlled
>+		 * consumer cursor on the SMC-D DMB-merge path; a forged cursor
>+		 * can inflate it past the send buffer, or overflow the signed
>+		 * accumulator to a negative value across many CDC messages
>+		 * (which a plain "> len" check would miss before the size_t
>+		 * cast below turns it huge).  Bound it to the send buffer in
>+		 * either case so the wrap-around write cannot run past
>+		 * sndbuf_desc->len.  This enforces the documented
>+		 * 0 <= sndbuf_space <= sndbuf_desc->len invariant at the
>+		 * producer, race-free against the CDC tasklet.
>+		 */
>+		if (writespace < 0 || writespace > conn->sndbuf_desc->len)
>+			writespace = conn->sndbuf_desc->len;
> 		/* not more than what user space asked for */
> 		copylen = min_t(size_t, send_remaining, writespace);
> 		/* determine start of sndbuf */
>
>-- 
>2.43.0
>

^ permalink raw reply

* [PATCH net 0/6] ipv6: fix sysctl error handling and missing notifications
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera

While working on a different IPv6 patch series I have spotted multiple
minor bugs around sysctl error handling and notifications. In general,
they are not serious issues.

In addition, there is one more issue in forwarding sysctl as it does not
check for CAP_NET_ADMIN for the namespace. I am keeping that patch out
of this series and I am aiming it at the net-next tree once it re-opens.

Fernando Fernandez Mancera (6):
  ipv6: fix error handling in disable_ipv6 sysctl
  ipv6: fix error handling in ignore_routes_with_linkdown sysctl
  ipv6: fix error handling in forwarding sysctl
  ipv6: fix error handling in disable_policy sysctl
  ipv6: reset value and position for proxy_ndp sysctl restart
  ipv6: fix missing notification for ignore_routes_with_linkdown

 net/ipv6/addrconf.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

-- 
2.54.0

^ permalink raw reply

* [PATCH net 1/6] ipv6: fix error handling in disable_ipv6 sysctl
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

When writing to the disable_ipv6 sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is overwriting that error for write operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: 56d417b12e57 ("IPv6: Add 'autoconf' and 'disable_ipv6' module parameters")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1f21ccb55caa..c901b444a995 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6467,6 +6467,8 @@ static int addrconf_sysctl_disable(const struct ctl_table *ctl, int write,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_disable_ipv6(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

* [PATCH net 4/6] ipv6: fix error handling in disable_policy sysctl
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

When writing to the disable_policy sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is resetting the position argument even if an error
occurred during proc_dointvec() and not only during sysctl restart.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 0b128b6cff60..8ff015975e27 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6769,6 +6769,8 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write
 	lctl = *ctl;
 	lctl.data = &val;
 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;
 
 	if (write && (*valp != val))
 		ret = addrconf_disable_policy(ctl, valp, val);
-- 
2.54.0


^ permalink raw reply related

* [PATCH net 2/6] ipv6: fix error handling in ignore_routes_with_linkdown sysctl
From: Fernando Fernandez Mancera @ 2026-06-18 16:22 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, shemminger, dforster, gospo, ddutt, brian.haley,
	horms, pabeni, kuba, edumazet, davem, idosch, dsahern,
	Fernando Fernandez Mancera
In-Reply-To: <20260618162225.4588-1-fmancera@suse.de>

When writing to the ignore_routes_with_linkdown sysctl, if
proc_dointvec() fails to parse the input, it returns a negative error
code. The current implementation is overwriting that error for write
operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c901b444a995..70058d971205 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6671,6 +6671,8 @@ int addrconf_sysctl_ignore_routes_with_linkdown(const struct ctl_table *ctl,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_fixup_linkdown(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox