* [PATCH net] net: dsa: realtek: fix memory leak in rtl8366rb_setup_led()
From: David Yang @ 2026-06-18 14:01 UTC (permalink / raw)
To: netdev
Cc: David Yang, Linus Walleij, Alvin Šipraga, Andrew Lunn,
Vladimir Oltean, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Luiz Angelo Daros de Luca, linux-kernel
led_classdev_register_ext() only reads init_data.devicename - it never
stores the pointer. However, the caller allocated devicename with
kasprintf() but never freed it, leaking the string memory.
Fix it with a stack buffer to avoid dynamic buffers completely.
Fixes: 32d617005475 ("net: dsa: realtek: add LED drivers for rtl8366rb")
Signed-off-by: David Yang <mmyangfl@gmail.com>
---
drivers/net/dsa/realtek/rtl8366rb-leds.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/dsa/realtek/rtl8366rb-leds.c b/drivers/net/dsa/realtek/rtl8366rb-leds.c
index 509ffd3f8db5..ba50d311cb15 100644
--- a/drivers/net/dsa/realtek/rtl8366rb-leds.c
+++ b/drivers/net/dsa/realtek/rtl8366rb-leds.c
@@ -89,6 +89,7 @@ static int rtl8366rb_setup_led(struct realtek_priv *priv, struct dsa_port *dp,
struct led_init_data init_data = { };
enum led_default_state state;
struct rtl8366rb_led *led;
+ char name[64];
u32 led_group;
int ret;
@@ -129,10 +130,9 @@ static int rtl8366rb_setup_led(struct realtek_priv *priv, struct dsa_port *dp,
init_data.fwnode = led_fwnode;
init_data.devname_mandatory = true;
- init_data.devicename = kasprintf(GFP_KERNEL, "Realtek-%d:0%d:%d",
- dp->ds->index, dp->index, led_group);
- if (!init_data.devicename)
- return -ENOMEM;
+ snprintf(name, sizeof(name), "Realtek-%d:0%d:%d",
+ dp->ds->index, dp->index, led_group);
+ init_data.devicename = name;
ret = devm_led_classdev_register_ext(priv->dev, &led->cdev, &init_data);
if (ret) {
--
2.53.0
^ permalink raw reply related
* Re: [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Runyu Xiao @ 2026-06-18 14:16 UTC (permalink / raw)
To: Mahanta Jambigi, D. Wythe, Dust Li, Sidraya Jayagond,
Wenjia Zhang, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Tony Lu, Wen Gu, Simon Horman, Karsten Graul, linux-rdma,
linux-s390, netdev, linux-kernel, jianhao.xu, runyu.xiao
In-Reply-To: <f7e36176-d00a-4471-94ed-d385e579b43d@linux.ibm.com>
Hi,
Thanks for taking a look.
The exact Lockdep stack I have is from the grounded reproducer, not from
a production SMC setup. The reproducer keeps the same callback shape:
the close/flush side holds sk_callback_lock and invokes the installed
sk_data_ready callback, which re-enters smc_clcsock_data_ready() and tries
to take sk_callback_lock again.
The relevant Lockdep report is:
WARNING: possible recursive locking detected
kworker/u4:3/39 is trying to acquire lock:
(sk_callback_lock) at smc_clcsock_data_ready+0xa/0x4d
but task is already holding lock:
(sk_callback_lock) at smc_close_flush_work+0xc/0x30
Possible unsafe locking scenario:
CPU0
----
lock(sk_callback_lock);
lock(sk_callback_lock);
*** DEADLOCK ***
Workqueue: smc_close_wq smc_close_flush_work
Call Trace:
dump_stack_lvl
__lock_acquire
lock_acquire
_raw_read_lock_bh
smc_clcsock_data_ready+0xa/0x4d
smc_close_flush_work+0x1f/0x30
process_one_work
worker_thread
kthread
ret_from_fork
The nvmet change I referred to is:
2fa8961d3a6a ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")
The stable/backport patch I originally used as the reference is:
1c90f930e7b4 ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")
Its commit message says that when the socket is closed while in
TCP_LISTEN, the flush callback can call nvmet_tcp_listen_data_ready()
with sk_callback_lock already held, so nvmet moved the TCP_LISTEN check
before taking sk_callback_lock.
For the TCP_LISTEN check: my reasoning was that smc_clcsock_data_ready()
is installed by smc_listen() on the underlying TCP listen socket and only
queues smc_tcp_listen_work() for the SMC listen/accept path. Once that
underlying socket is no longer in TCP_LISTEN, there should be no SMC
listen accept work to queue from this callback. TCP_SYN_RECV and
TCP_ESTABLISHED are not listen-socket states for this callback path, so I
did not intend the callback to queue listen work for those states.
That said, if SMC expects smc_clcsock_data_ready() to handle a non-LISTEN
state during fallback or another transition, then the proposed check is
too strict and I should rework the fix.
Thanks,
Runyu
^ permalink raw reply
* [PATCH iwl-next v2] ixgbe: Implement PCI reset handler
From: Sergey Temerkhanov @ 2026-06-18 14:22 UTC (permalink / raw)
To: intel-wired-lan; +Cc: netdev, pmenzel
Implement PCI device reset handler to allow the network device to
get re-initialized and function after a PCI-level reset.
This is necessary for the adapter to avoid TX queue timeouts
occurring when the PCI reset is initiated via sysfs during
the operation
Signed-off-by: Sergey Temerkhanov <sergey.temerkhanov@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
Previous version: https://lore.kernel.org/netdev/MW4PR11MB6864BC9CA84F060AF7E0248480E42@MW4PR11MB6864.namprd11.prod.outlook.com/
v1->v2 changes: Rearranged the order of operations, switched to poll_timeout_us() macro
drivers/net/ethernet/intel/ixgbe/ixgbe.h | 1 +
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 82 +++++++++++++++++++
2 files changed, 83 insertions(+)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 594ccb28da20..c4b0c5bb89c6 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -912,6 +912,7 @@ enum ixgbe_state_t {
__IXGBE_PTP_TX_IN_PROGRESS,
__IXGBE_RESET_REQUESTED,
__IXGBE_PHY_INIT_COMPLETE,
+ __IXGBE_PCIE_RESET_IN_PROGRESS,
};
struct ixgbe_cb {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2ac274c73d61..0fb64aef223e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -12352,6 +12352,86 @@ static pci_ers_result_t ixgbe_io_slot_reset(struct pci_dev *pdev)
return result;
}
+/* 1500 us poll interval */
+#define IXGBE_RESET_PREP_POLL_INTERVAL_US 1500
+/* 2 second timeout to acquire reset lock before proceeding */
+#define IXGBE_RESET_PREP_TIMEOUT_US 2000000
+
+/**
+ * ixgbe_reset_prep - called before the pci bus is reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Prepare the card for a reset, preventing the service task from running.
+ */
+static void ixgbe_reset_prep(struct pci_dev *pdev)
+{
+ struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
+
+ if (!adapter)
+ return;
+
+ if (poll_timeout_us(test_and_set_bit(__IXGBE_RESETTING, &adapter->state),
+ test_bit(__IXGBE_RESETTING, &adapter->state),
+ IXGBE_RESET_PREP_POLL_INTERVAL_US,
+ IXGBE_RESET_PREP_TIMEOUT_US, false)) {
+ /* ixgbe_reset_done() will exit early if this happens.
+ * A retry will be needed
+ */
+ e_err(drv, "Timed out waiting for __IXGBE_RESETTING to be released. Reset is needed\n");
+ return;
+ }
+
+ /* Sync __IXGBE_RESETTING */
+ smp_mb__after_atomic();
+
+ if (test_bit(__IXGBE_SERVICE_INITED, &adapter->state)) {
+ /* Prevent the service task from being requeued in the timer callback */
+ timer_delete_sync(&adapter->service_timer);
+ /* Cancel any possibly queued service task */
+ cancel_work_sync(&adapter->service_task);
+ }
+
+ pci_clear_master(pdev);
+
+ set_bit(__IXGBE_PCIE_RESET_IN_PROGRESS, &adapter->state);
+}
+
+/**
+ * ixgbe_reset_done - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Allow the service task to run and schedule re-initialization.
+ */
+static void ixgbe_reset_done(struct pci_dev *pdev)
+{
+ struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
+
+ if (!adapter)
+ return;
+
+ if (!test_and_clear_bit(__IXGBE_PCIE_RESET_IN_PROGRESS, &adapter->state)) {
+ /* Should never get here */
+ e_err(drv, "Reset done called without PCIe reset in progress\n");
+ return;
+ }
+
+ pci_set_master(pdev);
+
+ /* Allow the service task to run */
+ if (!test_bit(__IXGBE_REMOVING, &adapter->state)) {
+ clear_bit(__IXGBE_RESETTING, &adapter->state);
+ /* Sync __IXGBE_RESETTING */
+ smp_mb__after_atomic();
+ }
+
+ /* Schedule re-initialization */
+ if (!test_bit(__IXGBE_DOWN, &adapter->state)) {
+ set_bit(__IXGBE_RESET_REQUESTED, &adapter->state);
+ if (test_bit(__IXGBE_SERVICE_INITED, &adapter->state))
+ mod_timer(&adapter->service_timer, jiffies + 1);
+ }
+}
+
/**
* ixgbe_io_resume - called when traffic can start flowing again.
* @pdev: Pointer to PCI device
@@ -12384,6 +12464,8 @@ static const struct pci_error_handlers ixgbe_err_handler = {
.error_detected = ixgbe_io_error_detected,
.slot_reset = ixgbe_io_slot_reset,
.resume = ixgbe_io_resume,
+ .reset_prepare = ixgbe_reset_prep,
+ .reset_done = ixgbe_reset_done,
};
static DEFINE_SIMPLE_DEV_PM_OPS(ixgbe_pm_ops, ixgbe_suspend, ixgbe_resume);
--
2.53.0
^ permalink raw reply related
* [PATCH net] net: au1000: move free_irq out of the close-time spinlocked section
From: Runyu Xiao @ 2026-06-18 14:19 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, jianhao.xu, runyu.xiao, stable
au1000_close() calls free_irq() while aup->lock is still held with
spin_lock_irqsave(). free_irq() can sleep because it takes the IRQ
descriptor request mutex, so it does not belong inside the close-time
spinlocked section.
This issue was found by our static analysis tool and then manually
reviewed against the current tree.
The grounded PoC kept the ndo_stop carrier and the au1000_close() ->
free_irq(dev->irq, dev) path while the driver lock was held. Lockdep
reported:
BUG: sleeping function called from invalid context
1 lock held by exploit/192:
#0: (&aup->lock){....}-{2:2}, at: au1000_close+0x23/0x83 [vuln_msv]
[ BUG: Invalid wait context ]
exploit/192 is trying to lock:
(&desc->request_mutex){+.+.}-{3:3}, at: free_irq+0x63/0x360
free_irq+0x63/0x360
au1000_close+0x65/0x83 [vuln_msv]
Drop aup->lock before freeing the IRQ. The protected close-time work
still stops the device and queue before IRQ teardown, but the sleepable
IRQ core path now runs outside the spinlocked section.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
drivers/net/ethernet/amd/au1000_eth.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/amd/au1000_eth.c b/drivers/net/ethernet/amd/au1000_eth.c
index 9d35ac348ebe..5a04056e38fa 100644
--- a/drivers/net/ethernet/amd/au1000_eth.c
+++ b/drivers/net/ethernet/amd/au1000_eth.c
@@ -943,9 +943,10 @@ static int au1000_close(struct net_device *dev)
/* stop the device */
netif_stop_queue(dev);
+ spin_unlock_irqrestore(&aup->lock, flags);
+
/* disable the interrupt */
free_irq(dev->irq, dev);
- spin_unlock_irqrestore(&aup->lock, flags);
return 0;
}
--
2.34.1
^ permalink raw reply related
* Re: [PATCH v3 1/3] net/smc: bound the wire-controlled producer cursor to the RMB
From: Dust Li @ 2026-06-18 14:29 UTC (permalink / raw)
To: hexlabsecurity, Wenjia Zhang, D. Wythe, Sidraya Jayagond
Cc: Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
Paolo Abeni, linux-kernel, linux-rdma, Jakub Kicinski, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-1-551fa514257e@proton.me>
On 2026-06-14 03:23:30, Bryam Vargas via B4 Relay wrote:
>From: Bryam Vargas <hexlabsecurity@proton.me>
>
>smc_cdc_cursor_to_host() (SMC-R) and smcd_cdc_msg_to_host() (SMC-D)
>import the peer's producer cursor from the wire into the local
>connection cursor with no upper bound against the receive buffer (RMB).
>The urgent path then uses that count as a raw index:
>
> base = conn->rmb_desc->cpu_addr + conn->rx_off;
> conn->urg_rx_byte = *(base + conn->urg_curs.count - 1);
>
>so a peer that advertises a producer cursor past rmb_desc->len reads
>out of bounds of the RMB allocation in the receive tasklet (softirq).
>
>Bound the producer cursor count to rmb_desc->len at the conversion
>boundary, for both SMC-R and SMC-D. Apply the bound to the producer
>cursor only: the consumer cursor indexes the peer's RMB and is bounded
>by peer_rmbe_size, so clamping it to our rmb_desc->len would
>under-credit peer_rmbe_space and stall transmit to a peer whose RMB is
>larger than ours.
>
>Fixes: de8474eb9d50 ("net/smc: urgent data support")
>Cc: stable@vger.kernel.org
>Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>---
> net/smc/smc_cdc.h | 27 ++++++++++++++++++++++++---
> 1 file changed, 24 insertions(+), 3 deletions(-)
>
>diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
>index 696cc11f2303..ca76ef630356 100644
>--- a/net/smc/smc_cdc.h
>+++ b/net/smc/smc_cdc.h
>@@ -221,7 +221,8 @@ static inline void smc_host_msg_to_cdc(struct smc_cdc_msg *peer,
>
> static inline void smc_cdc_cursor_to_host(union smc_host_cursor *local,
> union smc_cdc_cursor *peer,
>- struct smc_connection *conn)
>+ struct smc_connection *conn,
>+ int max_count)
> {
> union smc_host_cursor temp, old;
> union smc_cdc_cursor net;
>@@ -235,6 +236,15 @@ static inline void smc_cdc_cursor_to_host(union smc_host_cursor *local,
> if ((old.wrap == temp.wrap) &&
> (old.count > temp.count))
> return;
>+ /* The peer producer cursor is wire-controlled and is later used as a
>+ * raw index into our RMB by the urgent path; bound its count to the
>+ * RMB. max_count == 0 leaves the consumer cursor unbounded here: it
>+ * indexes the peer's RMB (bounded by peer_rmbe_size, not our
>+ * rmb_desc->len), so clamping it to rmb_desc->len would under-credit
>+ * peer_rmbe_space and stall transmit to peers with a larger RMB.
>+ */
>+ if (max_count && temp.count > max_count)
>+ temp.count = max_count;
> smc_curs_copy(local, &temp, conn);
> }
>
>@@ -246,8 +256,13 @@ static inline void smcr_cdc_msg_to_host(struct smc_host_cdc_msg *local,
> local->len = peer->len;
> local->seqno = ntohs(peer->seqno);
> local->token = ntohl(peer->token);
>- smc_cdc_cursor_to_host(&local->prod, &peer->prod, conn);
>- smc_cdc_cursor_to_host(&local->cons, &peer->cons, conn);
>+ /* bound the wire-controlled producer cursor to our RMB (used as a raw
>+ * index by the urgent path); leave the consumer cursor unbounded -- it
>+ * indexes the peer's RMB and is bounded by peer_rmbe_size.
>+ */
>+ smc_cdc_cursor_to_host(&local->prod, &peer->prod, conn,
>+ conn->rmb_desc->len);
>+ smc_cdc_cursor_to_host(&local->cons, &peer->cons, conn, 0);
> local->prod_flags = peer->prod_flags;
> local->conn_state_flags = peer->conn_state_flags;
> }
>@@ -260,6 +275,12 @@ static inline void smcd_cdc_msg_to_host(struct smc_host_cdc_msg *local,
>
> temp.wrap = peer->prod.wrap;
> temp.count = peer->prod.count;
>+ /* the peer producer cursor is wire-controlled and is used as a raw
>+ * index into our RMB by the urgent path; bound it to the RMB. The
>+ * consumer cursor below indexes the peer's RMB and is left unbounded.
>+ */
>+ if (temp.count > conn->rmb_desc->len)
>+ temp.count = conn->rmb_desc->len;
> smc_curs_copy(&local->prod, &temp, conn);
>
> temp.wrap = peer->cons.wrap;
Hi Bryam,
I agree the issue is real. SMC-R's original design didn't fully
account for misbehaving peers, which is the root cause behind a
number of similar issues we've seen. The good news is that this
class of problem isn't easy to hit in practice, so it isn't
particularly urgent.
On the approach itself: once we detect that the peer is misbehaving,
I think the right action is to abort the connection and record the
event, rather than silently clamp. An invalid CDC means the whole
communication state can no longer be trusted, so continuing on a
clamped value just papers over a peer bug.
I'd suggest we add a dedicated CDC message check, and route any
failure through the existing abort path, maybe something like bellow:
static bool smc_cdc_msg_check(struct smc_connection *conn,
struct smc_cdc_msg *cdc)
{
u32 prod_count = ntohs(cdc->prod.count);
u32 cons_count = ntohs(cdc->cons.count);
if (prod_count > conn->rmb_desc->len ||
cons_count > conn->peer_rmbe_size ||
cdc->prod.wrap > 1 || cdc->cons.wrap > 1) {
this_cpu_inc(net->smc.smc_stats->...cdc_inval);
net_ratelimited_function(pr_warn,
"smc: invalid CDC from peer (token=%u)\n",
ntohl(cdc->token));
return false;
}
return true;
}
For -stable, your current minimal patch is fine. For net-next, though, I'd prefer
the approach above: validate at the wire boundary, abort on violation, and
make the event observable via smc_stats and a ratelimited warning.
Best regards,
Dust
^ permalink raw reply
* [PATCH v4] net: mvneta_bm: add suspend/resume support to prevent crash after resume
From: Yun Zhou @ 2026-06-18 14:35 UTC (permalink / raw)
To: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni
Cc: netdev, linux-kernel, yun.zhou
The mvneta driver uses the hardware Buffer Manager (BM) for RX buffer
allocation. During suspend, mvneta disables its clock, causing BM to
lose all buffer address state. On resume, mvneta_bm_port_init() re-
attaches the BM pool to the NIC, but BM hardware returns stale/garbage
buffer addresses. When NAPI poll processes these buffers, DMA cache
sync hits an invalid virtual address causing a kernel panic:
Unable to handle kernel paging request at virtual address b0000080
PC is at v7_dma_inv_range
Call trace:
v7_dma_inv_range from arch_sync_dma_for_cpu+0x94/0x158
arch_sync_dma_for_cpu from __dma_sync_single_for_cpu+0xc4/0x15c
__dma_sync_single_for_cpu from mvneta_rx_swbm+0x6c8/0xf48
mvneta_rx_swbm from mvneta_poll+0x6fc/0x70c
mvneta_poll from __napi_poll.constprop.0+0x2c/0x1e0
__napi_poll.constprop.0 from net_rx_action+0x160/0x2c4
net_rx_action from handle_softirqs+0xd8/0x2b8
handle_softirqs from run_ksoftirqd+0x30/0x94
run_ksoftirqd from smpboot_thread_fn+0x100/0x204
smpboot_thread_fn from kthread+0xf4/0x110
kthread from ret_from_fork+0x14/0x28
Fix by adding suspend/resume callbacks to the BM driver:
- suspend: drain all buffers (with DMA unmapping), free the BPPE
regions, and reset pool state to FREE before stopping BM and gating
the clock.
- resume: enable the clock, reinitialize BM defaults, and restore pool
read/write pointers and size registers. Pool allocation and buffer
refill are handled by mvneta_resume() through the normal
mvneta_bm_port_init() path, which sees pools as FREE and performs
full initialization identical to probe.
Add a device_link (DL_FLAG_AUTOREMOVE_CONSUMER) in mvneta_probe to
guarantee BM resumes before mvneta and suspends after mvneta. If the
link cannot be created, fall back to SW buffer management to avoid a
potential crash on resume due to unordered PM transitions.
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v4:
- On device_link_add() failure, fall back to SW buffer management
(destroy pools, put BM reference, clear bm_priv) instead of merely
emitting a warning. Without the link, suspend/resume ordering is
not guaranteed and the original crash can still occur.
v3:
- Restore per-pool POOL_SIZE_REG, POOL_READ_PTR_REG, and
POOL_WRITE_PTR_REG in resume, since clock gating loses all BM
register state.
- Check device_link_add() return value and emit dev_warn on failure.
- Replace SIMPLE_DEV_PM_OPS (deprecated) with
DEFINE_SIMPLE_DEV_PM_OPS and pm_sleep_ptr(), removing the
#ifdef CONFIG_PM_SLEEP guard.
- Add dev_warn in suspend if not all buffers could be freed.
v2:
- Drain buffers via mvneta_bm_bufs_free() in suspend instead of only
stopping BM and gating the clock. This ensures proper DMA unmapping
and avoids buffer leaks.
- Free the BPPE DMA-coherent region in suspend so that resume takes
the full probe-time initialization path (alloc + fill), eliminating
the need to modify mvneta_bm_pool_create().
- Reset pool type to MVNETA_BM_FREE in suspend so mvneta_bm_pool_use()
correctly re-creates and refills pools on resume.
- Check clk_prepare_enable() return value in resume.
- Add device_link between mvneta (consumer) and mvneta_bm (supplier)
to guarantee correct suspend/resume ordering.
drivers/net/ethernet/marvell/mvneta.c | 18 ++++++++
drivers/net/ethernet/marvell/mvneta_bm.c | 58 ++++++++++++++++++++++++
2 files changed, 76 insertions(+)
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 744d6585a949..543e566425c1 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -5678,6 +5678,24 @@ static int mvneta_probe(struct platform_device *pdev)
"use SW buffer management\n");
mvneta_bm_put(pp->bm_priv);
pp->bm_priv = NULL;
+ } else if (!device_link_add(&pdev->dev,
+ &pp->bm_priv->pdev->dev,
+ DL_FLAG_AUTOREMOVE_CONSUMER)) {
+ /*
+ * Link guarantees BM resumes before mvneta.
+ * Without it, BM may not be ready when
+ * mvneta_bm_port_init() runs on resume,
+ * causing stale buffer addresses and a crash.
+ * Fall back to SW management to be safe.
+ */
+ dev_warn(&pdev->dev,
+ "failed to link to BM, use SW buffer management\n");
+ mvneta_bm_pool_destroy(pp->bm_priv,
+ pp->pool_long, 1 << pp->id);
+ mvneta_bm_pool_destroy(pp->bm_priv,
+ pp->pool_short, 1 << pp->id);
+ mvneta_bm_put(pp->bm_priv);
+ pp->bm_priv = NULL;
}
}
/* Set RX packet offset correction for platforms, whose
diff --git a/drivers/net/ethernet/marvell/mvneta_bm.c b/drivers/net/ethernet/marvell/mvneta_bm.c
index 6bb380494919..85162a43eaf6 100644
--- a/drivers/net/ethernet/marvell/mvneta_bm.c
+++ b/drivers/net/ethernet/marvell/mvneta_bm.c
@@ -477,6 +477,63 @@ static void mvneta_bm_remove(struct platform_device *pdev)
clk_disable_unprepare(priv->clk);
}
+static int mvneta_bm_suspend(struct device *dev)
+{
+ struct mvneta_bm *priv = dev_get_drvdata(dev);
+ int i;
+
+ /* Drain buffers and free pool resources while BM is still clocked */
+ for (i = 0; i < MVNETA_BM_POOLS_NUM; i++) {
+ struct mvneta_bm_pool *bm_pool = &priv->bm_pools[i];
+ int size_bytes;
+
+ if (bm_pool->type == MVNETA_BM_FREE)
+ continue;
+
+ mvneta_bm_bufs_free(priv, bm_pool, bm_pool->port_map);
+ if (bm_pool->hwbm_pool.buf_num)
+ dev_warn(&priv->pdev->dev,
+ "pool %d: %d buffers not freed\n",
+ bm_pool->id, bm_pool->hwbm_pool.buf_num);
+
+ size_bytes = sizeof(u32) * bm_pool->hwbm_pool.size;
+ dma_free_coherent(&priv->pdev->dev, size_bytes,
+ bm_pool->virt_addr, bm_pool->phys_addr);
+ bm_pool->virt_addr = NULL;
+ bm_pool->type = MVNETA_BM_FREE;
+ }
+
+ mvneta_bm_write(priv, MVNETA_BM_COMMAND_REG, MVNETA_BM_STOP_MASK);
+ clk_disable_unprepare(priv->clk);
+ return 0;
+}
+
+static int mvneta_bm_resume(struct device *dev)
+{
+ struct mvneta_bm *priv = dev_get_drvdata(dev);
+ int i, err;
+
+ err = clk_prepare_enable(priv->clk);
+ if (err)
+ return err;
+
+ /* Reinitialize BM hardware; pools are refilled by mvneta_resume() */
+ mvneta_bm_default_set(priv);
+
+ /* Restore pool registers lost during clock gating */
+ for (i = 0; i < MVNETA_BM_POOLS_NUM; i++) {
+ mvneta_bm_write(priv, MVNETA_BM_POOL_READ_PTR_REG(i), 0);
+ mvneta_bm_write(priv, MVNETA_BM_POOL_WRITE_PTR_REG(i), 0);
+ mvneta_bm_write(priv, MVNETA_BM_POOL_SIZE_REG(i),
+ priv->bm_pools[i].hwbm_pool.size);
+ }
+
+ mvneta_bm_write(priv, MVNETA_BM_COMMAND_REG, MVNETA_BM_START_MASK);
+ return 0;
+}
+
+static DEFINE_SIMPLE_DEV_PM_OPS(mvneta_bm_pm_ops, mvneta_bm_suspend, mvneta_bm_resume);
+
static const struct of_device_id mvneta_bm_match[] = {
{ .compatible = "marvell,armada-380-neta-bm" },
{ }
@@ -489,6 +546,7 @@ static struct platform_driver mvneta_bm_driver = {
.driver = {
.name = MVNETA_BM_DRIVER_NAME,
.of_match_table = mvneta_bm_match,
+ .pm = pm_sleep_ptr(&mvneta_bm_pm_ops),
},
};
--
2.43.0
^ permalink raw reply related
* [RFC v3 net-next] net: airoha: add HW GRO offload support
From: Lorenzo Bianconi @ 2026-06-18 14:42 UTC (permalink / raw)
To: andrew+netdev, davem, edumazet, kuba, pabeni
Cc: lorenzo, aleksander.lobakin, linux-arm-kernel, linux-mediatek,
netdev
Add hardware GRO offload support to the airoha_eth driver, leveraging
the EN7581/AN7583 SoC's 8 dedicated LRO hardware queues mapped to RX
queues 24-31. HW GRO offloading does not support Scatter-Gather (SG) so
it is required to increase the page_pool allocation order to 2 for RX
queues 24-31 (LRO queues).
Since HW GRO is configured per-QDMA and shared across all devices using
it, HW GRO is mutually exclusive with multiple active devices on the
same QDMA block. Call netdev_update_features() on sibling devices in
ndo_open/ndo_stop so that NETIF_F_GRO_HW availability is re-evaluated
when the QDMA user count changes.
Set CHECKSUM_PARTIAL with pseudo-header checksum on aggregated packets
so that L3-forwarded traffic is correctly handled by the GSO/TSO path
on the egress device.
Performance comparison between GRO and HW GRO has been carried out using
a 10Gbps NIC:
GRO: ~2.7 Gbps
HW GRO: ~8.1 Gbps
Tested-by: Madhur Agrawal <madhur.agrawal@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
Changes in RFC v3:
- Add missing TCP header length check.
- Fix TCP checkum calculation.
- Disable LRO running ndo_stop callback.
- Implement packet header split in order to support HW-GRO
- Link to v2: https://lore.kernel.org/r/20260610-airoha-eth-lro-v2-1-54be99b9a2d5@kernel.org
Changes in v2:
- Rebase on top of net-next main branch.
- Link to v1: https://lore.kernel.org/r/20260606-airoha-eth-lro-v1-1-0ebceb0eafc3@kernel.org
Changes in v1:
- Please note this patch depends on the following patch not applied yet
to net-next
https://lore.kernel.org/netdev/20260606-airoha_qdma_users-no-atomic-v1-1-86e2d6a1bfaf@kernel.org/T/#u
- Restrict LRO to single user QDMA.
- Introduce some more sanity checks.
- Disable scatter-gather for LRO queues.
- Run netif_receive_skb() for LRO packets.
- Link to v3: https://lore.kernel.org/r/20260528-airoha-eth-lro-v3-1-dd09c1fb000e@kernel.org
Changes in RFC v3:
- Fix double-free of the page_pool of airoha_qdma_lro_rx_process()
fails.
- Set AIROHA_LRO_PAGE_ORDER according to PAGE_SIZE.
- Add missig gso metadata for the LRO packet.
- Link to v2: https://lore.kernel.org/r/20260526-airoha-eth-lro-v2-1-24e2a9e7a397@kernel.org
Changes in RFC v2:
- Improve performances fixing buf_size computation.
- Fix possible overflow in REG_CDM_LRO_LIMIT() register configuration.
- Require the device to be not running before configuring LRO.
- Fix configuration order in airoha_fe_lro_is_enabled().
- Check skb header length in airoha_qdma_lro_rx_process().
- Do not check net_device feature in airoha_qdma_rx_process() before
executing airoha_qdma_lro_rx_process() but rely on
airoha_qdma_lro_rx_process() logic.
- Fix possible double recycle in airoha_qdma_rx_process() for LRO
packets.
- Always use AIROHA_RXQ_LRO_MAX_AGG_COUNT macro for max LRO aggregated
fragments in airoha_fe_lro_init_rx_queue().
- Link to v1: https://lore.kernel.org/r/20260520-airoha-eth-lro-v1-1-129cc33766e9@kernel.org
---
drivers/net/ethernet/airoha/airoha_eth.c | 364 ++++++++++++++++++++--
drivers/net/ethernet/airoha/airoha_eth.h | 24 ++
drivers/net/ethernet/airoha/airoha_regs.h | 22 +-
3 files changed, 386 insertions(+), 24 deletions(-)
diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 64dde6464f3f..2aa6915d424e 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -10,8 +10,10 @@
#include <linux/tcp.h>
#include <linux/u64_stats_sync.h>
#include <net/dst_metadata.h>
+#include <net/ip6_checksum.h>
#include <net/page_pool/helpers.h>
#include <net/pkt_cls.h>
+#include <net/tcp.h>
#include <uapi/linux/ppp_defs.h>
#include "airoha_regs.h"
@@ -486,6 +488,73 @@ static void airoha_fe_crsn_qsel_init(struct airoha_eth *eth)
CDM_CRSN_QSEL_Q1));
}
+static void airoha_fe_lro_rxq_enable(struct airoha_eth *eth, int qdma_id,
+ int lro_queue_index, int qid,
+ int buf_size)
+{
+ int id = qdma_id + 1;
+
+ airoha_fe_rmw(eth, REG_CDM_LRO_LIMIT(id),
+ CDM_LRO_AGG_NUM_MASK | CDM_LRO_AGG_SIZE_MASK,
+ FIELD_PREP(CDM_LRO_AGG_SIZE_MASK, buf_size) |
+ FIELD_PREP(CDM_LRO_AGG_NUM_MASK,
+ AIROHA_RXQ_LRO_MAX_AGG_COUNT));
+ airoha_fe_rmw(eth, REG_CDM_LRO_AGE_TIME(id),
+ CDM_LRO_AGE_TIME_MASK | CDM_LRO_AGG_TIME_MASK,
+ FIELD_PREP(CDM_LRO_AGE_TIME_MASK,
+ AIROHA_RXQ_LRO_MAX_AGE_TIME) |
+ FIELD_PREP(CDM_LRO_AGG_TIME_MASK,
+ AIROHA_RXQ_LRO_MAX_AGG_TIME));
+ airoha_fe_rmw(eth, REG_CDM_LRO_RXQ(id, lro_queue_index),
+ LRO_RXQ_MASK(lro_queue_index),
+ __field_prep(LRO_RXQ_MASK(lro_queue_index), qid));
+ airoha_fe_set(eth, REG_CDM_LRO_EN(id), BIT(lro_queue_index));
+}
+
+static void airoha_fe_lro_disable(struct airoha_eth *eth, int qdma_id)
+{
+ int i, id = qdma_id + 1;
+
+ airoha_fe_clear(eth, REG_CDM_LRO_EN(id), LRO_RXQ_EN_MASK);
+ airoha_fe_clear(eth, REG_CDM_LRO_LIMIT(id),
+ CDM_LRO_AGG_NUM_MASK | CDM_LRO_AGG_SIZE_MASK);
+ airoha_fe_clear(eth, REG_CDM_LRO_AGE_TIME(id),
+ CDM_LRO_AGE_TIME_MASK | CDM_LRO_AGG_TIME_MASK);
+ for (i = 0; i < AIROHA_MAX_NUM_LRO_QUEUES; i++)
+ airoha_fe_clear(eth, REG_CDM_LRO_RXQ(id, i), LRO_RXQ_MASK(i));
+}
+
+static bool airoha_fe_lro_is_enabled(struct airoha_eth *eth, int qdma_id)
+{
+ return airoha_fe_get(eth, REG_CDM_LRO_EN(qdma_id + 1),
+ LRO_RXQ_EN_MASK);
+}
+
+static void airoha_dev_lro_enable(struct airoha_gdm_dev *dev)
+{
+ struct airoha_qdma *qdma = dev->qdma;
+ struct airoha_eth *eth = qdma->eth;
+ int qdma_id = qdma - ð->qdma[0];
+ int i, lro_queue_index = 0;
+
+ for (i = 0; i < ARRAY_SIZE(qdma->q_rx); i++) {
+ struct airoha_queue *q = &qdma->q_rx[i];
+ u32 size;
+
+ if (!q->ndesc)
+ continue;
+
+ if (!airoha_qdma_is_lro_queue(q))
+ continue;
+
+ size = SKB_WITH_OVERHEAD(AIROHA_RX_LEN(q->buf_size));
+ size = min_t(u32, size, CDM_LRO_AGG_SIZE_MASK);
+ airoha_fe_lro_rxq_enable(eth, qdma_id, lro_queue_index, i,
+ size);
+ lro_queue_index++;
+ }
+}
+
static int airoha_fe_init(struct airoha_eth *eth)
{
airoha_fe_maccr_init(eth);
@@ -611,6 +680,7 @@ static int airoha_qdma_fill_rx_queue(struct airoha_queue *q)
e->dma_addr = page_pool_get_dma_addr(page) + offset;
e->dma_len = SKB_WITH_OVERHEAD(AIROHA_RX_LEN(q->buf_size));
+ WRITE_ONCE(desc->tcp_ts_reply, 0);
val = FIELD_PREP(QDMA_DESC_LEN_MASK, e->dma_len);
WRITE_ONCE(desc->ctrl, cpu_to_le32(val));
WRITE_ONCE(desc->addr, cpu_to_le32(e->dma_addr));
@@ -652,12 +722,173 @@ airoha_qdma_get_gdm_dev(struct airoha_eth *eth, struct airoha_qdma_desc *desc)
return port->devs[d] ? port->devs[d] : ERR_PTR(-ENODEV);
}
+static struct sk_buff *airoha_qdma_lro_rx_skb(struct airoha_queue *q,
+ struct airoha_qdma_desc *desc,
+ struct airoha_queue_entry *e)
+{
+ u32 len, th_off, tcp_ack_seq, agg_count, data_off, data_len;
+ u32 desc_ctrl = le32_to_cpu(READ_ONCE(desc->ctrl));
+ u32 msg1 = le32_to_cpu(READ_ONCE(desc->msg1));
+ u32 msg2 = le32_to_cpu(READ_ONCE(desc->msg2));
+ u32 msg3 = le32_to_cpu(READ_ONCE(desc->msg3));
+ struct skb_shared_info *shinfo;
+ u16 tcp_win, l2_len;
+ struct sk_buff *skb;
+ struct tcphdr *th;
+ struct page *page;
+ bool ipv4, ipv6;
+
+ ipv4 = FIELD_GET(QDMA_ETH_RXMSG_IP4_MASK, msg1);
+ ipv6 = FIELD_GET(QDMA_ETH_RXMSG_IP6_MASK, msg1);
+ if (!ipv4 && !ipv6)
+ return NULL;
+
+ l2_len = FIELD_GET(QDMA_ETH_RXMSG_L2_LEN_MASK, msg2);
+ len = FIELD_GET(QDMA_DESC_LEN_MASK, desc_ctrl);
+
+ if (ipv4) {
+ struct iphdr *iph;
+
+ if (len < l2_len + sizeof(*iph))
+ return NULL;
+
+ iph = (struct iphdr *)(e->buf + l2_len);
+ if (iph->protocol != IPPROTO_TCP)
+ return NULL;
+
+ if (iph->ihl < 5)
+ return NULL;
+
+ th_off = l2_len + (iph->ihl << 2);
+ if (len < th_off)
+ return NULL;
+
+ iph->tot_len = cpu_to_be16(len - l2_len);
+ iph->check = 0;
+ iph->check = ip_fast_csum((void *)iph, iph->ihl);
+ } else {
+ struct ipv6hdr *ip6h;
+
+ th_off = l2_len + sizeof(*ip6h);
+ if (len < th_off)
+ return NULL;
+
+ ip6h = (struct ipv6hdr *)(e->buf + l2_len);
+ if (ip6h->nexthdr != NEXTHDR_TCP)
+ return NULL;
+
+ ip6h->payload_len = cpu_to_be16(len - th_off);
+ }
+
+ if (len < th_off + sizeof(*th))
+ return NULL;
+
+ th = (struct tcphdr *)(e->buf + th_off);
+ if (th->doff < 5)
+ return NULL;
+
+ data_off = th_off + (th->doff << 2);
+ if (len < data_off)
+ return NULL;
+
+ tcp_win = FIELD_GET(QDMA_ETH_RXMSG_TCP_WIN_MASK, msg3);
+ tcp_ack_seq = le32_to_cpu(READ_ONCE(desc->data));
+ th->ack_seq = cpu_to_be32(tcp_ack_seq);
+ th->window = cpu_to_be16(tcp_win);
+
+ /* Check tcp timestamp option */
+ if (th->doff == (sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) / 4) {
+ u32 topt = get_unaligned_be32(th + 1);
+
+ if (topt == ((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
+ (TCPOPT_TIMESTAMP << 8) | TCPOLEN_TIMESTAMP)) {
+ u8 *ptr = (u8 *)th + sizeof(*th) + 2 * sizeof(__be32);
+ __le32 tcp_ts_reply = READ_ONCE(desc->tcp_ts_reply);
+
+ put_unaligned_be32(le32_to_cpu(tcp_ts_reply), ptr);
+ }
+ }
+
+ if (ipv4) {
+ struct iphdr *iph = (struct iphdr *)(e->buf + l2_len);
+
+ th->check = ~tcp_v4_check(len - th_off, iph->saddr,
+ iph->daddr, 0);
+ } else {
+ struct ipv6hdr *ip6h = (struct ipv6hdr *)(e->buf + l2_len);
+
+ th->check = ~tcp_v6_check(len - th_off, &ip6h->saddr,
+ &ip6h->daddr, 0);
+ }
+
+ /* Split network headers and payload to rely on GRO.
+ * We need to do it in the driver since the NIC does
+ * not support it.
+ */
+ skb = napi_alloc_skb(&q->napi, data_off);
+ if (!skb)
+ return NULL;
+
+ __skb_put(skb, data_off);
+ memcpy(skb->data, e->buf, data_off);
+
+ page = virt_to_head_page(e->buf);
+ data_len = len - data_off;
+ shinfo = skb_shinfo(skb);
+ skb_add_rx_frag(skb, shinfo->nr_frags, page,
+ e->buf + data_off - page_address(page), data_len,
+ q->buf_size);
+
+ shinfo->gso_type = ipv4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
+ agg_count = FIELD_GET(QDMA_ETH_RXMSG_AGG_COUNT_MASK, msg2);
+ shinfo->gso_size = DIV_ROUND_UP(data_len, agg_count);
+ shinfo->gso_segs = agg_count;
+
+ skb->csum_start = skb_headroom(skb) + th_off;
+ skb->csum_offset = offsetof(struct tcphdr, check);
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ return skb;
+}
+
+static struct sk_buff *airoha_qdma_build_rx_skb(struct airoha_queue *q,
+ struct airoha_qdma_desc *desc,
+ struct airoha_queue_entry *e,
+ struct net_device *dev)
+{
+ u32 msg2 = le32_to_cpu(READ_ONCE(desc->msg2));
+ int qid = q - &q->qdma->q_rx[0];
+ struct sk_buff *skb;
+
+ if (FIELD_GET(QDMA_ETH_RXMSG_AGG_COUNT_MASK, msg2) > 1) { /* LRO */
+ skb = airoha_qdma_lro_rx_skb(q, desc, e);
+ if (!skb)
+ return NULL;
+ } else {
+ u32 desc_ctrl = le32_to_cpu(READ_ONCE(desc->ctrl));
+ u32 len = FIELD_GET(QDMA_DESC_LEN_MASK, desc_ctrl);
+
+ skb = napi_build_skb(e->buf - AIROHA_RX_HEADROOM, q->buf_size);
+ if (!skb)
+ return NULL;
+
+ skb_reserve(skb, AIROHA_RX_HEADROOM);
+ __skb_put(skb, len);
+ skb->ip_summed = CHECKSUM_UNNECESSARY;
+ }
+
+ skb_mark_for_recycle(skb);
+ skb->dev = dev;
+ skb_record_rx_queue(skb, qid);
+ skb->protocol = eth_type_trans(skb, dev);
+
+ return skb;
+}
+
static int airoha_qdma_rx_process(struct airoha_queue *q, int budget)
{
enum dma_data_direction dir = page_pool_get_dma_dir(q->page_pool);
- struct airoha_qdma *qdma = q->qdma;
- struct airoha_eth *eth = qdma->eth;
- int qid = q - &qdma->q_rx[0];
+ struct airoha_eth *eth = q->qdma->eth;
int done = 0;
while (done < budget) {
@@ -693,18 +924,9 @@ static int airoha_qdma_rx_process(struct airoha_queue *q, int budget)
netdev = netdev_from_priv(dev);
if (!q->skb) { /* first buffer */
- q->skb = napi_build_skb(e->buf - AIROHA_RX_HEADROOM,
- q->buf_size);
+ q->skb = airoha_qdma_build_rx_skb(q, desc, e, netdev);
if (!q->skb)
goto free_frag;
-
- skb_reserve(q->skb, AIROHA_RX_HEADROOM);
- __skb_put(q->skb, len);
- skb_mark_for_recycle(q->skb);
- q->skb->dev = netdev;
- q->skb->protocol = eth_type_trans(q->skb, netdev);
- q->skb->ip_summed = CHECKSUM_UNNECESSARY;
- skb_record_rx_queue(q->skb, qid);
} else { /* scattered frame */
struct skb_shared_info *shinfo = skb_shinfo(q->skb);
int nr_frags = shinfo->nr_frags;
@@ -795,12 +1017,10 @@ static int airoha_qdma_rx_napi_poll(struct napi_struct *napi, int budget)
static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
struct airoha_qdma *qdma, int ndesc)
{
- const struct page_pool_params pp_params = {
- .order = 0,
+ struct page_pool_params pp_params = {
.pool_size = 256,
.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
.dma_dir = DMA_FROM_DEVICE,
- .max_len = PAGE_SIZE,
.nid = NUMA_NO_NODE,
.dev = qdma->eth->dev,
.napi = &q->napi,
@@ -808,9 +1028,10 @@ static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
struct airoha_eth *eth = qdma->eth;
int qid = q - &qdma->q_rx[0], thr;
dma_addr_t dma_addr;
+ bool lro_q;
- q->buf_size = PAGE_SIZE / 2;
q->qdma = qdma;
+ lro_q = airoha_qdma_is_lro_queue(q);
q->entry = devm_kzalloc(eth->dev, ndesc * sizeof(*q->entry),
GFP_KERNEL);
@@ -822,6 +1043,9 @@ static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
if (!q->desc)
return -ENOMEM;
+ pp_params.order = lro_q ? AIROHA_LRO_PAGE_ORDER : 0;
+ pp_params.max_len = PAGE_SIZE << pp_params.order;
+
q->page_pool = page_pool_create(&pp_params);
if (IS_ERR(q->page_pool)) {
int err = PTR_ERR(q->page_pool);
@@ -830,6 +1054,7 @@ static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
return err;
}
+ q->buf_size = lro_q ? pp_params.max_len : pp_params.max_len / 2;
q->ndesc = ndesc;
netif_napi_add(eth->napi_dev, &q->napi, airoha_qdma_rx_napi_poll);
@@ -843,7 +1068,12 @@ static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
FIELD_PREP(RX_RING_THR_MASK, thr));
airoha_qdma_rmw(qdma, REG_RX_DMA_IDX(qid), RX_RING_DMA_IDX_MASK,
FIELD_PREP(RX_RING_DMA_IDX_MASK, q->head));
- airoha_qdma_set(qdma, REG_RX_SCATTER_CFG(qid), RX_RING_SG_EN_MASK);
+ if (lro_q)
+ airoha_qdma_clear(qdma, REG_RX_SCATTER_CFG(qid),
+ RX_RING_SG_EN_MASK);
+ else
+ airoha_qdma_set(qdma, REG_RX_SCATTER_CFG(qid),
+ RX_RING_SG_EN_MASK);
airoha_qdma_fill_rx_queue(q);
@@ -865,6 +1095,7 @@ static void airoha_qdma_cleanup_rx_queue(struct airoha_queue *q)
page_pool_get_dma_dir(q->page_pool));
page_pool_put_full_page(q->page_pool, page, false);
/* Reset DMA descriptor */
+ WRITE_ONCE(desc->tcp_ts_reply, 0);
WRITE_ONCE(desc->ctrl, 0);
WRITE_ONCE(desc->addr, 0);
WRITE_ONCE(desc->data, 0);
@@ -1802,6 +2033,37 @@ static void airoha_update_hw_stats(struct airoha_gdm_dev *dev)
spin_unlock(&port->stats_lock);
}
+static void airoha_update_netdev_features(struct airoha_gdm_dev *dev)
+{
+ struct airoha_eth *eth = dev->eth;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
+ struct airoha_gdm_port *port = eth->ports[i];
+ int j;
+
+ if (!port)
+ continue;
+
+ for (j = 0; j < ARRAY_SIZE(port->devs); j++) {
+ struct airoha_gdm_dev *iter_dev = port->devs[j];
+ struct net_device *netdev;
+
+ if (!iter_dev || iter_dev == dev)
+ continue;
+
+ if (iter_dev->qdma != dev->qdma)
+ continue;
+
+ netdev = netdev_from_priv(iter_dev);
+ if (netdev->reg_state != NETREG_REGISTERED)
+ continue;
+
+ netdev_update_features(netdev);
+ }
+ }
+}
+
static int airoha_dev_open(struct net_device *netdev)
{
int err, len = ETH_HLEN + netdev->mtu + ETH_FCS_LEN;
@@ -1809,6 +2071,17 @@ static int airoha_dev_open(struct net_device *netdev)
struct airoha_gdm_port *port = dev->port;
u32 cur_len, pse_port = FE_PSE_PORT_PPE1;
struct airoha_qdma *qdma = dev->qdma;
+ int qdma_id = qdma - &qdma->eth->qdma[0];
+
+ /* HW GRO is configured on the QDMA and it is shared between
+ * all the devices using it. Refuse to open a second device on
+ * the same QDMA if HW GRO is enabled on any device sharing it.
+ */
+ if (qdma->users && airoha_fe_lro_is_enabled(qdma->eth, qdma_id)) {
+ netdev_warn(netdev, "required to disable HW GRO on QDMA%d\n",
+ qdma_id);
+ return -EBUSY;
+ }
netif_tx_start_all_queues(netdev);
err = airoha_set_vip_for_gdm_port(dev, true);
@@ -1848,6 +2121,11 @@ static int airoha_dev_open(struct net_device *netdev)
airoha_set_gdm_port_fwd_cfg(qdma->eth, REG_GDM_FWD_CFG(port->id),
pse_port);
+ if (netdev->features & NETIF_F_GRO_HW)
+ airoha_dev_lro_enable(dev);
+
+ airoha_update_netdev_features(dev);
+
return 0;
}
@@ -1895,6 +2173,9 @@ static int airoha_dev_stop(struct net_device *netdev)
FE_PSE_PORT_DROP);
if (!--qdma->users) {
+ int qdma_id = qdma - &qdma->eth->qdma[0];
+
+ airoha_fe_lro_disable(qdma->eth, qdma_id);
airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
GLOBAL_CFG_TX_DMA_EN_MASK |
GLOBAL_CFG_RX_DMA_EN_MASK);
@@ -1907,6 +2188,8 @@ static int airoha_dev_stop(struct net_device *netdev)
}
}
+ airoha_update_netdev_features(dev);
+
return 0;
}
@@ -2176,6 +2459,41 @@ int airoha_get_fe_port(struct airoha_gdm_dev *dev)
}
}
+static netdev_features_t airoha_dev_fix_features(struct net_device *netdev,
+ netdev_features_t features)
+{
+ struct airoha_gdm_dev *dev = netdev_priv(netdev);
+ struct airoha_qdma *qdma = dev->qdma;
+
+ if (qdma->users > 1)
+ features &= ~NETIF_F_GRO_HW;
+
+ return features;
+}
+
+static int airoha_dev_set_features(struct net_device *netdev,
+ netdev_features_t features)
+{
+ netdev_features_t diff = netdev->features ^ features;
+ struct airoha_gdm_dev *dev = netdev_priv(netdev);
+
+ if (!(diff & NETIF_F_GRO_HW))
+ return 0;
+
+ if (!netif_running(netdev))
+ return 0;
+
+ if (features & NETIF_F_GRO_HW) {
+ airoha_dev_lro_enable(dev);
+ } else {
+ int qdma_id = dev->qdma - &dev->eth->qdma[0];
+
+ airoha_fe_lro_disable(dev->eth, qdma_id);
+ }
+
+ return 0;
+}
+
static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
struct net_device *netdev)
{
@@ -3102,6 +3420,8 @@ static const struct net_device_ops airoha_netdev_ops = {
.ndo_stop = airoha_dev_stop,
.ndo_change_mtu = airoha_dev_change_mtu,
.ndo_select_queue = airoha_dev_select_queue,
+ .ndo_fix_features = airoha_dev_fix_features,
+ .ndo_set_features = airoha_dev_set_features,
.ndo_start_xmit = airoha_dev_xmit,
.ndo_get_stats64 = airoha_dev_get_stats64,
.ndo_set_mac_address = airoha_dev_set_macaddr,
@@ -3189,11 +3509,9 @@ static int airoha_alloc_gdm_device(struct airoha_eth *eth,
netdev->ethtool_ops = &airoha_ethtool_ops;
netdev->max_mtu = AIROHA_MAX_MTU;
netdev->watchdog_timeo = 5 * HZ;
- netdev->hw_features = NETIF_F_IP_CSUM | NETIF_F_RXCSUM | NETIF_F_TSO6 |
- NETIF_F_IPV6_CSUM | NETIF_F_SG | NETIF_F_TSO |
- NETIF_F_HW_TC;
- netdev->features |= netdev->hw_features;
- netdev->vlan_features = netdev->hw_features;
+ netdev->hw_features = AIROHA_HW_FEATURES | NETIF_F_GRO_HW;
+ netdev->features |= AIROHA_HW_FEATURES;
+ netdev->vlan_features = AIROHA_HW_FEATURES;
SET_NETDEV_DEV(netdev, eth->dev);
/* reserve hw queues for HTB offloading */
diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h
index 41d2e7a1f9fb..c13757a88aba 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.h
+++ b/drivers/net/ethernet/airoha/airoha_eth.h
@@ -44,6 +44,18 @@
(_n) == 15 ? 128 : \
(_n) == 0 ? 1024 : 16)
+#define AIROHA_LRO_PAGE_ORDER order_base_2(SZ_16K / PAGE_SIZE)
+#define AIROHA_MAX_NUM_LRO_QUEUES 8
+#define AIROHA_RXQ_LRO_EN_MASK GENMASK(31, 24)
+#define AIROHA_RXQ_LRO_MAX_AGG_COUNT 64
+#define AIROHA_RXQ_LRO_MAX_AGG_TIME 100
+#define AIROHA_RXQ_LRO_MAX_AGE_TIME 2000
+
+#define AIROHA_HW_FEATURES \
+ (NETIF_F_IP_CSUM | NETIF_F_RXCSUM | \
+ NETIF_F_TSO6 | NETIF_F_IPV6_CSUM | \
+ NETIF_F_SG | NETIF_F_TSO | NETIF_F_HW_TC)
+
#define PSE_RSV_PAGES 128
#define PSE_QUEUE_RSV_PAGES 64
@@ -673,6 +685,18 @@ static inline bool airoha_is_7583(struct airoha_eth *eth)
return eth->soc->version == 0x7583;
}
+static inline bool airoha_qdma_is_lro_queue(struct airoha_queue *q)
+{
+ struct airoha_qdma *qdma = q->qdma;
+ int qid = q - &qdma->q_rx[0];
+
+ /* EN7581 SoC supports at most 8 LRO rx queues */
+ BUILD_BUG_ON(hweight32(AIROHA_RXQ_LRO_EN_MASK) >
+ AIROHA_MAX_NUM_LRO_QUEUES);
+
+ return !!(AIROHA_RXQ_LRO_EN_MASK & BIT(qid));
+}
+
int airoha_get_fe_port(struct airoha_gdm_dev *dev);
bool airoha_is_valid_gdm_dev(struct airoha_eth *eth,
struct airoha_gdm_dev *dev);
diff --git a/drivers/net/ethernet/airoha/airoha_regs.h b/drivers/net/ethernet/airoha/airoha_regs.h
index 436f3c8779c1..dfc786583774 100644
--- a/drivers/net/ethernet/airoha/airoha_regs.h
+++ b/drivers/net/ethernet/airoha/airoha_regs.h
@@ -122,6 +122,20 @@
#define CDM_CRSN_QSEL_REASON_MASK(_n) \
GENMASK(4 + (((_n) % 4) << 3), (((_n) % 4) << 3))
+#define REG_CDM_LRO_RXQ(_n, _m) (CDM_BASE(_n) + 0x78 + ((_m) & 0x4))
+#define LRO_RXQ_MASK(_n) GENMASK(4 + (((_n) & 0x3) << 3), ((_n) & 0x3) << 3)
+
+#define REG_CDM_LRO_EN(_n) (CDM_BASE(_n) + 0x80)
+#define LRO_RXQ_EN_MASK GENMASK(7, 0)
+
+#define REG_CDM_LRO_LIMIT(_n) (CDM_BASE(_n) + 0x84)
+#define CDM_LRO_AGG_NUM_MASK GENMASK(23, 16)
+#define CDM_LRO_AGG_SIZE_MASK GENMASK(15, 0)
+
+#define REG_CDM_LRO_AGE_TIME(_n) (CDM_BASE(_n) + 0x88)
+#define CDM_LRO_AGE_TIME_MASK GENMASK(31, 16)
+#define CDM_LRO_AGG_TIME_MASK GENMASK(15, 0)
+
#define REG_GDM_FWD_CFG(_n) GDM_BASE(_n)
#define GDM_PAD_EN_MASK BIT(28)
#define GDM_DROP_CRC_ERR_MASK BIT(23)
@@ -883,9 +897,15 @@
#define QDMA_ETH_RXMSG_SPORT_MASK GENMASK(25, 21)
#define QDMA_ETH_RXMSG_CRSN_MASK GENMASK(20, 16)
#define QDMA_ETH_RXMSG_PPE_ENTRY_MASK GENMASK(15, 0)
+/* RX MSG2 */
+#define QDMA_ETH_RXMSG_AGG_COUNT_MASK GENMASK(31, 24)
+#define QDMA_ETH_RXMSG_L2_LEN_MASK GENMASK(6, 0)
+/* RX MSG3 */
+#define QDMA_ETH_RXMSG_AGG_LEN_MASK GENMASK(31, 16)
+#define QDMA_ETH_RXMSG_TCP_WIN_MASK GENMASK(15, 0)
struct airoha_qdma_desc {
- __le32 rsv;
+ __le32 tcp_ts_reply;
__le32 ctrl;
__le32 addr;
__le32 data;
--
2.54.0
^ permalink raw reply related
* Re: [PATCH net] eth: fbnic: take netif_addr_lock_bh() around rx mode address programming
From: Simon Horman @ 2026-06-18 14:50 UTC (permalink / raw)
To: Daniel Zahka
Cc: Alexander Duyck, Jakub Kicinski, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Sanman Pradhan,
netdev, linux-kernel
In-Reply-To: <20260617-linux-fbnic-hwaddr-v1-1-3f9f5dee7f99@gmail.com>
On Wed, Jun 17, 2026 at 03:39:49AM -0700, Daniel Zahka wrote:
> When __fbnic_set_rx_mode() is called from contexts other than
> .ndo_set_rx_mode_async(), the uc and mc addr lists are accessed
> without the addr lock that __hw_addr_sync_dev() and
> __hw_addr_unsync_dev() require. Wrap these unprotected accesses with
> netif_addr_lock_bh(). fbnic_clear_rx_mode() has similar issues.
>
> Fixes: eb690ef8d1c2 ("eth: fbnic: Add L2 address programming")
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
^ permalink raw reply
* [PATCH net v2] ice: eswitch: fix use-after-free of metadata_dst in repr release
From: Doruk Tan Ozturk @ 2026-06-18 14:50 UTC (permalink / raw)
To: anthony.l.nguyen, przemyslaw.kitszel, andrew+netdev, davem,
edumazet, kuba, pabeni
Cc: michal.swiatkowski, wojciech.drewek, intel-wired-lan, netdev,
linux-kernel, stable, horms
ice_eswitch_release_repr() frees the port representor metadata_dst via
metadata_dst_free(), which directly kfree()s the object and ignores the
dst_entry refcount. The eswitch slow-path TX routine
ice_eswitch_port_start_xmit() takes a reference on this dst with
dst_hold() and attaches it to the skb via skb_dst_set(). If such an skb
is still in flight (e.g. queued in a qdisc) when the representor is torn
down, the metadata_dst is freed while the skb still points at it. When
the skb is later freed, dst_release() operates on already-freed memory.
Replace metadata_dst_free() with dst_release() so the metadata_dst is
freed only after the last reference is dropped. The dst subsystem frees
metadata_dst objects from dst_destroy() once the refcount reaches zero
(DST_METADATA is set by metadata_dst_alloc()).
Same class of bug and fix as commit c32b26aaa2f9 ("netfilter:
nft_tunnel: fix use-after-free on object destroy").
Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment")
Cc: stable@vger.kernel.org
Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
Reviewed-by: Simon Horman <horms@kernel.org>
---
v2:
- Correct the Fixes: tag to 1a1c40df2e80 ("ice: set and release
switchdev environment"); the previously cited fff292b47ac1 only moved
the affected code rather than introducing the unbalanced free, and the
bug dates back to when switchdev support was added (Simon Horman).
- Add Simon Horman's Reviewed-by. No functional change.
drivers/net/ethernet/intel/ice/ice_eswitch.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
index 2e4f0969035f..41b30a7ca4a9 100644
--- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
+++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
@@ -95,7 +95,7 @@ ice_eswitch_release_repr(struct ice_pf *pf, struct ice_repr *repr)
return;
ice_vsi_update_security(vsi, ice_vsi_ctx_set_antispoof);
- metadata_dst_free(repr->dst);
+ dst_release(&repr->dst->dst);
repr->dst = NULL;
ice_fltr_add_mac_and_broadcast(vsi, repr->parent_mac,
ICE_FWD_TO_VSI);
--
2.43.0
^ permalink raw reply related
* [PATCH net v3] net/mlx5e: macsec: fix use-after-free of metadata_dst on RX SC delete
From: Doruk Tan Ozturk @ 2026-06-18 14:55 UTC (permalink / raw)
To: saeedm, leon, tariqt, mbloch, sd, andrew+netdev, davem, edumazet,
kuba, pabeni
Cc: borisp, raeds, ehakim, netdev, linux-rdma, linux-kernel, stable,
horms
When an offloaded MACsec RX SC is deleted, macsec_del_rxsc_ctx() released
the per-SC metadata_dst with metadata_dst_free(), which calls kfree()
unconditionally and ignores the dst reference count. The RX datapath in
mlx5e_macsec_offload_handle_rx_skb() looks up the SC under rcu_read_lock()
via xa_load() and, while still holding only the RCU read lock, takes a
reference with dst_hold() and attaches the dst to the skb with
skb_dst_set().
A reader that has already obtained the rx_sc pointer can therefore race
with the delete path:
CPU0 (del_rxsc) CPU1 (rx datapath)
-------------- ------------------
rcu_read_lock();
rx_sc = xa_load(...)->rx_sc;
xa_erase(...);
metadata_dst_free(rx_sc->md_dst); /* kfree(), ignores refcount */
dst_hold(&rx_sc->md_dst->dst); /* UAF */
skb_dst_set(skb, &rx_sc->md_dst->dst);
metadata_dst_free() frees the object even though the datapath still holds
(or is about to take) a reference, so the subsequent dst_hold() /
skb_dst_set() and the later skb free operate on freed memory.
Fix the owner side by dropping the reference with dst_release() instead of
freeing unconditionally. dst_release() only schedules the RCU-deferred
dst_destroy() once the reference count reaches zero, so a concurrent reader
that still holds a reference keeps the object alive.
Dropping the owner reference is not sufficient on its own: once the owner
reference is the last one, dst_release() drops the count to zero and the
destroy is merely RCU-deferred. A racing reader that runs plain dst_hold()
on that already-dead dst gets rcuref_get() == false but dst_hold() only
WARNs and attaches the dying dst to the skb anyway; the later skb free then
calls dst_release() on an object whose destroy is already scheduled, again
a use-after-free.
Convert the RX datapath to dst_hold_safe(), which returns false (without
warning) when the dst is already dead, and only attach it to the skb when a
reference was successfully taken. When the SC is being deleted the in-flight
packet simply proceeds without the offload metadata_dst: skb_metadata_dst()
returns NULL, the MACsec core sees !is_macsec_md_dst and skips this secy
(rx_uses_md_dst path), which is the correct behaviour for a packet whose SC
is going away.
While reworking the datapath lookup, also guard the two NULL dereferences
on the same path that an automated review (forwarded by Simon Horman)
flagged: xa_load() can return NULL when the fs_id has just been erased, and
mlx5e_macsec_add_rxsc() publishes sc_xarray_element via xa_alloc() before
rx_sc->md_dst is allocated, so a packet carrying a freshly recycled fs_id
can observe a non-NULL rx_sc whose md_dst is still NULL. Check both before
dereferencing.
Note: macsec_del_rxsc_ctx() also kfree()s rx_sc->sc_xarray_element without
an RCU grace period while the same datapath reads it under rcu_read_lock();
that is a separate pre-existing issue and is left to a follow-up patch.
Fixes: b7c9400cbc48 ("net/mlx5e: Implement MACsec Rx data path using MACsec skb_metadata_dst")
Cc: stable@vger.kernel.org
Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
---
v3:
- Also guard the RX-datapath NULL dereferences flagged by the automated
review: NULL-check the xa_load() result and rx_sc->md_dst before use.
- Note the unrelated non-RCU kfree(sc_xarray_element) in the delete path
as a separate follow-up rather than folding it in here.
v2:
- Convert the RX datapath dst_hold() to dst_hold_safe() so a reader racing
the SC delete cannot attach a dst whose last reference was just dropped
(per the automated review forwarded by Simon Horman).
v1: https://lore.kernel.org/netdev/20260615140534.52691-1-doruk@0sec.ai/
.../net/ethernet/mellanox/mlx5/core/en_accel/macsec.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
index 71b3a05..fb2c64d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
@@ -829,7 +829,7 @@ static void macsec_del_rxsc_ctx(struct mlx5e_macsec *macsec, struct mlx5e_macsec
*/
list_del_rcu(&rx_sc->rx_sc_list_element);
xa_erase(&macsec->sc_xarray, rx_sc->sc_xarray_element->fs_id);
- metadata_dst_free(rx_sc->md_dst);
+ dst_release(&rx_sc->md_dst->dst);
kfree(rx_sc->sc_xarray_element);
kfree_rcu_mightsleep(rx_sc);
}
@@ -1695,10 +1695,10 @@ void mlx5e_macsec_offload_handle_rx_skb(struct net_device *netdev,
rcu_read_lock();
sc_xarray_element = xa_load(&macsec->sc_xarray, fs_id);
- rx_sc = sc_xarray_element->rx_sc;
- if (rx_sc) {
- dst_hold(&rx_sc->md_dst->dst);
- skb_dst_set(skb, &rx_sc->md_dst->dst);
+ rx_sc = sc_xarray_element ? sc_xarray_element->rx_sc : NULL;
+ if (rx_sc && rx_sc->md_dst) {
+ if (dst_hold_safe(&rx_sc->md_dst->dst))
+ skb_dst_set(skb, &rx_sc->md_dst->dst);
}
rcu_read_unlock();
--
2.53.0
^ permalink raw reply related
* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Breno Leitao @ 2026-06-18 14:57 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Peter Zijlstra, Petr Mladek, Sebastian Andrzej Siewior,
John Ogness, Sergey Senozhatsky, Vlad Poenaru, Thomas Gleixner,
netdev, David S . Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260617132127.645534d1@kernel.org>
On Wed, Jun 17, 2026 at 01:21:27PM -0700, Jakub Kicinski wrote:
> On Wed, 17 Jun 2026 07:56:50 -0700 Breno Leitao wrote:
> > As far as I can tell, there isn't a network driver today whose transmit
> > path is completely lockless, so, even if we make netpoll lockless.
> >
> > It's unlikely any NIC will ever achieve this, given that NIC TX
> > fundamentally relies on a shared DMA ring and doorbell register, which
> > inherently cannot be made lockless.
>
> The lock which protects the queue is maintained by the stack,
> and we trylock it. Maybe I lost the thread but if you're saying
> that writes to netconsole are impossible from arbitrary context,
> that is _not_ true, AFAIU. We can queue a packet and kick off
> the transfer on well-behaved drivers.
>
> Main problem is the opportunistic freeing up of the queue space.
> If we could avoid that in atomic context I think we'd be good.
Thanks for the clarification, this is quite valuable.
Let me verify my understanding: if we switched to __raise_softirq_irqoff()
in dev_kfree_skb_irq_reason(), the issue would be resolved since we'd
avoid waking ksoftirqd and therefore wouldn't touch the runqueue lock in this
code path.
However, while that would eliminate the nested lock problem, it could
increase memory pressure by delaying SKB garbage collection, which may
not be acceptable.
Naive question: What if we deferred SKB cleanup only during netpoll operations?
Such as tracking in_netpoll per cpu:
struct softnet_data {
....
+ bool in_netpoll;
}
and then choosing between __raise_softirq_irqoff() and raise_softirq_irqoff()?
@@ -3456,7 +3456,13 @@ void dev_kfree_skb_irq_reason(struct sk_buff *skb, enum skb_drop_reason reason)
local_irq_save(flags);
skb->next = __this_cpu_read(softnet_data.completion_queue);
__this_cpu_write(softnet_data.completion_queue, skb);
- raise_softirq_irqoff(NET_TX_SOFTIRQ);
+ if (__this_cpu_read(softnet_data.in_netpoll))
+ __raise_softirq_irqoff(NET_TX_SOFTIRQ);
+ else
+ raise_softirq_irqoff(NET_TX_SOFTIRQ);
local_irq_restore(flags);
}
Is it too hacky!?
Thanks,
--breno
^ permalink raw reply
* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Sebastian Andrzej Siewior @ 2026-06-18 15:04 UTC (permalink / raw)
To: Zhou, Yun
Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
maxime.chevallier, netdev, linux-kernel
In-Reply-To: <06d0158e-bf4c-4ad1-8ad3-c8176003ab11@windriver.com>
On 2026-06-18 21:56:12 [+0800], Zhou, Yun wrote:
>
> > But if the thread is idle then you have one enable too many, don't you?
> > Well you have the NAPI callback which does disable on the local CPU and
> > this resume which enables it on every CPU. So this does not look right.
> >
>
> The enable in resume is intentionally unconditional and idempotent
> (writing MPIC_INT_CLEAR_MASK on an already unmasked IRQ is a no-op).
>
> > The interesting question is what happens to the enable_percpu_irq() from
> > the mvneta_poll(). Is it lost? And if so, how/ why?
> >
>
> The enable_percpu_irq() from mvneta_poll is not "lost" — it never
> gets a chance to execute. The sequence is:
>
> 1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
> 2. PM freezes kthreads (on PREEMPT_RT, softirq runs in kthread)
But napi_schedule() runs in ksoftird, doesn't it? The per-CPU IRQs are
not threaded. That does not look optimal.
> 3. NAPI poll cannot run → enable_percpu_irq() is never called
> 4. mvneta_stop_dev → napi_disable(): cancels the scheduled poll
> but does NOT execute the completion path (no enable_percpu_irq)
napi_schedule() sets NAPIF_STATE_SCHED.
napi_disable() sets NAPI_STATE_DISABLE and waits until NAPIF_STATE_SCHED
is cleared. So if NAPIF_STATE_SCHED was set then enable_percpu_irq()
will be invoked unless it leaves somewhere early.
However, if DISABLED was already set then it disables the IRQ source but
does not schedule NAPI. This is probably what happens.
> 5. Resume → napi_enable(): resets NAPI state but MPIC stays masked
>
> The unconditional enable in resume covers this case. When NAPI was
> idle at suspend time, the extra enable is harmless.
There is no desc::depth counting here, that got me confused. But that
per-CPU irq is not optimal. Is this a SoC limitation or a design choice?
Having a NAPI instance with IRQ per queue and those configured and
spread among CPUs should cause less trouble and is what others do.
In fact is the only net driver using per-CPU interrupts.
> BR,
> Yun
Sebastian
^ permalink raw reply
* Re: [PATCH net] octeontx2-af: fix CGX debugfs RVU AF PCI reference leaks
From: Simon Horman @ 2026-06-18 15:07 UTC (permalink / raw)
To: Ratheesh Kannoth
Cc: davem, hkelam, lcherian, linux-kernel, netdev, pabeni, sgoutham,
andrew+netdev, edumazet, kuba, Yuho Choi
In-Reply-To: <20260617104525.1321395-1-rkannoth@marvell.com>
On Wed, Jun 17, 2026 at 04:15:25PM +0530, Ratheesh Kannoth wrote:
> CGX per-lmac debugfs seq readers obtained struct rvu via
> pci_get_drvdata(pci_get_device(..., PCI_DEVID_OCTEONTX2_RVU_AF, ...)),
> which leaks a PCI device reference on every read. Store rvu and the CGX
> handle in debugfs inode private data when creating stats, mac_filter,
> and fwdata files (one context per CGX), and use debugfs aux numbers for
> fwdata so lmac_id matches the other CGX debugfs entries.
>
> Fixes: f967488d095e ("octeontx2-af: Add per CGX port level NIX Rx/Tx counters")
> Fixes: dbc52debf95f ("octeontx2-af: Debugfs support for DMAC filters")
> Fixes: 49f02e6877d1 ("Octeontx2-af: Debugfs support for firmware data")
> Cc: Linu Cherian <lcherian@marvell.com>
> Reported-by: Yuho Choi <dbgh9129@gmail.com>
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
The nit below not withstanding this looks good to me.
Reviewed-by: Simon Horman <horms@kernel.org>
> ---
> .../marvell/octeontx2/af/rvu_debugfs.c | 77 ++++++++++---------
> 1 file changed, 42 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
...
> @@ -2831,18 +2839,14 @@ static void rvu_dbg_npa_init(struct rvu *rvu)
>
> static int cgx_print_stats(struct seq_file *s, int lmac_id)
> {
> + struct rvu_cgx_lmac_dbgfs_ctx *dctx = s->private;
> struct cgx_link_user_info linfo;
> struct mac_ops *mac_ops;
> - void *cgxd = s->private;
> + void *cgxd = dctx->cgxd;
> + struct rvu *rvu = dctx->rvu;
nit: It would be nice to preserve reverse xmas tree order - longest line to
shortest - for local variable declarations. Likewise elsewhere in this
patch.
> u64 ucast, mcast, bcast;
> int stat = 0, err = 0;
> u64 tx_stat, rx_stat;
...
^ permalink raw reply
* [PATCH iwl-net 1/2] ice: skip per-VLAN promisc rules when default VSI Rx rule is set
From: Petr Oros @ 2026-06-18 15:09 UTC (permalink / raw)
To: netdev
Cc: Petr Oros, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alice Michael, Jacob Keller, Ivan Vecera, Michal Swiatkowski,
Grzegorz Nitka, intel-wired-lan, linux-kernel
In-Reply-To: <cover.1781786935.git.poros@redhat.com>
When an ice port is part of a vlan-filtering bridge with a wide VLAN
trunk and the netdev is in IFF_PROMISC (typical for bond slaves
attached to a bridge), the driver installs per-VLAN
ICE_SW_LKUP_PROMISC_VLAN entries (recipe 9) in addition to the broad
ICE_SW_LKUP_DFLT VSI Rx rule (recipe 5). Each per-VLAN rule consumes
one Flow Lookup Unit (FLU) entry from a fixed hardware pool of "up to
32K FLU entries" per device, documented in the E810 datasheet
(613875-009 section 7.8.10, Table 7-18, page 1015).
With three active PFs sharing one switch context and a bridge trunk of
vid 2-4094, the configuration would require roughly
3 PFs * 4093 VLANs * 3 rules per VLAN per PF ~= 36,800 rules
which exceeds the 32K FLU budget. Firmware then responds to further
Add Switch Rules requests with AQ retval 0x10 (LIBIE_AQ_RC_ENOSPC) and
the user-visible failure surfaces as
ice 0000:5c:00.1: Failed to set VSI 14 as the default forwarding
VSI, error -5
ice 0000:5c:00.1 ens1f1: Error -5 setting default VSI 14 Rx rule
After a switch context has been driven into overrun, subsequent
retries can come back as AQ retval 0x2 (LIBIE_AQ_RC_ENOENT), which has
misled triage attempts toward a perceived recipe binding defect
rather than a capacity issue.
When the DFLT VSI Rx rule is in place it catches every packet on the
lport regardless of VLAN tag, so the per-VLAN PROMISC_VLAN expansion
is redundant. The recipe 4 VLAN prune entries are still installed
per VLAN and continue to track the allowed VID set, but the
IFF_PROMISC sync path disables their enforcement on the VSI via
vlan_ops->dis_rx_filtering() before ice_set_promisc() runs.
ena_rx_filtering() is restored when IFF_PROMISC is cleared.
Skip the per-VLAN expansion at the two call sites that drive it:
ice_set_promisc() falls through to ice_fltr_set_vsi_promisc() and
ice_vlan_rx_add_vid() omits the per-VLAN ICE_MCAST_VLAN_PROMISC_BITS
add. Plain IFF_ALLMULTI without an installed DFLT VSI rule is
unchanged and still installs per-VLAN multicast promisc rules.
Both checks use ice_is_vsi_dflt_vsi() which inspects the recipe
filter list for an installed DFLT rule on this VSI, not
netdev->flags & IFF_PROMISC. The HW-state predicate avoids two
regression vectors that a user-intent predicate would introduce:
1. ice_lag_is_switchdev_running() short-circuits ice_set_dflt_vsi()
to return 0 without installing the DFLT rule for a PF in
switchdev LAG mode. An IFF_PROMISC-only check would also
suppress the per-VLAN fallback, leaving the PF with no rule.
2. When ice_set_dflt_vsi() returns a non-EEXIST error (FLU
exhausted, switch context divergence), the driver clears
IFF_PROMISC from vsi->current_netdev_flags but the netdev's own
flags retain IFF_PROMISC. The user-intent predicate would still
suppress the per-VLAN fallback even though DFLT failed to
install.
The predicate is install-time only. The IFF_PROMISC off path closes
the lifecycle gap in ice_vsi_exit_dflt_promisc(): for an IFF_ALLMULTI
VSI with VLANs it reinstates the per-VID rules before clearing the
default rule, so multicast coverage never lapses. If that AQ call
fails the default rule is left in place, ice_vsi_exit_dflt_promisc()
returns the error, and the sync_fltr pass bails with
vsi->current_netdev_flags |= IFF_PROMISC; the current/netdev flag
mismatch re-fires the IFF_PROMISC off path on the next sync. Clearing
the default rule first would instead expose a window where neither
the default rule nor the per-VID rules carry multicast.
If ice_clear_dflt_vsi() fails after the per-VID rules were reinstated
they are deliberately not rolled back. Clearing the default rule is a
removal that frees an FLU entry rather than allocating one, so it
cannot fail for lack of space; a failure is a transient AdminQ error.
The per-VID rules are the steady state for an IFF_ALLMULTI VLAN VSI,
so the only redundant entry left behind is the single un-removed
default rule, not the per-VID set. The retry re-enters this path,
ice_fltr_set_vlan_vsi_promisc() returns -EEXIST for the rules that
already exist so nothing is reallocated, and the default rule is
removed on the next attempt. Rolling the per-VID rules back here would
instead churn thousands of removes and re-adds on every retry.
After the default rule is gone the vid=0 PROMISC rule that paired
with it is redundant and is dropped, but only to reclaim a filter
entry, so a failure there is logged and does not abort the
transition.
ice_set_vsi_promisc() and ice_clear_vsi_promisc() dispatch the
recipe based on whether ICE_PROMISC_VLAN_RX/TX bits are present in
the mask: with the bits set, recipe ICE_SW_LKUP_PROMISC_VLAN is
used; otherwise ICE_SW_LKUP_PROMISC. The else branch in
ice_set_promisc() installs the vid=0 rule in ICE_SW_LKUP_PROMISC.
Because ice_clear_promisc() with VLANs present adds the VLAN bits
and would search ICE_SW_LKUP_PROMISC_VLAN, the recipe mismatch
would leave the vid=0 ICE_SW_LKUP_PROMISC rule orphaned when VLANs
are configured. This is a single stale rule, not a per-cycle leak:
re-adding it on the next promisc on returns -EEXIST rather than
allocating a new entry. The set-time recipe is not recorded, so
ice_clear_promisc() clears both recipes; clearing a rule that is not
present succeeds, both clears run unconditionally, and the first
error is returned.
The two VLAN-0 recipe transition blocks in ice_vlan_rx_add_vid()
and ice_vlan_rx_kill_vid() that promote / demote the vid=0 rule
between ICE_SW_LKUP_PROMISC and ICE_SW_LKUP_PROMISC_VLAN are
likewise guarded by !ice_is_vsi_dflt_vsi(). With DFLT in place the
vid=0 rule already covers every VID and a recipe swap would only
install a redundant rule.
Lab reproduction on an E810-C with the same firmware family (4.80,
NVM 1.3805.0, DDP 1.3.43.0) using four PFs in vlan-filtering bridges
with vid 2-4094 and the slaves brought to IFF_PROMISC before the
bridge VLAN bulk add:
before fix: ~12,279 AQ Add Switch Rules per PF, ENOSPC and ENOENT
responses in dmesg, DFLT VSI Rx rule install fails on
the affected PF
after fix: ~4,093 AQ Add Switch Rules per PF, no AQ errors, DFLT
VSI Rx rule installs on every PF
The 66.7% reduction in installed switch rules per PF matches the
expected per-VLAN saving: a single DFLT rule replaces the per-VID
PROMISC_VLAN expansion.
Functional regression test with vid 2-100 trunk between two ice
ports through the lab switch (40/40 PASS, 0 AQ errors, 0 ENOSPC
at 4093-VID customer scale):
vid 50 unicast, vid 100 unicast, vid 50 broadcast ARP,
vid 100 multicast IPv6 ND
vid 200/500/1500/4000 isolation (out-of-trunk) and untagged not
leaked: 0 packets reach any bridge endpoint
IGMP/MLD snooping, Jumbo MTU 9000, reserved-multicast STP BPDU
IFF_PROMISC + IFF_ALLMULTI transition (off while allmulti stays)
Regression reproducer for commit 1273f89578f2 ("ice: Fix broken
IFF_ALLMULTI handling"): allmulti on -> add vid -> allmulti off
-> allmulti on plus the orphan-rule Scenario 2; both converge
with no stale rules
100-VID, 1000-VID, 4093-VID stress cycles (5/3/2 iterations each)
switchdev mode toggle preserves IFF_PROMISC pruning state across
the session (vid 999 multicast received before and after the
legacy -> switchdev -> legacy cycle)
SR-IOV: VFs unaffected because ice_set_promisc() early-returns
for non-PF VSI and VF representors do not register
ndo_vlan_rx_add_vid
Fixes: 1273f89578f2 ("ice: Fix broken IFF_ALLMULTI handling")
Signed-off-by: Petr Oros <poros@redhat.com>
---
drivers/net/ethernet/intel/ice/ice_main.c | 90 ++++++++++++++++++-----
1 file changed, 70 insertions(+), 20 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 6d24056c247cf4..af8df81fc45623 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -274,7 +274,8 @@ static int ice_set_promisc(struct ice_vsi *vsi, u8 promisc_m)
if (vsi->type != ICE_VSI_PF)
return 0;
- if (ice_vsi_has_non_zero_vlans(vsi)) {
+ /* skip per-VID expansion; the DFLT Rx rule already covers every VID */
+ if (ice_vsi_has_non_zero_vlans(vsi) && !ice_is_vsi_dflt_vsi(vsi)) {
promisc_m |= (ICE_PROMISC_VLAN_RX | ICE_PROMISC_VLAN_TX);
status = ice_fltr_set_vlan_vsi_promisc(&vsi->back->hw, vsi,
promisc_m);
@@ -304,9 +305,19 @@ static int ice_clear_promisc(struct ice_vsi *vsi, u8 promisc_m)
return 0;
if (ice_vsi_has_non_zero_vlans(vsi)) {
- promisc_m |= (ICE_PROMISC_VLAN_RX | ICE_PROMISC_VLAN_TX);
+ int vid0_status;
+
+ /* set time used either recipe (per-VID PROMISC_VLAN, or vid=0
+ * PROMISC via the ice_set_promisc() else branch), so clear
+ * both; clearing an absent rule succeeds
+ */
status = ice_fltr_clear_vlan_vsi_promisc(&vsi->back->hw, vsi,
- promisc_m);
+ promisc_m | ICE_PROMISC_VLAN_RX |
+ ICE_PROMISC_VLAN_TX);
+ vid0_status = ice_fltr_clear_vsi_promisc(&vsi->back->hw,
+ vsi->idx, promisc_m, 0);
+ if (!status)
+ status = vid0_status;
} else {
status = ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
promisc_m, 0);
@@ -317,6 +328,49 @@ static int ice_clear_promisc(struct ice_vsi *vsi, u8 promisc_m)
return status;
}
+/**
+ * ice_vsi_exit_dflt_promisc - drop the default VSI Rx rule on promisc off
+ * @vsi: the VSI leaving promiscuous mode
+ *
+ * For an IFF_ALLMULTI VSI with VLANs the per-VID multicast rules are
+ * reinstated before the default rule is cleared so coverage never lapses;
+ * the then redundant vid=0 rule is dropped best-effort. The callees log
+ * their own failures, so error returns are not re-logged here.
+ *
+ * Return: 0 on success, negative on error with the default rule left in place.
+ */
+static int ice_vsi_exit_dflt_promisc(struct ice_vsi *vsi)
+{
+ struct ice_vsi_vlan_ops *vlan_ops = ice_get_compat_vsi_vlan_ops(vsi);
+ struct net_device *netdev = vsi->netdev;
+ struct ice_hw *hw = &vsi->back->hw;
+ bool restore_mc;
+ int err;
+
+ restore_mc = (vsi->current_netdev_flags & IFF_ALLMULTI) &&
+ ice_vsi_has_non_zero_vlans(vsi);
+
+ if (restore_mc) {
+ err = ice_fltr_set_vlan_vsi_promisc(hw, vsi,
+ ICE_MCAST_VLAN_PROMISC_BITS);
+ if (err && err != -EEXIST)
+ return err;
+ }
+
+ err = ice_clear_dflt_vsi(vsi);
+ if (err)
+ return err;
+
+ if (netdev->features & NETIF_F_HW_VLAN_CTAG_FILTER)
+ vlan_ops->ena_rx_filtering(vsi);
+
+ if (restore_mc)
+ ice_fltr_clear_vsi_promisc(hw, vsi->idx, ICE_MCAST_PROMISC_BITS,
+ 0);
+
+ return 0;
+}
+
/**
* ice_vsi_sync_fltr - Update the VSI filter list to the HW
* @vsi: ptr to the VSI
@@ -442,17 +496,12 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
} else {
/* Clear Rx filter to remove traffic from wire */
if (ice_is_vsi_dflt_vsi(vsi)) {
- err = ice_clear_dflt_vsi(vsi);
+ err = ice_vsi_exit_dflt_promisc(vsi);
if (err) {
- netdev_err(netdev, "Error %d clearing default VSI %i Rx rule\n",
- err, vsi->vsi_num);
vsi->current_netdev_flags |=
IFF_PROMISC;
goto out_promisc;
}
- if (vsi->netdev->features &
- NETIF_F_HW_VLAN_CTAG_FILTER)
- vlan_ops->ena_rx_filtering(vsi);
}
/* disable allmulti here, but only if allmulti is not
@@ -3676,10 +3725,9 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
usleep_range(1000, 2000);
- /* Add multicast promisc rule for the VLAN ID to be added if
- * all-multicast is currently enabled.
- */
- if (vsi->current_netdev_flags & IFF_ALLMULTI) {
+ /* skip the per-VID rule when the DFLT Rx rule already covers this VID */
+ if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+ !ice_is_vsi_dflt_vsi(vsi)) {
ret = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
ICE_MCAST_VLAN_PROMISC_BITS,
vid);
@@ -3697,11 +3745,12 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
if (ret)
goto finish;
- /* If all-multicast is currently enabled and this VLAN ID is only one
- * besides VLAN-0 we have to update look-up type of multicast promisc
- * rule for VLAN-0 from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN.
+ /* On the first non-zero VLAN, promote the VLAN-0 multicast promisc
+ * rule from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN. Skip when
+ * the DFLT Rx rule is installed; it already covers every VID.
*/
if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+ !ice_is_vsi_dflt_vsi(vsi) &&
ice_vsi_num_non_zero_vlans(vsi) == 1) {
ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
ICE_MCAST_PROMISC_BITS, 0);
@@ -3764,11 +3813,12 @@ int ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
ICE_MCAST_VLAN_PROMISC_BITS, vid);
if (!ice_vsi_has_non_zero_vlans(vsi)) {
- /* Update look-up type of multicast promisc rule for VLAN 0
- * from ICE_SW_LKUP_PROMISC_VLAN to ICE_SW_LKUP_PROMISC when
- * all-multicast is enabled and VLAN 0 is the only VLAN rule.
+ /* Last non-zero VLAN gone: demote the VLAN-0 multicast promisc
+ * rule back to ICE_SW_LKUP_PROMISC. Skip when the DFLT Rx rule
+ * is installed; no recipe swap is needed.
*/
- if (vsi->current_netdev_flags & IFF_ALLMULTI) {
+ if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
+ !ice_is_vsi_dflt_vsi(vsi)) {
ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
ICE_MCAST_VLAN_PROMISC_BITS,
0);
--
2.53.0
^ permalink raw reply related
* [PATCH iwl-net 2/2] ice: preserve uplink DFLT Rx rule on switchdev release
From: Petr Oros @ 2026-06-18 15:09 UTC (permalink / raw)
To: netdev
Cc: Petr Oros, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alice Michael, Jacob Keller, Ivan Vecera, Michal Swiatkowski,
Grzegorz Nitka, intel-wired-lan, linux-kernel
In-Reply-To: <cover.1781786935.git.poros@redhat.com>
ice_eswitch_setup_env() calls ice_set_dflt_vsi() to install the
ICE_SW_LKUP_DFLT Rx rule on the uplink VSI. The helper returns 0 even
when the rule is already in place, so the call is a no-op if
ice_vsi_sync_fltr() had previously installed the DFLT rule in response
to IFF_PROMISC on the uplink netdev. ice_remove_vsi_fltr() called
earlier in ice_eswitch_setup_env() does not affect this rule because
ice_remove_vsi_lkup_fltr() lacks a case for ICE_SW_LKUP_DFLT and falls
into its default branch which only logs. Switchdev mode then adds an
ICE_FLTR_TX leg via ice_cfg_dflt_vsi() on the same VSI handle.
ice_eswitch_release_env() unconditionally removed both the Rx and Tx
DFLT rules. When the Rx DFLT was installed by ice_vsi_sync_fltr()
before the switchdev session started, this clobbered promisc state the
operator had asked for: the DFLT Rx rule disappeared while IFF_PROMISC
was still set on the netdev, and the IFF_PROMISC sync path was not
retriggered, so the uplink ended the session without the catch-all
rule the netdev flags requested.
Skip the Rx DFLT removal when the uplink is still promiscuous, both in
ice_eswitch_release_env() and in the err_def_tx unwind of
ice_eswitch_setup_env(). The Tx leg installed by switchdev is always
removed since switchdev owns it.
The ena_rx_filtering() call earlier in ice_eswitch_release_env() is
left unconditional: it calls ice_cfg_vlan_pruning(), which returns
without enabling pruning while the netdev is in IFF_PROMISC, so it
cannot re-enable VLAN pruning under the preserved DFLT rule and drop
tagged traffic. Pruning is re-enabled later, when the IFF_PROMISC sync
path runs after promisc is actually cleared.
Use vsi->current_netdev_flags rather than the live netdev->flags for
this test. netdev->flags is written under RTNL by dev_change_flags(),
while ice_eswitch_release_env() runs under devl_lock, so reading it
here would be a TOCTOU against a concurrent promisc change. The
IFF_PROMISC bit of current_netdev_flags is written only under
ICE_CFG_BUSY by ice_vsi_sync_fltr(), and ice_set_rx_mode() gates that
sync off for the uplink while ice_is_switchdev_running() is true. The
bit is therefore frozen for the whole session and stable when
release_env reads it.
Because the sync is gated off during the session, a promisc change the
operator makes while switchdev runs never reaches ice_vsi_sync_fltr():
current_netdev_flags keeps the value captured before the session while
netdev->flags carries the new one. Once switchdev is torn down and
pf->eswitch.is_running is cleared, schedule a filter sync from
ice_eswitch_disable_switchdev() so the suppressed change is replayed
and the DFLT Rx rule is reconciled with the current netdev flags. This
also closes the window where release_env kept the rule based on the
frozen flag but the operator had since cleared IFF_PROMISC.
Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment")
Signed-off-by: Petr Oros <poros@redhat.com>
---
drivers/net/ethernet/intel/ice/ice_eswitch.c | 32 +++++++++++++++++---
1 file changed, 28 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
index 2e4f0969035f77..b6073fc2375019 100644
--- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
+++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
@@ -66,8 +66,10 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
ICE_FLTR_TX);
err_def_tx:
- ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
- ICE_FLTR_RX);
+ /* keep the Rx DFLT rule if still promiscuous (see release_env) */
+ if (!(uplink_vsi->current_netdev_flags & IFF_PROMISC))
+ ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
+ false, ICE_FLTR_RX);
err_def_rx:
ice_vsi_del_vlan_zero(uplink_vsi);
err_vlan_zero:
@@ -275,11 +277,23 @@ static void ice_eswitch_release_env(struct ice_pf *pf)
vlan_ops = ice_get_compat_vsi_vlan_ops(uplink_vsi);
ice_vsi_update_local_lb(uplink_vsi, false);
+ /* No-op while IFF_PROMISC is set: ice_cfg_vlan_pruning() self-gates on
+ * it, so this cannot re-enable VLAN pruning under a preserved DFLT rule.
+ */
vlan_ops->ena_rx_filtering(uplink_vsi);
ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
ICE_FLTR_TX);
- ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
- ICE_FLTR_RX);
+
+ /* Keep the Rx DFLT rule if the uplink is still promiscuous; it must
+ * outlive the session. current_netdev_flags is used because its
+ * IFF_PROMISC bit only changes under ice_vsi_sync_fltr(), gated off
+ * during switchdev, so the read cannot race the RTNL netdev->flags.
+ * Any change made during the session is replayed on teardown.
+ */
+ if (!(uplink_vsi->current_netdev_flags & IFF_PROMISC))
+ ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
+ false, ICE_FLTR_RX);
+
ice_fltr_add_mac_and_broadcast(uplink_vsi,
uplink_vsi->port_info->mac.perm_addr,
ICE_FWD_TO_VSI);
@@ -327,10 +341,20 @@ static int ice_eswitch_enable_switchdev(struct ice_pf *pf)
*/
static void ice_eswitch_disable_switchdev(struct ice_pf *pf)
{
+ struct ice_vsi *uplink_vsi = pf->eswitch.uplink_vsi;
+
ice_eswitch_br_offloads_deinit(pf);
ice_eswitch_release_env(pf);
pf->eswitch.is_running = false;
+
+ /* ice_set_rx_mode() was gated off during the session; replay a filter
+ * sync so any suppressed promisc change reconciles the DFLT Rx rule.
+ */
+ set_bit(ICE_VSI_UMAC_FLTR_CHANGED, uplink_vsi->state);
+ set_bit(ICE_VSI_MMAC_FLTR_CHANGED, uplink_vsi->state);
+ set_bit(ICE_FLAG_FLTR_SYNC, pf->flags);
+ ice_service_task_schedule(pf);
}
/**
--
2.53.0
^ permalink raw reply related
* [PATCH iwl-net 0/2] ice: fix DFLT Rx rule handling for promisc and switchdev
From: Petr Oros @ 2026-06-18 15:09 UTC (permalink / raw)
To: netdev
Cc: Petr Oros, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alice Michael, Jacob Keller, Ivan Vecera, Michal Swiatkowski,
Grzegorz Nitka, intel-wired-lan, linux-kernel
Two fixes for the uplink default VSI Rx rule (DFLT) on E810 when the
netdev is in IFF_PROMISC.
Patch 1 drops the redundant per-VLAN promisc expansion that exhausts
the FLU pool on a wide VLAN trunk across several PFs.
Patch 2 keeps the DFLT Rx rule across a switchdev teardown instead of
clobbering the promisc state the operator asked for.
Lab tested on E810-C: functional, VLAN isolation, IFF_ALLMULTI
regression, stress/flap and switchdev-toggle suites pass with no AQ
errors, and the FLU pool stays under its ceiling with all four PFs
loaded.
Petr Oros (2):
ice: skip per-VLAN promisc rules when default VSI Rx rule is set
ice: preserve uplink DFLT Rx rule on switchdev release
drivers/net/ethernet/intel/ice/ice_eswitch.c | 32 ++++++-
drivers/net/ethernet/intel/ice/ice_main.c | 90 +++++++++++++++-----
2 files changed, 98 insertions(+), 24 deletions(-)
--
2.53.0
^ permalink raw reply
* lan7801 looses VLAN Filter Table
From: Sven Schuchmann @ 2026-06-18 15:18 UTC (permalink / raw)
To: netdev@vger.kernel.org
Hi,
I have a problem with a lan7801 chip in Kernel 6.18. I configure VLAN-ID (2) and an IP address.
But if I disconnect and connect the network-cable several times at some point no packets are
received anymore. Without using VLAN this does not happen.
I tracked this down that sometimes the VLAN Filter table seems
to get cleared. I hooked into the lan78xx.c driver to dump the vlan table:
static void lan78xx_get_stats(struct net_device *netdev,
struct ethtool_stats *stats, u64 *data)
{
struct lan78xx_net *dev = netdev_priv(netdev);
struct lan78xx_priv *pdata = (struct lan78xx_priv *)(dev->data[0]);
lan78xx_update_stats(dev);
for (int i = 0; i < 3; i++) {
u32 buf;
lan78xx_dataport_read(dev, DP_SEL_RSEL_VLAN_DA_, i, 1, &buf);
if (pdata->vlan_table[i] != buf)
netdev_err(dev->net, "VLAN TABLE %d: 0x%08x 0x%08x", i, pdata->vlan_table[i], buf);
else
netdev_info(dev->net, "VLAN TABLE %d: 0x%08x 0x%08x", i, pdata->vlan_table[i], buf);
}
So I can "read out" the table if I do "ethtool -S" and see it in the kernel log.
Normally the output looks like this:
VLAN TABLE 0: 0x00000005 0x00000005
So the table looks as expected. The Local Filter table from pdata is the same as in the chip itself.
But after some cable disconnects and connects I see this:
VLAN TABLE 0: 0x00000005 0x00000000
So the table got cleared or deleted and no paketes on VLAN-ID 2 go through.
I even can do this after I read out the table in lan78xx_get_stats():
lan78xx_dataport_write(dev, DP_SEL_RSEL_VLAN_DA_, 0,
DP_SEL_VHF_VLAN_LEN, pdata->vlan_table);
...and with this I can "fix" the table again from the ethtool and it starts working again.
Has someone seen something like this or can point me to a direction where
I could reinit this table (I already tried at the end of lan78xx_mac_link_up() without success...)
Thanks!
Regards, Sven
^ permalink raw reply
* Re: [PATCH iwl v3] ice: retry reading NVM if admin queue returns EBUSY
From: Simon Horman @ 2026-06-18 15:20 UTC (permalink / raw)
To: robert.malz
Cc: Simon Horman, anthony.l.nguyen, przemyslaw.kitszel,
intel-wired-lan, netdev
In-Reply-To: <20260617120753.1785565-1-robert.malz@canonical.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ice: retry reading NVM if admin queue returns EBUSY
When the admin queue command to read NVM returns EBUSY, the driver
currently treats it as a fatal error and aborts the entire read operation.
This patch adds retry logic to handle this condition.
> diff --git a/drivers/net/ethernet/intel/ice/ice_nvm.c b/drivers/net/ethernet/intel/ice/ice_nvm.c
> --- a/drivers/net/ethernet/intel/ice/ice_nvm.c
> +++ b/drivers/net/ethernet/intel/ice/ice_nvm.c
[ ... ]
> @@ -96,11 +97,25 @@ ice_read_flat_nvm(struct ice_hw *hw, u32 offset, u32 *length, u8 *data,
> offset, read_size,
> data + bytes_read, last_cmd,
> read_shadow_ram, NULL);
> - if (status)
> - break;
> -
> - bytes_read += read_size;
> - offset += read_size;
> + if (status) {
> + if (hw->adminq.sq_last_status != LIBIE_AQ_RC_EBUSY ||
> + retry_cnt > ICE_SQ_SEND_MAX_EXECUTE)
[Severity: Low]
Does this check allow an extra retry execution? Since retry_cnt starts at 0,
using a strictly greater-than check allows the loop to retry 4 times
(5 total executions) instead of the 3 executions intended by
ICE_SQ_SEND_MAX_EXECUTE.
> + break;
> + ice_debug(hw, ICE_DBG_NVM,
> + "NVM read EBUSY error, retry %d\n",
> + retry_cnt + 1);
> + last_cmd = false;
> + ice_release_nvm(hw);
> + msleep(ICE_SQ_SEND_DELAY_TIME_MS);
[Severity: High]
Does dropping the NVM lock here violate the caller's assumption of mutual
exclusion?
Callers of ice_read_flat_nvm() typically acquire the NVM lock beforehand to
ensure an atomic read of multiple sectors without interleaved writes. By
explicitly dropping the lock during the msleep() window, another thread, VF,
or PF could acquire the NVM lock and modify the NVM before the read resumes,
potentially resulting in reading torn or inconsistent NVM data.
> + status = ice_acquire_nvm(hw, ICE_RES_READ);
> + if (status)
> + break;
[Severity: High]
Can this lead to an unbalanced NVM resource lock release and potential data
corruption?
If ice_acquire_nvm() fails here, the loop breaks and returns an error with the
lock currently unheld. However, callers like ice_read_flash_module() are
unaware of this internal lock state change and will unconditionally call
ice_release_nvm() upon failure:
drivers/net/ethernet/intel/ice/ice_nvm.c:ice_read_flash_module() {
status = ice_acquire_nvm(hw, ICE_RES_READ);
if (status)
return status;
status = ice_read_flat_nvm(hw, start + offset, &length, data, false);
ice_release_nvm(hw);
...
Because firmware tracks resource locks per-PF rather than per-OS-thread, this
errant second release could forcefully strip the lock from another thread on
the same PF that successfully acquired it, exposing the NVM to concurrent
modification.
> + retry_cnt++;
> + } else {
> + bytes_read += read_size;
> + offset += read_size;
> + retry_cnt = 0;
> + }
> } while (!last_cmd);
>
> *length = bytes_read;
^ permalink raw reply
* Re: [PATCH net 1/5] net: hns3: unify copper port ksettings configuration path
From: Simon Horman @ 2026-06-18 15:33 UTC (permalink / raw)
To: shaojijie
Cc: Simon Horman, davem, edumazet, kuba, pabeni, andrew+netdev,
shenjian15, liuyonglong, chenhao418, huangdonghua3, yangshuaisong,
netdev, linux-kernel
In-Reply-To: <20260617112721.75186-2-shaojijie@huawei.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: hns3: unify copper port ksettings configuration path
Refactor hns3_set_link_ksettings() and hclge_set_phy_link_ksettings()
to unify the configuration path for copper ports.
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
[ ... ]
> @@ -863,6 +859,8 @@ static int hns3_set_link_ksettings(struct net_device *netdev,
> struct hnae3_handle *handle = hns3_get_handle(netdev);
> struct hnae3_ae_dev *ae_dev = hns3_get_ae_dev(handle);
> const struct hnae3_ae_ops *ops = hns3_get_ops(handle);
> + u8 module_type = HNAE3_MODULE_TYPE_UNKNOWN;
> + u8 media_type = HNAE3_MEDIA_TYPE_UNKNOWN;
> int ret;
>
> /* Chip don't support this mode. */
> @@ -878,22 +876,20 @@ static int hns3_set_link_ksettings(struct net_device *netdev,
> cmd->base.autoneg, cmd->base.speed, cmd->base.duplex,
> cmd->lanes);
>
> - /* Only support ksettings_set for netdev with phy attached for now */
> - if (netdev->phydev) {
> - if (cmd->base.speed == SPEED_1000 &&
> - cmd->base.autoneg == AUTONEG_DISABLE)
> - return -EINVAL;
[Severity: Medium]
Does removing this validation allow invalid configurations for native copper
PHYs?
The previous code explicitly rejected SPEED_1000 with AUTONEG_DISABLE because
IEEE 802.3 1000BASE-T strictly requires autonegotiation. By delegating
directly to phy_ethtool_ksettings_set(), which permits this for 1000BASE-X,
the driver might now silently accept an invalid configuration.
> - return phy_ethtool_ksettings_set(netdev->phydev, cmd);
> - } else if (test_bit(HNAE3_DEV_SUPPORT_PHY_IMP_B, ae_dev->caps) &&
> - ops->set_phy_link_ksettings) {
> + if (!ops->get_media_type)
> + return -EOPNOTSUPP;
> + ops->get_media_type(handle, &media_type, &module_type);
>
> - return ops->set_phy_link_ksettings(handle, cmd);
> - }
> + if (media_type == HNAE3_MEDIA_TYPE_COPPER) {
> + if (!ops->set_phy_link_ksettings)
> + return -EOPNOTSUPP;
> + return ops->set_phy_link_ksettings(handle, cmd);
> + }
[ ... ]
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
[ ... ]
> @@ -3400,10 +3400,28 @@ hclge_set_phy_link_ksettings(struct hnae3_handle *handle,
> return ret;
> }
>
> + linkmode_copy(hdev->hw.mac.advertising, cmd->link_modes.advertising);
> + return 0;
> +}
> +
> +static int
> +hclge_set_phy_link_ksettings(struct hnae3_handle *handle,
> + const struct ethtool_link_ksettings *cmd)
> +{
> + struct hclge_vport *vport = hclge_get_vport(handle);
> + struct hclge_dev *hdev = vport->back;
> + int ret;
> +
> + if (hnae3_dev_phy_imp_supported(hdev))
> + ret = hclge_ethtool_ksettings_set(handle, cmd);
> + else
> + ret = phy_ethtool_ksettings_set(handle->netdev->phydev, cmd);
[Severity: Critical]
Can handle->netdev->phydev be NULL here, leading to a kernel panic when called
from userspace?
A copper port can legitimately have a NULL phydev if the port's PHY address
is configured as PHY_INEXISTENT in the firmware.
Because the updated hns3_set_link_ksettings() now routes all
HNAE3_MEDIA_TYPE_COPPER configurations here without the previous
if (netdev->phydev) guard, phy_ethtool_ksettings_set() will unconditionally
dereference phydev->mdio.addr.
> + if (ret)
> + return ret;
> +
> hdev->hw.mac.req_autoneg = cmd->base.autoneg;
> hdev->hw.mac.req_speed = cmd->base.speed;
> hdev->hw.mac.req_duplex = cmd->base.duplex;
^ permalink raw reply
* Re: [PATCH net 2/5] net: hns3: refactor MAC autoneg and speed configuration
From: Simon Horman @ 2026-06-18 15:34 UTC (permalink / raw)
To: shaojijie
Cc: Simon Horman, davem, edumazet, kuba, pabeni, andrew+netdev,
shenjian15, liuyonglong, chenhao418, huangdonghua3, yangshuaisong,
netdev, linux-kernel
In-Reply-To: <20260617112721.75186-3-shaojijie@huawei.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
net: hns3: refactor MAC autoneg and speed configuration
This refactor extracts the MAC autoneg and speed/duplex/lane configuration
out of hclge_mac_init() into a new helper hclge_set_autoneg_speed_dup(),
and invokes it from both hclge_init_ae_dev() and hclge_reset_ae_dev().
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> index 5a00797d9252..2c74675b149f 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> @@ -2957,20 +2957,6 @@ static int hclge_mac_init(struct hclge_dev *hdev)
> if (!test_bit(HCLGE_STATE_RST_HANDLING, &hdev->state))
> hdev->hw.mac.duplex = HCLGE_MAC_FULL;
>
> - if (hdev->hw.mac.support_autoneg) {
> - ret = hclge_set_autoneg_en(hdev, hdev->hw.mac.autoneg);
> - if (ret)
> - return ret;
> - }
> -
> - if (!hdev->hw.mac.autoneg) {
> - ret = hclge_cfg_mac_speed_dup_hw(hdev, hdev->hw.mac.req_speed,
> - hdev->hw.mac.req_duplex,
> - hdev->hw.mac.lane_num);
> - if (ret)
> - return ret;
> - }
> -
> mac->link = 0;
[ ... ]
> @@ -11901,6 +11908,13 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
> if (ret)
> goto err_ptp_uninit;
>
> + ret = hclge_set_autoneg_speed_dup(hdev);
> + if (ret) {
> + dev_err(&pdev->dev,
> + "failed to set autoneg speed duplex, ret = %d\n", ret);
> + goto err_ptp_uninit;
> + }
> +
[High]
Has the relocation of the autoneg/speed configuration in hclge_init_ae_dev()
been considered against the surrounding init order?
Previously the equivalent calls ran inside hclge_mac_init() before
hclge_update_port_info(). After this patch the helper runs after
hclge_update_port_info(), and hclge_update_port_info() ->
hclge_get_sfp_info() unconditionally overwrites hdev->hw.mac.autoneg,
hdev->hw.mac.support_autoneg and hdev->hw.mac.lane_num from firmware
whenever resp->speed_ability is non-zero.
The new helper then feeds the firmware-supplied lane_num back through:
if (!hdev->hw.mac.autoneg) {
ret = hclge_cfg_mac_speed_dup_hw(hdev, hdev->hw.mac.req_speed,
hdev->hw.mac.req_duplex,
hdev->hw.mac.lane_num);
On a non-copper port where firmware reports forced mode (autoneg=0) and a
stale lane_num inherited from a previous link lifecycle (for example
lane_num=4 from a prior 100G state), the (req_speed, lane_num) pair given
to hclge_cfg_mac_speed_dup_hw() can be inconsistent (for example 25G with
4 lanes), and firmware rejects it with -EINVAL, which propagates back to
hclge_init_ae_dev() and unwinds via err_ptp_uninit, failing the entire
probe.
The follow-up commit a7e52c16ecbf "net: hns3: fix init failure caused by
lane_num contamination" later in this series fixes this by introducing a
separate req_lane_num=0 and passing that to hclge_cfg_mac_speed_dup_hw()
instead of mac.lane_num. Should the fix be folded into this patch so the
series does not introduce a probe-failure regression at this point and
then repair it?
[Medium]
After this change, the last write to MAC speed/duplex on the init path is
hclge_cfg_mac_speed_dup_hw() inside the new helper, which programs
hardware but does not update the cached hdev->hw.mac.speed and
hdev->hw.mac.duplex. Previously the last writer was
hclge_cfg_mac_speed_dup() inside hclge_update_port_info(), which updates
both hardware and the cached fields.
When req_speed differs from the firmware-reported mac.speed (which can
happen if the active SFP speed differs from cfg.default_speed parsed in
hclge_configure()), is hardware left programmed with req_speed while
mac.speed retains the firmware-reported value, leaving the cache out of
sync with the actual hardware configuration until the periodic service
task refreshes mac.speed?
> @@ -12231,6 +12245,13 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev)
> return ret;
> }
>
> + ret = hclge_set_autoneg_speed_dup(hdev);
> + if (ret) {
> + dev_err(&pdev->dev,
> + "failed to set autoneg speed duplex, ret = %d\n", ret);
> + return ret;
> + }
> +
> ret = hclge_tp_port_init(hdev);
> if (ret) {
> dev_err(&pdev->dev, "failed to init tp port, ret = %d\n",
[Low]
Could the commit message be tightened to match the resulting code?
The message says the helper "will also be invoked during the hardware
reset recovery path to re-apply link settings without repeating
unnecessary buffer or MTU initializations." However hclge_reset_ae_dev()
still calls hclge_mac_init() immediately before hclge_set_autoneg_speed_dup(),
and hclge_mac_init() still performs hclge_set_mac_mtu(),
hclge_set_default_loopback() and hclge_buffer_alloc(), so the avoidance
of MTU/buffer re-initialization described in the changelog is not
realized here.
The commit message is also framed as a pure readability/cyclomatic-
complexity refactor, but the autoneg/speed configuration is not just
extracted: in hclge_init_ae_dev() it is moved to a materially later
point (after hclge_update_port_info(), TSO, GRO, VLAN, TM, RSS, FD and
PTP setup), versus the original code where it was the very first
hardware-touching step inside hclge_mac_init(). The reset path keeps the
original relative ordering (helper runs immediately after
hclge_mac_init()), so the init and reset paths are now asymmetric in
placement, and the rationale for the late placement on the init path is
not stated.
^ permalink raw reply
* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Maxime Chevallier @ 2026-06-18 15:34 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Zhou, Yun
Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
netdev, linux-kernel
In-Reply-To: <20260618150440.cLDwgyDM@linutronix.de>
Hi,
On 6/18/26 17:04, Sebastian Andrzej Siewior wrote:
> On 2026-06-18 21:56:12 [+0800], Zhou, Yun wrote:
>>
>>> But if the thread is idle then you have one enable too many, don't you?
>>> Well you have the NAPI callback which does disable on the local CPU and
>>> this resume which enables it on every CPU. So this does not look right.
>>>
>>
>> The enable in resume is intentionally unconditional and idempotent
>> (writing MPIC_INT_CLEAR_MASK on an already unmasked IRQ is a no-op).
>>
>>> The interesting question is what happens to the enable_percpu_irq() from
>>> the mvneta_poll(). Is it lost? And if so, how/ why?
>>>
>>
>> The enable_percpu_irq() from mvneta_poll is not "lost" — it never
>> gets a chance to execute. The sequence is:
>>
>> 1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
>> 2. PM freezes kthreads (on PREEMPT_RT, softirq runs in kthread)
>
> But napi_schedule() runs in ksoftird, doesn't it? The per-CPU IRQs are
> not threaded. That does not look optimal.
>
>> 3. NAPI poll cannot run → enable_percpu_irq() is never called
>> 4. mvneta_stop_dev → napi_disable(): cancels the scheduled poll
>> but does NOT execute the completion path (no enable_percpu_irq)
>
> napi_schedule() sets NAPIF_STATE_SCHED.
> napi_disable() sets NAPI_STATE_DISABLE and waits until NAPIF_STATE_SCHED
> is cleared. So if NAPIF_STATE_SCHED was set then enable_percpu_irq()
> will be invoked unless it leaves somewhere early.
> However, if DISABLED was already set then it disables the IRQ source but
> does not schedule NAPI. This is probably what happens.
>
>> 5. Resume → napi_enable(): resets NAPI state but MPIC stays masked
>>
>> The unconditional enable in resume covers this case. When NAPI was
>> idle at suspend time, the extra enable is harmless.
>
> There is no desc::depth counting here, that got me confused. But that
> per-CPU irq is not optimal. Is this a SoC limitation or a design choice?
I _think_ that on mvneta, the mapping is done by assigning queues to CPUs
directly, and then you have per-cpu register banks to handle the interrupts :
https://elixir.bootlin.com/linux/v7.1/source/drivers/net/ethernet/marvell/mvneta.c#L138
This seems to be confirmed by some comments here [1] :
static void mvneta_percpu_mask_interrupt(void *arg)
{
struct mvneta_port *pp = arg;
/* All the queue are masked, but actually only the ones
* mapped to this CPU will be masked
*/
mvreg_write(pp, MVNETA_INTR_NEW_MASK, 0);
mvreg_write(pp, MVNETA_INTR_OLD_MASK, 0);
mvreg_write(pp, MVNETA_INTR_MISC_MASK, 0);
}
These registers will control different queue's interrupt behaviour depending
on which CPU executes that code...
[1] : https://elixir.bootlin.com/linux/v7.1/source/drivers/net/ethernet/marvell/mvneta.c#L1441
This may be why the design ended-up that way. I'm not saying this is ideal
though :)
The percpu interrupt mechanism isn't used on armada 3700 (hence all the special
conditions) because IIRC the interrupt routing is flawed for the network part, and
all interrupts end-up on CPU0 anyways...
Maxime
^ permalink raw reply
* Re: [PATCH net 4/5] net: hns3: differentiate autoneg default values between copper and fiber
From: Simon Horman @ 2026-06-18 15:34 UTC (permalink / raw)
To: shaojijie
Cc: Simon Horman, davem, edumazet, kuba, pabeni, andrew+netdev,
shenjian15, liuyonglong, chenhao418, huangdonghua3, yangshuaisong,
netdev, linux-kernel
In-Reply-To: <20260617112721.75186-5-shaojijie@huawei.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: hns3: differentiate autoneg default values between copper and fiber
This commit addresses a link loss issue during driver initialization on
optical ports connected to forced-mode remote switches. It modifies
hclge_init_ae_dev to set autonegotiation based on media type. Copper ports
retain the AUTONEG_ENABLE default, while optical ports inherit the firmware
pre-configured autoneg setting.
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> @@ -11916,6 +11916,9 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
> if (ret)
> goto err_ptp_uninit;
>
> + if (hdev->hw.mac.media_type != HNAE3_MEDIA_TYPE_COPPER)
> + hdev->hw.mac.req_autoneg = hdev->hw.mac.autoneg;
> +
[Severity: High]
Does this apply the wrong speed when autonegotiation is disabled on fiber
ports?
During driver initialization, hclge_update_port_info() dynamically probes
the SFP module and updates mac->speed, but does not update mac->req_speed,
which retains its statically configured default capability.
When hdev->hw.mac.autoneg is 0 (forced-mode), req_autoneg becomes 0, and
then hclge_set_autoneg_speed_dup() uses req_speed:
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:hclge_set_autoneg_speed_dup() {
...
if (!hdev->hw.mac.req_autoneg) {
ret = hclge_cfg_mac_speed_dup_hw(hdev, hdev->hw.mac.req_speed,
...
}
Could this override the dynamically probed SFP speed with the static default
speed?
[Severity: High]
Does this unconditionally disable autonegotiation for V1 optical hardware?
Looking at hclge_update_port_info(), it only queries SFP information via
hclge_get_sfp_info() for V2 hardware. For V1 hardware, it falls back to
hclge_get_sfp_speed(), which leaves mac->autoneg uninitialized (0):
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:hclge_update_port_info() {
...
if (hdev->ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V2) {
speed = mac->speed;
ret = hclge_get_sfp_info(hdev, mac);
} else {
speed = HCLGE_MAC_SPEED_UNKNOWN;
ret = hclge_get_sfp_speed(hdev, &speed);
}
...
}
If mac->autoneg is left as 0, wouldn't req_autoneg unconditionally be set
to 0 here, overriding the speed and breaking connectivity for V1 devices?
> ret = hclge_set_autoneg_speed_dup(hdev);
> if (ret) {
> dev_err(&pdev->dev,
^ permalink raw reply
* Re: [PATCH net] net: airoha: fix BQL underflow and UAF in shared QDMA TX ring
From: Lorenzo Bianconi @ 2026-06-18 15:42 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Wayen Yan, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260618-airoha-bql-fixes-v1-1-ffd2c2089518@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 7943 bytes --]
> When multiple netdevs share a QDMA TX ring and one device is stopped,
> netdev_tx_reset_subqueue() zeroes that device's BQL counters while its
> pending skbs remain in the shared HW TX ring. When NAPI later completes
> those skbs via netdev_tx_completed_queue(), the already-zeroed
> dql->num_queued counter underflows.
> Moreover, in the airoha_remove() path, netdevs are unregistered
> sequentially while skbs from previously unregistered netdevs may still
> reference freed net_device memory via skb->dev, causing a use-after-free
> during BQL accounting.
> Fix both issues:
> - Remove netdev_tx_reset_subqueue() from airoha_dev_stop() so pending
> skbs are completed naturally by NAPI with proper BQL accounting.
> - Add netdev_tx_completed_queue() in airoha_qdma_cleanup_tx_queue()
> to properly account for skbs freed during queue teardown.
> - Introduce airoha_qdma_tx_disable() to stop TX on all registered
> netdevs for a given QDMA instance under RTNL lock.
> - Move DMA engine start/stop into probe/remove and
> airoha_qdma_cleanup(), ensuring TX queues are cleaned up while all
> netdevs are still registered and skb->dev is valid.
>
> Fixes: 6df0488dc9dd ("net: airoha: fix BQL accounting in airoha_qdma_cleanup_tx_queue()")
This Fixes tag is wrong, the proper one is the one below:
Fixes: a9c2ca61fec7 ("net: airoha: Support multiple net_devices for a single FE GDM port")
Regards,
Lorenzo
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
> drivers/net/ethernet/airoha/airoha_eth.c | 95 ++++++++++++++++++++++++--------
> 1 file changed, 72 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 64dde6464f3f..4d6a061cd779 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -1004,6 +1004,7 @@ static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
>
> e = &q->entry[index];
> skb = e->skb;
> + e->skb = NULL;
>
> dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
> DMA_TO_DEVICE);
> @@ -1147,6 +1148,42 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
> return 0;
> }
>
> +static void airoha_qdma_tx_disable(struct airoha_qdma *qdma)
> +{
> + struct airoha_eth *eth = qdma->eth;
> + int i;
> +
> + /* Protect netdev->reg_state and netif_tx_disable() calls. */
> + rtnl_lock();
> +
> + for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
> + struct airoha_gdm_port *port = eth->ports[i];
> + int j;
> +
> + if (!port)
> + continue;
> +
> + for (j = 0; j < ARRAY_SIZE(port->devs); j++) {
> + struct airoha_gdm_dev *dev = port->devs[j];
> + struct net_device *netdev;
> +
> + if (!dev)
> + continue;
> +
> + if (dev->qdma != qdma)
> + continue;
> +
> + netdev = netdev_from_priv(dev);
> + if (netdev->reg_state != NETREG_REGISTERED)
> + continue;
> +
> + netif_tx_disable(netdev);
> + }
> + }
> +
> + rtnl_unlock();
> +}
> +
> static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> {
> struct airoha_qdma *qdma = q->qdma;
> @@ -1158,13 +1195,20 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
> for (i = 0; i < q->ndesc; i++) {
> struct airoha_queue_entry *e = &q->entry[i];
> struct airoha_qdma_desc *desc = &q->desc[i];
> + struct sk_buff *skb = e->skb;
>
> if (!e->dma_addr)
> continue;
>
> dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
> DMA_TO_DEVICE);
> - dev_kfree_skb_any(e->skb);
> + if (skb) {
> + struct netdev_queue *txq;
> +
> + txq = skb_get_tx_queue(skb->dev, skb);
> + netdev_tx_completed_queue(txq, 1, skb->len);
> + dev_kfree_skb_any(skb);
> + }
> e->dma_addr = 0;
> e->skb = NULL;
> list_add_tail(&e->list, &q->tx_list);
> @@ -1527,6 +1571,23 @@ static void airoha_qdma_cleanup(struct airoha_qdma *qdma)
> {
> int i;
>
> + if (test_bit(DEV_STATE_INITIALIZED, &qdma->eth->state)) {
> + u32 status;
> +
> + airoha_qdma_tx_disable(qdma);
> +
> + airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
> + GLOBAL_CFG_TX_DMA_EN_MASK |
> + GLOBAL_CFG_RX_DMA_EN_MASK);
> + if (read_poll_timeout(airoha_qdma_rr, status,
> + !(status & (GLOBAL_CFG_TX_DMA_BUSY_MASK |
> + GLOBAL_CFG_RX_DMA_BUSY_MASK)),
> + USEC_PER_MSEC, 50 * USEC_PER_MSEC, true,
> + qdma, REG_QDMA_GLOBAL_CFG))
> + dev_warn(qdma->eth->dev,
> + "QDMA DMA engine busy timeout\n");
> + }
> +
> for (i = 0; i < ARRAY_SIZE(qdma->q_rx); i++) {
> if (!qdma->q_rx[i].ndesc)
> continue;
> @@ -1837,9 +1898,6 @@ static int airoha_dev_open(struct net_device *netdev)
> }
> port->users++;
>
> - airoha_qdma_set(qdma, REG_QDMA_GLOBAL_CFG,
> - GLOBAL_CFG_TX_DMA_EN_MASK |
> - GLOBAL_CFG_RX_DMA_EN_MASK);
> qdma->users++;
>
> if (!airoha_is_lan_gdm_dev(dev) &&
> @@ -1880,12 +1938,9 @@ static int airoha_dev_stop(struct net_device *netdev)
> struct airoha_gdm_dev *dev = netdev_priv(netdev);
> struct airoha_gdm_port *port = dev->port;
> struct airoha_qdma *qdma = dev->qdma;
> - int i;
>
> netif_tx_disable(netdev);
> airoha_set_vip_for_gdm_port(dev, false);
> - for (i = 0; i < netdev->num_tx_queues; i++)
> - netdev_tx_reset_subqueue(netdev, i);
>
> if (--port->users)
> airoha_set_port_mtu(dev->eth, port);
> @@ -1893,19 +1948,7 @@ static int airoha_dev_stop(struct net_device *netdev)
> airoha_set_gdm_port_fwd_cfg(qdma->eth,
> REG_GDM_FWD_CFG(port->id),
> FE_PSE_PORT_DROP);
> -
> - if (!--qdma->users) {
> - airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
> - GLOBAL_CFG_TX_DMA_EN_MASK |
> - GLOBAL_CFG_RX_DMA_EN_MASK);
> -
> - for (i = 0; i < ARRAY_SIZE(qdma->q_tx); i++) {
> - if (!qdma->q_tx[i].ndesc)
> - continue;
> -
> - airoha_qdma_cleanup_tx_queue(&qdma->q_tx[i]);
> - }
> - }
> + qdma->users--;
>
> return 0;
> }
> @@ -3413,8 +3456,12 @@ static int airoha_probe(struct platform_device *pdev)
> if (err)
> goto error_netdev_free;
>
> - for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
> + for (i = 0; i < ARRAY_SIZE(eth->qdma); i++) {
> airoha_qdma_start_napi(ð->qdma[i]);
> + airoha_qdma_set(ð->qdma[i], REG_QDMA_GLOBAL_CFG,
> + GLOBAL_CFG_TX_DMA_EN_MASK |
> + GLOBAL_CFG_RX_DMA_EN_MASK);
> + }
>
> for_each_child_of_node(pdev->dev.of_node, np) {
> if (!of_device_is_compatible(np, "airoha,eth-mac"))
> @@ -3440,6 +3487,8 @@ static int airoha_probe(struct platform_device *pdev)
> for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
> airoha_qdma_stop_napi(ð->qdma[i]);
>
> + airoha_hw_cleanup(eth);
> +
> for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
> struct airoha_gdm_port *port = eth->ports[i];
> int j;
> @@ -3461,7 +3510,6 @@ static int airoha_probe(struct platform_device *pdev)
> }
> airoha_metadata_dst_free(port);
> }
> - airoha_hw_cleanup(eth);
> error_netdev_free:
> free_netdev(eth->napi_dev);
> platform_set_drvdata(pdev, NULL);
> @@ -3477,6 +3525,8 @@ static void airoha_remove(struct platform_device *pdev)
> for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
> airoha_qdma_stop_napi(ð->qdma[i]);
>
> + airoha_hw_cleanup(eth);
> +
> for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
> struct airoha_gdm_port *port = eth->ports[i];
> int j;
> @@ -3497,7 +3547,6 @@ static void airoha_remove(struct platform_device *pdev)
> }
> airoha_metadata_dst_free(port);
> }
> - airoha_hw_cleanup(eth);
>
> free_netdev(eth->napi_dev);
> platform_set_drvdata(pdev, NULL);
>
> ---
> base-commit: 7d8297e26b4e20b5d1c3c3fe51fe81a1c7fbc823
> change-id: 20260618-airoha-bql-fixes-f57b2d108573
>
> Best regards,
> --
> Lorenzo Bianconi <lorenzo@kernel.org>
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply
* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Zhou, Yun @ 2026-06-18 15:45 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
maxime.chevallier, netdev, linux-kernel
In-Reply-To: <20260618150440.cLDwgyDM@linutronix.de>
On 6/18/2026 11:04 PM, Sebastian Andrzej Siewior wrote:
> CAUTION: This email comes from a non Wind River email account!
> Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
> On 2026-06-18 21:56:12 [+0800], Zhou, Yun wrote:
>>
>> The enable_percpu_irq() from mvneta_poll is not "lost" — it never
>> gets a chance to execute. The sequence is:
>>
>> 1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
>> 2. PM freezes kthreads (on PREEMPT_RT, softirq runs in kthread)
>
> But napi_schedule() runs in ksoftird, doesn't it? The per-CPU IRQs are
> not threaded. That does not look optimal.
>
Correct, percpu IRQs are not threaded. net_rx_action (which calls
mvneta_poll and ultimately enable_percpu_irq) runs in softirq context
via ksoftirqd.
>> 3. NAPI poll cannot run → enable_percpu_irq() is never called
>> 4. mvneta_stop_dev → napi_disable(): cancels the scheduled poll
>> but does NOT execute the completion path (no enable_percpu_irq)
>
> napi_schedule() sets NAPIF_STATE_SCHED.
> napi_disable() sets NAPI_STATE_DISABLE and waits until NAPIF_STATE_SCHED
> is cleared. So if NAPIF_STATE_SCHED was set then enable_percpu_irq()
> will be invoked unless it leaves somewhere early.
> However, if DISABLED was already set then it disables the IRQ source but
> does not schedule NAPI. This is probably what happens.
Your analysis makes perfect sense, and that scenario is indeed much more
likely. It looks like I'll need to update the commit message accordingly.
>> 5. Resume → napi_enable(): resets NAPI state but MPIC stays masked
>>
>> The unconditional enable in resume covers this case. When NAPI was
>> idle at suspend time, the extra enable is harmless.
>
> There is no desc::depth counting here, that got me confused. But that
> per-CPU irq is not optimal. Is this a SoC limitation or a design choice?
> Having a NAPI instance with IRQ per queue and those configured and
> spread among CPUs should cause less trouble and is what others do.
> In fact is the only net driver using per-CPU interrupts.
>
It is a SoC limitation. Armada XP's MPIC provides a single shared
interrupt for the ethernet controller with per-CPU masking for
interrupt steering — there are no per-queue MSI vectors. The percpu
IRQ model was the only way to distribute interrupt handling across
CPUs given this hardware constraint.
Newer Marvell SoCs (armada3700+) moved away from this model and use
standard IRQs, which is why the armada3700 path does not have this
problem.
^ permalink raw reply
* Fw: [Bug 221665] New: tcp_ecn: doc says default is 2, but should be 5
From: Stephen Hemminger @ 2026-06-18 15:52 UTC (permalink / raw)
To: netdev
Begin forwarded message:
Date: Thu, 18 Jun 2026 13:05:29 +0000
From: bugzilla-daemon@kernel.org
To: stephen@networkplumber.org
Subject: [Bug 221665] New: tcp_ecn: doc says default is 2, but should be 5
https://bugzilla.kernel.org/show_bug.cgi?id=221665
Bug ID: 221665
Summary: tcp_ecn: doc says default is 2, but should be 5
Product: Networking
Version: 2.5
Hardware: All
OS: Linux
Status: NEW
Severity: normal
Priority: P3
Component: IPV4
Assignee: stephen@networkplumber.org
Reporter: pedretti.fabio@gmail.com
CC: corbet@lwn.net, pabeni@redhat.com
Regression: No
Here: https://docs.kernel.org/networking/ip-sysctl.html#:~:text=tcp_ecn
tcp_ecn - INTEGER
...
Default: 2
However it looks like the default was recently changed:
"In previous kernels, the value of tcp_ecn is, by default, two, [...] The new
default value is five, [...]"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ae3e8e6ceedfb3cf74ca18169c942e073586a39
Please address this.
Thanks.
--
You may reply to this email to add a comment.
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox