* [PATCH net v4 2/2] net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
From: Lorenzo Bianconi @ 2026-04-17 6:36 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Lorenzo Bianconi, Simon Horman
Cc: linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260417-airoha_qdma_cleanup_tx_queue-fix-net-v4-0-e04bcc2c9642@kernel.org>
Similar to airoha_qdma_cleanup_rx_queue(), reset the DMA TX descriptors in
airoha_qdma_cleanup_tx_queue(). Moreover, reset TX_DMA_IDX to TX_CPU_IDX to
notify the NIC that the QDMA TX ring is empty.
Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
drivers/net/ethernet/airoha/airoha_eth.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 690bfaf8d7d9..6d9f82c677a0 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -1039,12 +1039,15 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
{
- struct airoha_eth *eth = q->qdma->eth;
- int i;
+ struct airoha_qdma *qdma = q->qdma;
+ struct airoha_eth *eth = qdma->eth;
+ int i, qid = q - &qdma->q_tx[0];
+ u16 index = 0;
spin_lock_bh(&q->lock);
for (i = 0; i < q->ndesc; i++) {
struct airoha_queue_entry *e = &q->entry[i];
+ struct airoha_qdma_desc *desc = &q->desc[i];
if (!e->dma_addr)
continue;
@@ -1055,8 +1058,33 @@ static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
e->dma_addr = 0;
e->skb = NULL;
list_add_tail(&e->list, &q->tx_list);
+
+ /* Reset DMA descriptor */
+ WRITE_ONCE(desc->ctrl, 0);
+ WRITE_ONCE(desc->addr, 0);
+ WRITE_ONCE(desc->data, 0);
+ WRITE_ONCE(desc->msg0, 0);
+ WRITE_ONCE(desc->msg1, 0);
+ WRITE_ONCE(desc->msg2, 0);
+
q->queued--;
}
+
+ if (!list_empty(&q->tx_list)) {
+ struct airoha_queue_entry *e;
+
+ e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
+ list);
+ index = e - q->entry;
+ }
+ /* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
+ * empty.
+ */
+ airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
+ FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
+ airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
+ FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
+
spin_unlock_bh(&q->lock);
}
--
2.53.0
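The cleanup described in this patch can be modelled in plain userspace C. This is only a sketch under stated assumptions: the demo_* types and fields are hypothetical stand-ins for the driver's structures, the two index fields stand in for the TX_CPU_IDX/TX_DMA_IDX register fields, and the restart slot is hard-coded to 0, whereas the patch derives it from the first entry on q->tx_list.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_NDESC 8

/* Hypothetical stand-ins for the driver's descriptor and entry types. */
struct demo_desc {
	uint32_t ctrl, addr, data, msg0, msg1, msg2;
};

struct demo_entry {
	uint64_t dma_addr;	/* non-zero while a buffer is mapped */
};

struct demo_txq {
	struct demo_desc desc[DEMO_NDESC];
	struct demo_entry entry[DEMO_NDESC];
	unsigned int queued;
	uint16_t cpu_idx;	/* models the TX_CPU_IDX register field */
	uint16_t dma_idx;	/* models the TX_DMA_IDX register field */
};

static void demo_cleanup_txq(struct demo_txq *q)
{
	unsigned int i;

	for (i = 0; i < DEMO_NDESC; i++) {
		if (!q->entry[i].dma_addr)
			continue;

		/* Release the slot and wipe the descriptor the hw would read. */
		q->entry[i].dma_addr = 0;
		memset(&q->desc[i], 0, sizeof(q->desc[i]));
		q->queued--;
	}

	/* Assumption: restart at slot 0. Equal indices mean "ring empty". */
	q->cpu_idx = 0;
	q->dma_idx = q->cpu_idx;
}
```

The point of the last two assignments is that stale descriptors alone are not enough to leave the ring in a clean state: as long as the two indices differ, the hardware may still consider descriptors pending.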
* [PATCH net v4 1/2] net: airoha: Move ndesc initialization at end of airoha_qdma_init_tx()
From: Lorenzo Bianconi @ 2026-04-17 6:36 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Lorenzo Bianconi, Simon Horman
Cc: linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260417-airoha_qdma_cleanup_tx_queue-fix-net-v4-0-e04bcc2c9642@kernel.org>
If the queue entry list allocation fails in airoha_qdma_init_tx_queue(),
airoha_qdma_cleanup_tx_queue() will trigger a NULL pointer dereference
accessing the queue entry array. The issue is due to the early ndesc
initialization in airoha_qdma_init_tx_queue(). Fix the issue by moving the
ndesc initialization later in airoha_qdma_init_tx_queue(), after the
allocations have succeeded.
Fixes: 3f47e67dff1f7 ("net: airoha: Add the capability to consume out-of-order DMA tx descriptors")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
drivers/net/ethernet/airoha/airoha_eth.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index e1ab15f1ee7d..690bfaf8d7d9 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -954,27 +954,27 @@ static int airoha_qdma_init_tx_queue(struct airoha_queue *q,
dma_addr_t dma_addr;
spin_lock_init(&q->lock);
- q->ndesc = size;
q->qdma = qdma;
q->free_thr = 1 + MAX_SKB_FRAGS;
INIT_LIST_HEAD(&q->tx_list);
- q->entry = devm_kzalloc(eth->dev, q->ndesc * sizeof(*q->entry),
+ q->entry = devm_kzalloc(eth->dev, size * sizeof(*q->entry),
GFP_KERNEL);
if (!q->entry)
return -ENOMEM;
- q->desc = dmam_alloc_coherent(eth->dev, q->ndesc * sizeof(*q->desc),
+ q->desc = dmam_alloc_coherent(eth->dev, size * sizeof(*q->desc),
&dma_addr, GFP_KERNEL);
if (!q->desc)
return -ENOMEM;
- for (i = 0; i < q->ndesc; i++) {
+ for (i = 0; i < size; i++) {
u32 val = FIELD_PREP(QDMA_DESC_DONE_MASK, 1);
list_add_tail(&q->entry[i].list, &q->tx_list);
WRITE_ONCE(q->desc[i].ctrl, cpu_to_le32(val));
}
+ q->ndesc = size;
/* xmit ring drop default setting */
airoha_qdma_set(qdma, REG_TX_RING_BLOCKING(qid),
--
2.53.0
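The ordering problem fixed here can be illustrated with a small userspace model. It is a sketch, not the driver code: demo_queue, demo_init() and demo_cleanup() are hypothetical names, and the fail_alloc flag simply simulates a failed allocation. The element count is published only after the backing array exists, so a cleanup loop bounded by that count never touches a NULL array.

```c
#include <stdlib.h>

/* Hypothetical miniature of a queue with a count and a backing array. */
struct demo_queue {
	int ndesc;
	int *entry;
};

static int demo_init(struct demo_queue *q, int size, int fail_alloc)
{
	q->entry = fail_alloc ? NULL : calloc(size, sizeof(*q->entry));
	if (!q->entry)
		return -1;	/* ndesc is still 0 at this point */

	/* Publish the size only once the backing storage exists. */
	q->ndesc = size;
	return 0;
}

/* Returns the number of entries visited, to make the effect observable. */
static int demo_cleanup(struct demo_queue *q)
{
	int i, visited = 0;

	for (i = 0; i < q->ndesc; i++) {
		q->entry[i] = 0;
		visited++;
	}

	free(q->entry);
	q->entry = NULL;
	q->ndesc = 0;
	return visited;
}
```

With the count assigned before the allocation, as in the code being fixed, the loop in the cleanup path would run over a NULL array after a failed init.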
* [PATCH net v4 0/2] net: airoha: Fix airoha_qdma_cleanup_tx_queue() processing
From: Lorenzo Bianconi @ 2026-04-17 6:36 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Lorenzo Bianconi, Simon Horman
Cc: linux-arm-kernel, linux-mediatek, netdev
Add missing bits in airoha_qdma_cleanup_tx_queue routine.
Fix airoha_qdma_cleanup_tx_queue processing errors introduced in commit
'3f47e67dff1f7 ("net: airoha: Add the capability to consume out-of-order
DMA tx descriptors")'.
---
Changes in v4:
- Drop patch 2/3 to move entries to queue head in case of DMA mapping
failure in airoha_dev_xmit().
- Link to v3: https://lore.kernel.org/r/20260416-airoha_qdma_cleanup_tx_queue-fix-net-v3-0-2b69f5788580@kernel.org
Changes in v3:
- Move ndesc initialization fix in a dedicated patch.
- Add patch 2/3 to move entries to queue head in case of DMA mapping
failure in airoha_dev_xmit().
- Cosmetics.
- Link to v2: https://lore.kernel.org/r/20260414-airoha_qdma_cleanup_tx_queue-fix-net-v2-1-875de57cc022@kernel.org
Changes in v2:
- Move q->ndesc initialization at end of airoha_qdma_init_tx routine in
order to avoid any possible NULL pointer dereference in
airoha_qdma_cleanup_tx_queue()
- Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
- Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
---
Lorenzo Bianconi (2):
net: airoha: Move ndesc initialization at end of airoha_qdma_init_tx()
net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
drivers/net/ethernet/airoha/airoha_eth.c | 40 +++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 6 deletions(-)
---
base-commit: 82c21069028c5db3463f851ae8ac9cc2e38a3827
change-id: 20260410-airoha_qdma_cleanup_tx_queue-fix-net-93375f5ee80f
Best regards,
--
Lorenzo Bianconi <lorenzo@kernel.org>
* Re: Path forward for NFC in the kernel
From: Michael Walle @ 2026-04-17 6:35 UTC
To: Jakub Kicinski, Michael Thalmeier, Raymond Hackley, Bongsu Jeon,
Krzysztof Kozlowski, Mark Greer
Cc: netdev
In-Reply-To: <20260416101041.4c533306@kernel.org>
Hi,
On Thu Apr 16, 2026 at 7:10 PM CEST, Jakub Kicinski wrote:
> Hi folks!
>
> We are struggling to keep up with the number of security reports and AI
> generated patches in the kernel. NFC is infamous for being a huge CVE
> magnet. We need someone to step up as a maintainer, create an NFC tree
> and handle all the incoming submissions. Send us (or Linus if you
> prefer) periodic PRs, like WiFi, Bluetooth etc. do. If that does not
> happen I'm afraid we'll have to move the NFC code out of the tree,
> put it up on GH or some such, and let it accumulate CVEs there..
>
> I'm planning to send a PR to Linus to shed the unmaintained code early
> next week. We need to have a maintainer established by then.
Thanks for asking, but I'm busy renovating my house, sorry. I
couldn't put much work into that. The former is already stressful
enough :)
-michael
* [PATCH iwl-net 4/4] ice: report EIPE checksum errors to the OS on E830
From: Aleksandr Loktionov @ 2026-04-17 6:29 UTC
To: intel-wired-lan, anthony.l.nguyen, aleksandr.loktionov; +Cc: netdev, Jan Glaza
In-Reply-To: <20260417062954.1241900-1-aleksandr.loktionov@intel.com>
From: Jan Glaza <jan.glaza@intel.com>
For E830 adapters the hardware-reported EIPE (Ethernet Inline IPsec
Engine) error is a reliable indication that a received packet failed
decryption and has a bad checksum. Route EIPE errors through the
generic checksum error path on E830 so the error is visible via
standard ethtool statistics (rx_csum_bad).
On previous devices (E810, E82X) the EIPE flag can be spuriously set
on encapsulated packets with inner L2 padding, so those adapters only
increment the driver-private hw_rx_eipe_error counter without routing
through the checksum error path.
Fixes: 0ca6755f3cc2 ("ice: Add a new counter for Rx EIPE errors")
Signed-off-by: Jan Glaza <jan.glaza@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index e695a66..82d9d2c4 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -140,6 +140,8 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
if (ipv4 && (rx_status0 & (BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_EIPE_S)))) {
ring->vsi->back->hw_rx_eipe_error++;
+ if (ring->vsi->back->hw.mac_type == ICE_MAC_E830)
+ goto checksum_fail;
return;
}
--
2.52.0
* [PATCH iwl-net 3/4] ice: support RDMA on 4+-port E830 devices
From: Aleksandr Loktionov @ 2026-04-17 6:29 UTC
To: intel-wired-lan, anthony.l.nguyen, aleksandr.loktionov
Cc: netdev, Lukasz Czapnik
In-Reply-To: <20260417062954.1241900-1-aleksandr.loktionov@intel.com>
From: Lukasz Czapnik <lukasz.czapnik@intel.com>
E810 and E82X devices do not support RDMA on configurations with more
than 4 ports. This limitation does not apply to E830 devices, which
have a different hardware design and support RDMA regardless of the
port count.
Narrow the RDMA capability disable condition to skip E830 devices.
Fixes: ba1124f58afd ("ice: Add E830 device IDs, MAC type and registers")
Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
drivers/net/ethernet/intel/ice/ice_common.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index ce11fea..0e40011 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -2509,7 +2509,7 @@ ice_recalc_port_limited_caps(struct ice_hw *hw, struct ice_hw_common_caps *caps)
caps->maxtc = 4;
ice_debug(hw, ICE_DBG_INIT, "reducing maxtc to %d (based on #ports)\n",
caps->maxtc);
- if (caps->rdma) {
+ if (caps->rdma && hw->mac_type != ICE_MAC_E830) {
ice_debug(hw, ICE_DBG_INIT, "forcing RDMA off\n");
caps->rdma = 0;
}
--
2.52.0
* [PATCH iwl-net 2/4] ice: fix autoneg disable when link partner doesn't support AN
From: Aleksandr Loktionov @ 2026-04-17 6:29 UTC
To: intel-wired-lan, anthony.l.nguyen, aleksandr.loktionov
Cc: netdev, Konrad Knitter
In-Reply-To: <20260417062954.1241900-1-aleksandr.loktionov@intel.com>
From: Konrad Knitter <konrad.knitter@intel.com>
Disabling autonegotiation was silently ignored when autoneg had not yet
completed (ICE_AQ_AN_COMPLETED was not set), leaving the configuration
unchanged with no error. This could prevent link from forming if the
link partner requires non-autoneg mode.
Extend the condition to also allow disabling autoneg when the link
partner reports no AN ability (ICE_AQ_LP_AN_ABILITY clear). Gate the
ICE_AQ_LP_AN_ABILITY check on the link being up so that stale or
zeroed an_info when link is down does not produce a false positive.
Introduce the helper ice_autoneg_disable_allowed() to make the check
explicit.
Fixes: f1a4a66d2310 ("ice: fix set pause param autoneg check")
Signed-off-by: Konrad Knitter <konrad.knitter@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
drivers/net/ethernet/intel/ice/ice_ethtool.c | 26 ++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 30d2550..e41fc8d 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -2509,6 +2509,28 @@ ice_ksettings_find_adv_link_speed(const struct ethtool_link_ksettings *ks)
return adv_link_speed;
}
+/**
+ * ice_autoneg_disable_allowed - check if autoneg can be disabled
+ * @p: port info
+ *
+ * Check if autonegotiation can be disabled based on link state.
+ * ICE_AQ_LP_AN_ABILITY is only valid when the link is up; gate that
+ * check accordingly to avoid false positives from stale link data.
+ *
+ * Return: true if autoneg has completed, or if the link is up and the
+ * link partner does not advertise autonegotiation capability.
+ */
+static bool ice_autoneg_disable_allowed(struct ice_port_info *p)
+{
+ u8 an_info = p->phy.link_info.an_info;
+
+ if (an_info & ICE_AQ_AN_COMPLETED)
+ return true;
+ /* ICE_AQ_LP_AN_ABILITY is only valid when link is up */
+ return (p->phy.link_info.link_info & ICE_AQ_LINK_UP) &&
+ !(an_info & ICE_AQ_LP_AN_ABILITY);
+}
+
/**
* ice_setup_autoneg
* @p: port info
@@ -2547,8 +2569,8 @@ ice_setup_autoneg(struct ice_port_info *p, struct ethtool_link_ksettings *ks,
}
}
} else {
- /* If autoneg is currently enabled */
- if (p->phy.link_info.an_info & ICE_AQ_AN_COMPLETED) {
+ /* If autoneg completed or link partner does not support AN */
+ if (ice_autoneg_disable_allowed(p)) {
/* If autoneg is supported 10GBASE_T is the only PHY
* that can disable it, so otherwise return error
*/
--
2.52.0
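The decision made by ice_autoneg_disable_allowed() can be checked as a small truth table outside the kernel. The sketch below mirrors the logic of the helper in this patch, but the DEMO_* bit values are placeholders chosen for illustration and are not claimed to match the driver's ICE_AQ_* definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions, not the real ICE_AQ_* values. */
#define DEMO_AN_COMPLETED	(1u << 0)	/* in an_info */
#define DEMO_LP_AN_ABILITY	(1u << 1)	/* in an_info */
#define DEMO_LINK_UP		(1u << 0)	/* in link_info */

static bool demo_autoneg_disable_allowed(uint8_t an_info, uint8_t link_info)
{
	if (an_info & DEMO_AN_COMPLETED)
		return true;

	/* The partner-ability bit is only trusted while the link is up. */
	return (link_info & DEMO_LINK_UP) &&
	       !(an_info & DEMO_LP_AN_ABILITY);
}
```

The interesting row is "link down, all bits clear": a zeroed an_info would look like "partner has no AN ability", and the link-up gate is what keeps that case from returning true.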
* [PATCH iwl-net 1/4] ice: fix asymmetric pause negotiation reporting in ethtool
From: Aleksandr Loktionov @ 2026-04-17 6:29 UTC
To: intel-wired-lan, anthony.l.nguyen, aleksandr.loktionov
Cc: netdev, Tomasz Lichwala
In-Reply-To: <20260417062954.1241900-1-aleksandr.loktionov@intel.com>
From: Tomasz Lichwala <tomasz.lichwala@intel.com>
Add Asym_Pause to the supported link modes so that asymmetric pause
negotiation is properly reported via ethtool. Without Asym_Pause in
the supported modes, 'ethtool -a' incorrectly shows 'RX/TX negotiated: off'
for asymmetric pause configurations, even when pause is properly
negotiated and functional at the hardware level.
Fixes: 5a056cd7ead2 ("ice: add lp_advertising flow control support")
Signed-off-by: Tomasz Lichwala <tomasz.lichwala@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
drivers/net/ethernet/intel/ice/ice_ethtool.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index e6a20af..30d2550 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -2373,8 +2373,9 @@ ice_get_link_ksettings(struct net_device *netdev,
break;
}
- /* flow control is symmetric and always supported */
+ /* flow control is symmetric or asymmetric and always supported */
ethtool_link_ksettings_add_link_mode(ks, supported, Pause);
+ ethtool_link_ksettings_add_link_mode(ks, supported, Asym_Pause);
caps = kzalloc_obj(*caps);
if (!caps)
--
2.52.0
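For background on why the Asym_Pause bit matters for the negotiated result, here is a sketch of the conventional pause resolution from the Pause/Asym_Pause bits of both link ends, as I understand the IEEE 802.3 rules. The demo_* names are hypothetical, and this is not claimed to be the code path ethtool or the ice driver actually uses.

```c
#include <stdbool.h>

/* Hypothetical view of the two pause bits one side advertises. */
struct demo_pause_adv {
	bool pause;	/* Pause */
	bool asym;	/* Asym_Pause */
};

static void demo_resolve_pause(struct demo_pause_adv local,
			       struct demo_pause_adv partner,
			       bool *tx, bool *rx)
{
	*tx = false;
	*rx = false;

	if (local.pause && partner.pause) {
		/* Symmetric flow control in both directions. */
		*tx = true;
		*rx = true;
	} else if (local.asym && partner.asym) {
		/* Asymmetric: direction depends on who also set Pause. */
		if (local.pause)
			*rx = true;
		else if (partner.pause)
			*tx = true;
	}
}
```

In this model every one-directional outcome requires the Asym_Pause bit on both ends, so a side that never reports Asym_Pause can only ever resolve to symmetric pause or none.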
* [PATCH iwl-net 0/4] ice: fixes for pause reporting, autoneg, RDMA and EIPE
From: Aleksandr Loktionov @ 2026-04-17 6:29 UTC
To: intel-wired-lan, anthony.l.nguyen, aleksandr.loktionov; +Cc: netdev
This is v2 of the ice fixes patchset for iwl-net.
v2 changes:
- Dropped patch "ice: fix 'adjust' timer programming for E830 devices"
as it has already been applied to the iwl-net tree.
This series fixes four issues in the Intel ice driver:
- Asymmetric Pause capability was missing from the ethtool-reported
supported link modes, causing ethtool to always show Pause as
unsupported even when the hardware supports asymmetric flow control.
- Autoneg disable was only attempted when AN had already completed,
ignoring the case where the link partner does not advertise AN ability
at all (AN37). Both conditions should allow the user to disable
autoneg.
- RDMA was incorrectly disabled on E830 devices with 4 or more ports
because the generic port-limited-capabilities path capped maxtc=4 and
then cleared the RDMA capability bit. E830 does not have that
limitation and must be skipped.
- On E830, Ethernet Inline IPsec Engine (EIPE) decryption errors trigger
a checksum-error path that returned early without reporting the error
to the OS. The packet must be forwarded to the stack with the
checksum error flag set so the OS can handle it correctly.
Jan Glaza (1):
ice: report EIPE checksum errors to the OS on E830
Konrad Knitter (1):
ice: fix autoneg disable when link partner doesn't support AN
Lukasz Czapnik (1):
ice: support RDMA on 4+-port E830 devices
Tomasz Lichwala (1):
ice: fix asymmetric pause negotiation reporting in ethtool
drivers/net/ethernet/intel/ice/ice_common.c | 2 +-
drivers/net/ethernet/intel/ice/ice_ethtool.c | 30 ++++++++++++++++++++++++--
drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 ++
3 files changed, 31 insertions(+), 3 deletions(-)
--
2.52.0
* Re: [PATCH net v3 0/3] net: airoha: Fix airoha_qdma_cleanup_tx_queue() processing
From: Lorenzo Bianconi @ 2026-04-17 6:26 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman
Cc: linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260416-airoha_qdma_cleanup_tx_queue-fix-net-v3-0-2b69f5788580@kernel.org>
> Add missing bits in airoha_qdma_cleanup_tx_queue routine.
> Fix airoha_qdma_cleanup_tx_queue processing errors introduced in commit
> '3f47e67dff1f7 ("net: airoha: Add the capability to consume out-of-order
> DMA tx descriptors")'.
>
> ---
> Changes in v3:
> - Move ndesc initialization fix in a dedicated patch.
> - Add patch 2/3 to move entries to queue head in case of DMA mapping
> failure in airoha_dev_xmit().
> - Cosmetics.
> - Link to v2: https://lore.kernel.org/r/20260414-airoha_qdma_cleanup_tx_queue-fix-net-v2-1-875de57cc022@kernel.org
>
> Changes in v2:
> - Move q->ndesc initialization at end of airoha_qdma_init_tx routine in
> order to avoid any possible NULL pointer dereference in
> airoha_qdma_cleanup_tx_queue()
> - Check if q->tx_list is empty in airoha_qdma_cleanup_tx_queue()
> - Link to v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org
>
> ---
> Lorenzo Bianconi (3):
> net: airoha: Move ndesc initialization at end of airoha_qdma_init_tx()
> net: airoha: Move entries to queue head in case of DMA mapping failure in airoha_dev_xmit()
> net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue()
Please drop this version, I will send a new one dropping patch 2/3.
Regards,
Lorenzo
>
> drivers/net/ethernet/airoha/airoha_eth.c | 42 ++++++++++++++++++++++++++------
> 1 file changed, 35 insertions(+), 7 deletions(-)
> ---
> base-commit: 3f20012a3964f487ae1e9ff942e2f35d4e9595bf
> change-id: 20260410-airoha_qdma_cleanup_tx_queue-fix-net-93375f5ee80f
>
> Best regards,
> --
> Lorenzo Bianconi <lorenzo@kernel.org>
>
* Re: [PATCH net v2 1/2] bnge: fix initial HWRM sequence
From: Vikas Gupta @ 2026-04-17 6:16 UTC
To: Jakub Kicinski
Cc: davem, edumazet, pabeni, andrew+netdev, horms, netdev,
linux-kernel, vsrama-krishna.nemani, bhargava.marreddy,
rajashekar.hudumula, ajit.khaparde, dharmender.garg,
rahul-rg.gupta
In-Reply-To: <20260417022201.1807745-1-kuba@kernel.org>
> > @@ -73,25 +73,35 @@ static int bnge_func_qcaps(struct bnge_dev *bd)
>
> [ ... ]
>
> > return 0;
> > +
> > +err_free_ctx_mem:
> > + bnge_free_ctx_mem(bd);
> > + return rc;
> > }
>
> Could this error path leave the firmware with active DMA pointers to freed
> memory?
>
> Looking at bnge_alloc_ctx_mem(), it configures the firmware with host DMA
> addresses for the backing store via bnge_backing_store_cfg(). If a
> subsequent step like bnge_hwrm_func_resc_qcaps() fails, this new error path
> directly frees the context memory.
HW does not access the context memory unless rings are active (traffic
is running). Therefore, it is safe to free the memory on the host.
>
> Previously, errors in this sequence triggered bnge_fw_unregister_dev(),
> which safely unregistered the firmware before freeing the memory. There is
> even a comment in bnge_fw_unregister_dev() documenting this requirement:
bnge_fw_unregister_dev() is called in both success and failure scenarios.
Therefore, freeing the context memory with bnge_free_ctx_mem() after
bnge_hwrm_func_drv_unrgtr() ensures that the context memory is invalidated
at the hardware level.
>
> static void bnge_fw_unregister_dev(struct bnge_dev *bd)
> {
> /* ctx mem free after unrgtr only */
> bnge_hwrm_func_drv_unrgtr(bd);
> bnge_free_ctx_mem(bd);
> }
>
> Since the firmware isn't told to stop using the memory before it is freed
> here, is there a risk the hardware might DMA into the freed host pages?
There is no communication mechanism to inform the firmware to free the
context memory.
>
> > @@ -132,32 +142,25 @@ static int bnge_fw_register_dev(struct bnge_dev *bd)
>
> [ ... ]
>
> > -err_func_unrgtr:
> > - bnge_fw_unregister_dev(bd);
> > +err_free_ctx_mem:
> > + bnge_free_ctx_mem(bd);
> > return rc;
> > }
>
> This error path appears to have the same regression. If
> bnge_hwrm_func_drv_rgtr() fails after bnge_func_qcaps() has already
> configured the backing store, freeing the context memory directly without
> unregistering might allow the hardware to access freed memory.
Even if bnge_hwrm_func_drv_rgtr() fails, it is still safe to free the context
memory at the host because the driver unloads from this point.
AI reviews appear to ignore logic related to handling context memory
in the patch.
I see no valid comments on the patch.
Thanks,
Vikas
> --
> pw-bot: cr
* [PATCH] net/packet: fix TOCTOU race on mmap'd vnet_hdr in tpacket_snd()
From: Zero Mark @ 2026-04-17 6:07 UTC
To: Willem de Bruijn
Cc: security, David S . Miller, Jakub Kicinski, Eric Dumazet, netdev,
Zero Mark
In-Reply-To: <CAHOBGNBvZOXGzzMDuHWw1RrRvbg4TZVH34jVDhc1nkHbW_URXA@mail.gmail.com>
In tpacket_snd(), when PACKET_VNET_HDR is enabled, vnet_hdr points
directly into the mmap'd TX ring buffer shared with userspace. The
kernel validates the header via __packet_snd_vnet_parse() but then
re-reads all fields later in virtio_net_hdr_to_skb(). A concurrent
userspace thread can modify the vnet_hdr fields (gso_type, gso_size,
flags, csum_start, csum_offset) between validation and use, bypassing
all safety checks.
This can lead to:
- Out-of-bounds checksum writes via crafted csum_start/csum_offset
- Malicious GSO segmentation parameters
- Kernel memory corruption and potential local privilege escalation
The non-TPACKET path (packet_snd()) already correctly copies vnet_hdr
to a stack-local variable. All other vnet_hdr consumers in the kernel
(tun.c, tap.c, virtio_net.c) also use stack copies. The TPACKET TX
path is the only caller of virtio_net_hdr_to_skb() that reads directly
from user-controlled shared memory.
Fix this by copying vnet_hdr from the mmap'd ring buffer to a
stack-local variable before validation and use, consistent with the
approach used in packet_snd() and all other callers.
Exploitation requires CAP_NET_RAW, which can be obtained without
special privileges via user namespaces.
Confirmed with a PoC on Linux 6.8.0 (Ubuntu): kprobe tracing on
skb_partial_csum_set captured 77 race wins in 500,000 iterations.
Affects all kernels since PACKET_VNET_HDR support was added to the
TPACKET TX path (~v3.14).
Fixes: 9ed988e5 ("packet: add vnet_hdr support for tpacket_snd")
Signed-off-by: Zero Mark <patzilla007@gmail.com>
---
net/packet/af_packet.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index abcdef012345..fedcba654321 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2725,7 +2725,8 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame,
static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
{
struct sk_buff *skb = NULL;
struct net_device *dev;
- struct virtio_net_hdr *vnet_hdr = NULL;
+ struct virtio_net_hdr vnet_hdr;
+ bool has_vnet_hdr = false;
struct sockcm_cookie sockc;
__be16 proto;
int err, reserve = 0;
@@ -2828,16 +2829,17 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
if (po->has_vnet_hdr) {
- vnet_hdr = data;
- data += sizeof(*vnet_hdr);
- tp_len -= sizeof(*vnet_hdr);
+ memcpy(&vnet_hdr, data, sizeof(vnet_hdr));
+ data += sizeof(vnet_hdr);
+ tp_len -= sizeof(vnet_hdr);
if (tp_len < 0 ||
- __packet_snd_vnet_parse(vnet_hdr, tp_len)) {
+ __packet_snd_vnet_parse(&vnet_hdr, tp_len)) {
tp_len = -EINVAL;
goto tpacket_error;
}
copylen = __virtio16_to_cpu(vio_le(),
- vnet_hdr->hdr_len);
+ vnet_hdr.hdr_len);
+ has_vnet_hdr = true;
}
copylen = max_t(int, copylen, dev->hard_header_len);
skb = sock_alloc_send_skb(&po->sk,
@@ -2875,11 +2877,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
}
- if (po->has_vnet_hdr) {
- if (virtio_net_hdr_to_skb(skb, vnet_hdr, vio_le())) {
+ if (has_vnet_hdr) {
+ if (virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le())) {
tp_len = -EINVAL;
goto tpacket_error;
}
- virtio_net_hdr_set_proto(skb, vnet_hdr);
+ virtio_net_hdr_set_proto(skb, &vnet_hdr);
}
skb->destructor = tpacket_destruct_skb;
--
2.43.0
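The copy-then-validate pattern this patch applies can be shown in isolation. The sketch below is a userspace illustration with hypothetical demo_* names and a deliberately simplified two-field header and bounds check; it does not reproduce struct virtio_net_hdr or the checks in __packet_snd_vnet_parse().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical header living in memory another party can write. */
struct demo_hdr {
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Example check: the 2-byte checksum field must lie inside the packet. */
static int demo_hdr_valid(const struct demo_hdr *h, size_t pkt_len)
{
	return (size_t)h->csum_start + h->csum_offset + 2 <= pkt_len;
}

/*
 * Take one snapshot of the shared header, validate the snapshot, and hand
 * only the snapshot to later consumers. Returns 0 on success, -1 otherwise.
 */
static int demo_snapshot_and_validate(const void *shared, size_t pkt_len,
				      struct demo_hdr *out)
{
	struct demo_hdr local;

	memcpy(&local, shared, sizeof(local));
	if (!demo_hdr_valid(&local, pkt_len))
		return -1;

	*out = local;
	return 0;
}
```

Once the snapshot is taken, a concurrent writer can still scribble on the shared copy, but the values that were validated are the same values that get used, which is the property the double read in the old code lacked.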
* Re: TCP default settings (bugzilla)
From: plantegg ren @ 2026-04-17 5:58 UTC
To: stephen; +Cc: netdev
Hi Stephen,
I'm the reporter of those two bugs. I'm a DBA and Linux SRE with over
10 years at Alibaba Cloud (Aliyun).
These come from real production pain, not just theory. During my time at
Alibaba Cloud, I pushed to change the default tcp_retries2 from 15 to 7
in Alibaba Cloud Linux 3 (ALinux3) — our in-house distro serving millions
of ECS instances. That change alone eliminated a whole class of prolonged
outages across the fleet.
The most memorable case: MySQL crashed and restarted in seconds, but the
application tier stayed down for ~16 minutes because all existing
connections were stuck in retransmission. After changing tcp_retries2 from
15 to 5, recovery time dropped from 957s to about 20s.
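For reference, the theoretical lower bound behind these numbers can be computed with a few lines of C. This is a rough model under stated assumptions: the retransmission timer starts at 200 ms, doubles on every retry and is capped at 120 s; the constants and the function name are illustrative, and real connections start from a measured RTO, so observed times (such as the 957s above) are somewhat longer.

```c
#include <stdint.h>

/* Assumed bounds of the retransmission timer, in milliseconds. */
#define DEMO_RTO_MIN_MS	200u
#define DEMO_RTO_MAX_MS	120000u

/*
 * Hypothetical time until a connection is declared dead: the sum of the
 * backoff intervals for the initial timeout plus 'retries' retransmissions.
 */
static uint64_t demo_retries2_timeout_ms(unsigned int retries)
{
	uint64_t rto = DEMO_RTO_MIN_MS;
	uint64_t total = 0;
	unsigned int i;

	for (i = 0; i <= retries; i++) {
		total += rto;
		rto *= 2;
		if (rto > DEMO_RTO_MAX_MS)
			rto = DEMO_RTO_MAX_MS;
	}

	return total;
}
```

Under these assumptions the model gives 924.6 s for tcp_retries2=15, 51.0 s for 7 and 12.6 s for 5, which is in the same range as the recovery times reported here.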
The tcp_keepalive_time issue bit us through LVS — connections silently
dropped after 900s of idle time, but TCP didn't notice until 7200s later.
We spent days chasing "random" Connection Reset errors across dozens of
services before tracing it to this mismatch.
Every ops team I've talked to ends up applying these tweaks independently
after getting burned. If a major cloud distro already ships tcp_retries2=7,
maybe it's time for upstream to reconsider the default too.
I did use AI to help format the bug reports (guilty as charged), but the
problems and the data are from years of production experience.
Thanks for forwarding to the list.
Xijun Ren
* Re: [PATCH net-next v4 5/5] selftests: net: bridge: add MRC and QQIC field encoding tests
From: Ujjal Roy @ 2026-04-17 5:57 UTC
To: Ido Schimmel
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
linux-kernel, linux-kselftest
In-Reply-To: <20260413084752.GD209364@shredder>
On Mon, Apr 13, 2026 at 2:18 PM Ido Schimmel <idosch@nvidia.com> wrote:
>
> See some comments below, but note that net-next is closed:
>
> https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/
>
> So you can either wait with v5 until it is open again or post it as RFC
> so that we can at least review (but not merge) it while net-next is
> closed.
Let me clarify the requested changes inline, so that I have v5 ready
when net-next reopens. You can ask me to send it as RFC v5 if you have
doubts about the inline answers.
>
> On Sun, Apr 12, 2026 at 11:10:47AM +0000, Ujjal Roy wrote:
> > Enhance vlmc_query_intvl_test and vlmc_query_response_intvl_test in
> > bridge_vlan_mcast.sh to validate IGMPv3/MLDv2 protocol compliance for
> > MRC and QQIC field encoding across both linear and exponential ranges.
> >
> > TEST: Vlan multicast snooping enable [ OK ]
> > TEST: Vlan mcast_query_interval global option default value [ OK ]
> > INFO: Vlan 10 mcast_query_interval (QQIC) test cases:
> > TEST: Number of tagged IGMPv2 general query [ OK ]
> > TEST: IGMPv3 QQIC linear value 60 [ OK ]
> > TEST: MLDv2 QQIC linear value 60 [ OK ]
> > TEST: IGMPv3 QQIC non linear value 160 [ OK ]
> > TEST: MLDv2 QQIC non linear value 160 [ OK ]
> > TEST: Vlan mcast_query_response_interval global option default value [ OK ]
> > INFO: Vlan 10 mcast_query_response_interval (MRC) test cases:
> > TEST: IGMPv3 MRC linear value 60 [ OK ]
> > TEST: IGMPv3 MRC non linear value 160 [ OK ]
> > TEST: MLDv2 MRC linear value 30000 [ OK ]
> > TEST: MLDv2 MRC non linear value 60000 [ OK ]
> >
> > Signed-off-by: Ujjal Roy <royujjal@gmail.com>
> > ---
> > .../net/forwarding/bridge_vlan_mcast.sh | 150 +++++++++++++++++-
> > 1 file changed, 142 insertions(+), 8 deletions(-)
> >
> > diff --git a/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh b/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh
> > index e8031f68200a..9f9f33d58286 100755
> > --- a/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh
> > +++ b/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh
> > @@ -162,14 +162,27 @@ vlmc_query_cnt_setup()
> > {
> > local type=$1
> > local dev=$2
> > + local match=$3
> >
> > if [[ $type == "igmp" ]]; then
> > - tc filter add dev $dev egress pref 10 prot 802.1Q \
> > + # This matches: IP Protocol 2 (IGMP)
> > + tc filter add dev "$dev" egress pref 10 prot 802.1Q \
> > flower vlan_id 10 vlan_ethtype ipv4 dst_ip 224.0.0.1 ip_proto 2 \
> > + action continue
> > + # AND Type 0x11 (Query) at offset 24 after IP
> > + # IP (20 byte IP + 4 bytes Option)
>
> Let's make it clearer: 20 bytes IPv4 header + 4 bytes Router Alert option
# 20 bytes IPv4 header + 4 bytes Router Alert option + IGMP[offset 0] Query
>
> > + match=(match u8 0x11 0xff at 24 $match)
> > + tc filter add dev "$dev" egress pref 20 prot 802.1Q u32 "${match[@]}" \
> > action pass
> > else
> > - tc filter add dev $dev egress pref 10 prot 802.1Q \
> > + # This matches: ICMPv6
> > + tc filter add dev "$dev" egress pref 10 prot 802.1Q \
> > flower vlan_id 10 vlan_ethtype ipv6 dst_ip ff02::1 ip_proto icmpv6 \
> > + action continue
> > + # AND Type 0x82 (Query) at offset 48 after IPv6
> > + # IPv6 (40 bytes IPv6 + 2 bytes next HDR + 4 bytes Option + 2 byte pad)
>
> Same: 40 bytes IPv6 header + 8 bytes Hop-by-hop option
# 40 bytes IPv6 header + 8 bytes Hop-by-hop option + MLD[offset 0] Query
>
> > + match=(match u8 0x82 0xff at 48 $match)
> > + tc filter add dev "$dev" egress pref 20 prot 802.1Q u32 "${match[@]}" \
> > action pass
> > fi
>
> Sashiko has a relevant comment:
>
> "
> Does this configuration evaluate all packets against the pref 20 filter,
> regardless of the pref 10 result?
>
> In tc, if a packet does not match a filter, classification automatically falls
> through to the next priority filter. By using "action continue" on pref 10,
> matching packets are also instructed to continue evaluation at the next filter.
>
> Because both matching and non-matching packets proceed to pref 20, pref 10
> seems to act as a no-op gate. Could this cause the u32 rules in pref 20 to
> inadvertently match unrelated background traffic on the interface?
>
> To implement a logical AND across different classifiers, should pref 10 use
> "action goto chain 1" with pref 20 placed inside chain 1?
> "
Answer: No, it should match IGMP only with the pref 10 filter AND the
IGMPv3 Query with the pref 20 filter. The Query filter may include an
additional match for QQIC/MRC.
Here is my new filter:
tc filter add dev "$dev" egress pref 10 prot 802.1Q \
flower vlan_id 10 vlan_ethtype ipv4 dst_ip 224.0.0.1 ip_proto 2 \
action goto chain 1
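A rough sketch of how the two filters could fit together with "goto chain 1", assuming the pref 20 u32 filter is moved into chain 1 as suggested in the quoted comment. The helper name and the TC dry-run variable are hypothetical and not part of the selftest; by default the commands are only printed.

```shell
#!/bin/bash
# Hypothetical dry-run switch: prints the commands unless TC=tc is set.
TC=${TC:-echo tc}

# Hypothetical helper: $1 is the device, remaining args are an optional
# extra u32 match (e.g. the QQIC/MRC match).
vlmc_igmp_query_filters() {
	local dev=$1; shift
	local match=(match u8 0x11 0xff at 24 "$@")

	# Chain 0: only VLAN 10 IGMP to 224.0.0.1 jumps to chain 1,
	# everything else never reaches the u32 filter.
	$TC filter add dev "$dev" egress pref 10 prot 802.1Q \
		flower vlan_id 10 vlan_ethtype ipv4 dst_ip 224.0.0.1 ip_proto 2 \
		action goto chain 1
	# Chain 1: IGMP Query (type 0x11) plus any extra QQIC/MRC match.
	$TC filter add dev "$dev" egress chain 1 pref 20 prot 802.1Q \
		u32 "${match[@]}" action pass
}

vlmc_igmp_query_filters swp1 match u8 0x84 0xff at 33
```

With this layout the filter deletion and the stats lookup for pref 20 would presumably have to name chain 1 as well; that part is not shown here.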
>
> >
> > @@ -181,7 +194,53 @@ vlmc_query_cnt_cleanup()
> > local dev=$1
> >
> > ip link set dev br0 type bridge mcast_stats_enabled 0
> > - tc filter del dev $dev egress pref 10
> > + tc filter del dev "$dev" egress pref 20
> > + tc filter del dev "$dev" egress pref 10
> > +}
> > +
> > +vlmc_query_get_intvl_match()
> > +{
> > + local type=$1
> > + local version=$2
> > + local test=$3
> > + local interval=$4
> > +
> > + if [ "$test" = "qqic" ]; then
> > + # QQIC is 8-bit floating point encoding for IGMPv3 and MLDv2
> > + if [ "${type}v${version}" = "igmpv3" ]; then
> > + # IP 20 bytes + 4 bytes Option + IGMPv3[9]
> > + if [[ $interval -lt 128 ]]; then
> > + echo "match u8 0x3c 0xff at 33"
>
> Please pass the expected value as an argument instead of hard coding
> "0x3c" here. Same in other places in the function.
Will pass the expected code as an argument. Also will update the comments here.
# 20 bytes IPv4 header + 4 bytes Router Alert option + IGMPv3[offset 9] QQIC
>
> > + else
> > + echo "match u8 0x84 0xff at 33"
> > + fi
> > + elif [ "${type}v${version}" = "mldv2" ]; then
> > + # IPv6 40 + 2 next HDR + 4 Option + 2 pad + MLDv2[25]
> > + if [[ $interval -lt 128 ]]; then
> > + echo "match u8 0x3c 0xff at 73"
> > + else
> > + echo "match u8 0x84 0xff at 73"
> > + fi
> > + fi
> > + elif [ "$test" = "mrc" ]; then
> > + if [ "${type}v${version}" = "igmpv3" ]; then
> > + # MRC is 8-bit floating point encoding for IGMPv3
> > + # IP 20 bytes + 4 bytes Option + IGMPv3[1]
> > + if [[ $interval -lt 128 ]]; then
> > + echo "match u8 0x3c 0xff at 25"
> > + else
> > + echo "match u8 0x84 0xff at 25"
> > + fi
> > + elif [ "${type}v${version}" = "mldv2" ]; then
> > + # MRC is 16-bit floating point encoding for MLDv2
> > + # IPv6 40 + 2 next HDR + 4 Option + 2 pad + MLDv2[4]
> > + if [[ $interval -lt 32768 ]]; then
> > + echo "match u16 0x7530 0xffff at 52"
> > + else
> > + echo "match u16 0x8d4c 0xffff at 52"
> > + fi
> > + fi
> > + fi
> > }
> >
> > vlmc_check_query()
> > @@ -191,9 +250,13 @@ vlmc_check_query()
> > local dev=$3
> > local expect=$4
> > local time=$5
> > + local test=$6
> > + local interval=$7
> > + local intvl_match=""
> > local ret=0
> >
> > - vlmc_query_cnt_setup $type $dev
> > + intvl_match="$(vlmc_query_get_intvl_match "$type" "$version" "$test" "$interval")"
> > + vlmc_query_cnt_setup "$type" "$dev" "$intvl_match"
> >
> > local pre_tx_xstats=$(vlmc_query_cnt_xstats $type $version $dev)
> > bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_querier 1
> > @@ -201,7 +264,7 @@ vlmc_check_query()
> > if [[ $ret -eq 0 ]]; then
> > sleep $time
> >
> > - local tcstats=$(tc_rule_stats_get $dev 10 egress)
> > + local tcstats=$(tc_rule_stats_get "$dev" 20 egress)
> > local post_tx_xstats=$(vlmc_query_cnt_xstats $type $version $dev)
> >
> > if [[ $tcstats != $expect || \
> > @@ -441,6 +504,7 @@ vlmc_query_intvl_test()
> > check_err $? "Wrong default mcast_query_interval global vlan option value"
> > log_test "Vlan mcast_query_interval global option default value"
> >
> > + log_info "Vlan 10 mcast_query_interval (QQIC) test cases:"
>
> Let's remove this as it makes the output confusing:
Sure, I will remove this line.
>
> INFO: Vlan 10 mcast_query_response_interval (MRC) test cases:
> TEST: IGMPv3 MRC linear value 60 [ OK ]
> [...]
> TEST: Flood unknown vlan multicast packets to router port only [ OK ]
> TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ]
>
> > RET=0
> > bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_startup_query_count 0
> > bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 200
> > @@ -448,8 +512,42 @@ vlmc_query_intvl_test()
> > # 1 is sent immediately, then 2 more in the next 5 seconds
> > vlmc_check_query igmp 2 $swp1 3 5
> > check_err $? "Wrong number of tagged IGMPv2 general queries sent"
> > - log_test "Vlan 10 mcast_query_interval option changed to 200"
> > + log_test "Number of tagged IGMPv2 general query"
> >
> > + RET=0
> > + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_igmp_version 3
> > + check_err $? "Could not set mcast_igmp_version in vlan 10"
> > + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_mld_version 2
> > + check_err $? "Could not set mcast_mld_version in vlan 10"
> > + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 6000
> > + check_err $? "Could not set mcast_query_interval in vlan 10"
> > + # 1 is sent immediately, IGMPv3 QQIC should match with linear value 60s
> > + vlmc_check_query igmp 3 $swp1 1 1 qqic 60
> > + check_err $? "Wrong QQIC in generated IGMPv3 general queries"
> > + log_test "IGMPv3 QQIC linear value 60"
> > +
> > + RET=0
> > + # 1 is sent immediately, MLDv2 QQIC should match with linear value 60s
> > + vlmc_check_query mld 2 $swp1 1 1 qqic 60
> > + check_err $? "Wrong QQIC in generated MLDv2 general queries"
> > + log_test "MLDv2 QQIC linear value 60"
> > +
> > + RET=0
> > + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 16000
> > + check_err $? "Could not set mcast_query_interval in vlan 10"
> > + # 1 is sent immediately, IGMPv3 QQIC should match with non linear value 160s
> > + vlmc_check_query igmp 3 $swp1 1 1 qqic 160
> > + check_err $? "Wrong QQIC in generated IGMPv3 general queries"
> > + log_test "IGMPv3 QQIC non linear value 160"
> > +
> > + RET=0
> > + # 1 is sent immediately, MLDv2 QQIC should match with non linear value 160s
> > + vlmc_check_query mld 2 $swp1 1 1 qqic 160
> > + check_err $? "Wrong QQIC in generated MLDv2 general queries"
> > + log_test "MLDv2 QQIC non linear value 160"
> > +
> > + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_igmp_version 2
> > + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_mld_version 1
> > bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_startup_query_count 2
> > bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 12500
> > }
> > @@ -468,11 +566,47 @@ vlmc_query_response_intvl_test()
> > check_err $? "Wrong default mcast_query_response_interval global vlan option value"
> > log_test "Vlan mcast_query_response_interval global option default value"
> >
> > + log_info "Vlan 10 mcast_query_response_interval (MRC) test cases:"
>
> Same
I will remove this line also.
[...]
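For passing the expected code as an argument, the values hard-coded in the patch (0x3c, 0x84, 0x7530, 0x8d4c) can be derived from the interval rather than typed in. Below is a minimal sketch of the exponential encoding from RFC 3376 / RFC 3810 (1 marker bit, 3-bit exponent, 4-bit or 12-bit mantissa). The function names are hypothetical, and rounding down when the value is not exactly representable is an assumption, not necessarily what the bridge code does.

```shell
#!/bin/bash
# fp_encode VALUE MANT_BITS
# MANT_BITS=4:  8-bit code (IGMPv3 MRC/QQIC, MLDv2 QQIC)
# MANT_BITS=12: 16-bit code (MLDv2 MRC)
fp_encode() {
	local v=$1 mbits=$2 exp=0
	local limit=$(( 1 << (mbits + 3) ))              # 128 or 32768
	local width=$(( (mbits + 4) / 4 ))               # hex digits: 2 or 4
	local max=$(( ((1 << (mbits + 1)) - 1) << 10 ))  # largest encodable value

	if (( v < limit )); then
		# Below the limit the code is the value itself (linear).
		printf "0x%0${width}x\n" "$v"
		return
	fi
	if (( v > max )); then v=$max; fi
	# Find the smallest exponent so that mantissa plus hidden bit fits.
	while (( (v >> (exp + 3)) >= (1 << (mbits + 1)) )); do
		exp=$(( exp + 1 ))
	done
	local mant=$(( (v >> (exp + 3)) & ((1 << mbits) - 1) ))
	printf "0x%0${width}x\n" $(( limit | (exp << mbits) | mant ))
}

# fp_decode CODE MANT_BITS: inverse of fp_encode.
fp_decode() {
	local c=$(( $1 )) mbits=$2
	local limit=$(( 1 << (mbits + 3) ))

	if (( c < limit )); then echo "$c"; return; fi
	local exp=$(( (c >> mbits) & 0x7 ))
	local mant=$(( c & ((1 << mbits) - 1) ))
	echo $(( (mant | (1 << mbits)) << (exp + 3) ))
}

fp_encode 60 4       # 0x3c
fp_encode 160 4      # 0x84
fp_encode 30000 12   # 0x7530
fp_encode 60000 12   # 0x8d4c
```

The four calls reproduce the match values used in vlmc_query_get_intvl_match(), so the helper could take the already-encoded code and drop the "-lt 128" / "-lt 32768" branches.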
^ permalink raw reply
* [PATCH] gtp: disable BH before calling udp_tunnel_xmit_skb()
From: David Carlier @ 2026-04-17 5:54 UTC (permalink / raw)
To: Pablo Neira Ayuso, Harald Welte, Andrew Lunn, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
Cc: Weiming Shi, osmocom-net-gprs, netdev, linux-kernel,
David Carlier, stable
gtp_genl_send_echo_req() runs as a generic netlink doit handler in
process context with BH not disabled. It calls udp_tunnel_xmit_skb(),
which eventually invokes iptunnel_xmit() — that uses __this_cpu_inc/dec
on softnet_data.xmit.recursion to track the tunnel xmit recursion level.
Without local_bh_disable(), the task may migrate between
dev_xmit_recursion_inc() and dev_xmit_recursion_dec(), breaking the
per-CPU counter pairing. The result is stale or negative recursion
levels that can later produce false-positive
SKB_DROP_REASON_RECURSION_LIMIT drops on either CPU.
The other udp_tunnel_xmit_skb() call sites in gtp.c are unaffected:
the data path runs under ndo_start_xmit and the echo response handlers
run from the UDP encap rx softirq, both with BH already disabled.
Fix it by disabling BH around the udp_tunnel_xmit_skb() call, mirroring
commit 2cd7e6971fc2 ("sctp: disable BH before calling
udp_tunnel_xmit_skb()").
Fixes: 6f1a9140ecda ("net: add xmit recursion limit to tunnel xmit functions")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
---
drivers/net/gtp.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c
index 70b9e58b9b78..5150f2e4f66b 100644
--- a/drivers/net/gtp.c
+++ b/drivers/net/gtp.c
@@ -2400,6 +2400,7 @@ static int gtp_genl_send_echo_req(struct sk_buff *skb, struct genl_info *info)
return -ENODEV;
}
+ local_bh_disable();
udp_tunnel_xmit_skb(rt, sk, skb_to_send,
fl4.saddr, fl4.daddr,
inet_dscp_to_dsfield(fl4.flowi4_dscp),
@@ -2409,6 +2410,7 @@ static int gtp_genl_send_echo_req(struct sk_buff *skb, struct genl_info *info)
!net_eq(sock_net(sk),
dev_net(gtp->dev)),
false, 0);
+ local_bh_enable();
return 0;
}
--
2.53.0
^ permalink raw reply related
* [net-next v2 5/5] net: stmmac: starfive: Add STMMAC_FLAG_SPH_DISABLE flag
From: Minda Chen @ 2026-04-17 2:45 UTC (permalink / raw)
To: Alexandre Torgue, Andrew Lunn, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Maxime Coquelin,
Emil Renner Berthing, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, netdev
Cc: linux-kernel, linux-stm32, devicetree, Minda Chen
In-Reply-To: <20260417024523.107786-1-minda.chen@starfivetech.com>
Set STMMAC_FLAG_SPH_DISABLE by default so that split header is
disabled on all StarFive SoCs.
Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
---
drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c
index 91698c763dac..9146b498658d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c
@@ -145,7 +145,7 @@ static int starfive_dwmac_probe(struct platform_device *pdev)
}
dwmac->dev = &pdev->dev;
- plat_dat->flags |= STMMAC_FLAG_EN_TX_LPI_CLK_PHY_CAP;
+ plat_dat->flags |= (STMMAC_FLAG_EN_TX_LPI_CLK_PHY_CAP | STMMAC_FLAG_SPH_DISABLE);
plat_dat->bsp_priv = dwmac;
plat_dat->dma_cfg->dche = true;
--
2.17.1
^ permalink raw reply related
* [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
From: Tariq Toukan @ 2026-04-17 5:02 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Daniel Zahka, Willem de Bruijn, Cosmin Ratiu,
Raed Salem, Rahul Rameshbabu, Dragos Tatulea, Kees Cook, netdev,
linux-rdma, linux-kernel, Gal Pressman
In-Reply-To: <20260417050201.192070-1-tariqt@nvidia.com>
From: Cosmin Ratiu <cratiu@nvidia.com>
priv->psp->psp is initialized with the PSP device as returned by
psp_dev_create(). This could also return an error pointer, in which case
a future psp_dev_unregister() will be called on that error pointer,
resulting in an invalid access.
Avoid that by using a local variable and only saving the PSP device when
registration succeeds.
Also apply some light refactoring of the functions managing the PSP
device in order to make them more readable/safe.
Fixes: 89ee2d92f66c ("net/mlx5e: Support PSP offload functionality")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../mellanox/mlx5/core/en_accel/psp.c | 36 ++++++++++---------
1 file changed, 20 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c
index 6a50b6dec0fa..d9adb993e64d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/psp.c
@@ -1070,29 +1070,37 @@ static struct psp_dev_ops mlx5_psp_ops = {
void mlx5e_psp_unregister(struct mlx5e_priv *priv)
{
- if (!priv->psp || !priv->psp->psp)
+ struct mlx5e_psp *psp = priv->psp;
+
+ if (!psp || !psp->psp)
return;
- psp_dev_unregister(priv->psp->psp);
+ psp_dev_unregister(psp->psp);
+ psp->psp = NULL;
}
void mlx5e_psp_register(struct mlx5e_priv *priv)
{
+ struct mlx5e_psp *psp = priv->psp;
+ struct psp_dev *psd;
+
/* FW Caps missing */
if (!priv->psp)
return;
- priv->psp->caps.assoc_drv_spc = sizeof(u32);
- priv->psp->caps.versions = 1 << PSP_VERSION_HDR0_AES_GCM_128;
+ psp->caps.assoc_drv_spc = sizeof(u32);
+ psp->caps.versions = 1 << PSP_VERSION_HDR0_AES_GCM_128;
if (MLX5_CAP_PSP(priv->mdev, psp_crypto_esp_aes_gcm_256_encrypt) &&
MLX5_CAP_PSP(priv->mdev, psp_crypto_esp_aes_gcm_256_decrypt))
- priv->psp->caps.versions |= 1 << PSP_VERSION_HDR0_AES_GCM_256;
+ psp->caps.versions |= 1 << PSP_VERSION_HDR0_AES_GCM_256;
- priv->psp->psp = psp_dev_create(priv->netdev, &mlx5_psp_ops,
- &priv->psp->caps, NULL);
- if (IS_ERR(priv->psp->psp))
+ psd = psp_dev_create(priv->netdev, &mlx5_psp_ops, &psp->caps, NULL);
+ if (IS_ERR(psd)) {
mlx5_core_err(priv->mdev, "PSP failed to register due to %pe\n",
- priv->psp->psp);
+ psd);
+ return;
+ }
+ psp->psp = psd;
}
int mlx5e_psp_init(struct mlx5e_priv *priv)
@@ -1131,22 +1139,18 @@ int mlx5e_psp_init(struct mlx5e_priv *priv)
if (!psp)
return -ENOMEM;
- priv->psp = psp;
fs = mlx5e_accel_psp_fs_init(priv);
if (IS_ERR(fs)) {
err = PTR_ERR(fs);
- goto out_err;
+ kfree(psp);
+ return err;
}
psp->fs = fs;
+ priv->psp = psp;
mlx5_core_dbg(priv->mdev, "PSP attached to netdevice\n");
return 0;
-
-out_err:
- priv->psp = NULL;
- kfree(psp);
- return err;
}
void mlx5e_psp_cleanup(struct mlx5e_priv *priv)
--
2.44.0
^ permalink raw reply related
* [PATCH net 2/2] net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable
From: Tariq Toukan @ 2026-04-17 5:02 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Daniel Zahka, Willem de Bruijn, Cosmin Ratiu,
Raed Salem, Rahul Rameshbabu, Dragos Tatulea, Kees Cook, netdev,
linux-rdma, linux-kernel, Gal Pressman
In-Reply-To: <20260417050201.192070-1-tariqt@nvidia.com>
From: Cosmin Ratiu <cratiu@nvidia.com>
devlink reload while PSP connections are active does:
mlx5_unload_one_devl_locked() -> mlx5_detach_device()
-> _mlx5e_suspend()
-> mlx5e_detach_netdev()
-> profile->cleanup_rx
-> profile->cleanup_tx
-> mlx5e_destroy_mdev_resources() -> mlx5_core_dealloc_pd() fails:
...
mlx5_core 0000:08:00.0: mlx5_cmd_out_err:821:(pid 19722):
DEALLOC_PD(0x801) op_mod(0x0) failed, status bad resource state(0x9),
syndrome (0xef0c8a), err(-22)
...
The reason for failure is the existence of TX keys, which are removed by
the PSP dev unregistration happening in:
profile->cleanup() -> mlx5e_psp_unregister() -> mlx5e_psp_cleanup()
-> psp_dev_unregister()
...but this isn't invoked in the devlink reload flow, only when changing
the NIC profile (e.g. when transitioning to switchdev mode) or on dev
teardown.
Move PSP device registration into mlx5e_nic_enable(), and unregistration
into the corresponding mlx5e_nic_disable(). These functions are called
during netdev attach/detach after RX & TX are set up.
This ensures that the keys will be gone by the time the PD is destroyed.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Fixes: 89ee2d92f66c ("net/mlx5e: Support PSP offload functionality")
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6c4eeb88588c..c3938a2dbbfe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6021,7 +6021,6 @@ static int mlx5e_nic_init(struct mlx5_core_dev *mdev,
if (take_rtnl)
rtnl_lock();
- mlx5e_psp_register(priv);
/* update XDP supported features */
mlx5e_set_xdp_feature(priv);
@@ -6034,7 +6033,6 @@ static int mlx5e_nic_init(struct mlx5_core_dev *mdev,
static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
{
mlx5e_health_destroy_reporters(priv);
- mlx5e_psp_unregister(priv);
mlx5e_ktls_cleanup(priv);
mlx5e_psp_cleanup(priv);
mlx5e_fs_cleanup(priv->fs);
@@ -6158,6 +6156,7 @@ static void mlx5e_nic_enable(struct mlx5e_priv *priv)
mlx5e_fs_init_l2_addr(priv->fs, netdev);
mlx5e_ipsec_init(priv);
+ mlx5e_psp_register(priv);
err = mlx5e_macsec_init(priv);
if (err)
@@ -6228,6 +6227,7 @@ static void mlx5e_nic_disable(struct mlx5e_priv *priv)
mlx5_lag_remove_netdev(mdev, priv->netdev);
mlx5_vxlan_reset_to_default(mdev->vxlan);
mlx5e_macsec_cleanup(priv);
+ mlx5e_psp_unregister(priv);
mlx5e_ipsec_cleanup(priv);
}
--
2.44.0
^ permalink raw reply related
* [PATCH net 0/2] mlx5e PSP fixes
From: Tariq Toukan @ 2026-04-17 5:01 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller
Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Daniel Zahka, Willem de Bruijn, Cosmin Ratiu,
Raed Salem, Rahul Rameshbabu, Dragos Tatulea, Kees Cook, netdev,
linux-rdma, linux-kernel, Gal Pressman
Hi,
This patchset provides bug fixes from Cosmin to the mlx5e PSP feature.
Thanks,
Tariq.
Cosmin Ratiu (2):
net/mlx5e: psp: Fix invalid access on PSP dev registration fail
net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable
.../mellanox/mlx5/core/en_accel/psp.c | 36 ++++++++++---------
.../net/ethernet/mellanox/mlx5/core/en_main.c | 4 +--
2 files changed, 22 insertions(+), 18 deletions(-)
base-commit: 82c21069028c5db3463f851ae8ac9cc2e38a3827
--
2.44.0
^ permalink raw reply
* Re: [net-next v2 3/5] dt-bindings: net: starfive,jh7110-dwmac: Add JHB100 sgmii rx clk
From: Rob Herring (Arm) @ 2026-04-17 4:36 UTC (permalink / raw)
To: Minda Chen
Cc: Paolo Abeni, Jakub Kicinski, Alexandre Torgue, Maxime Coquelin,
Emil Renner Berthing, Andrew Lunn, David S . Miller, linux-stm32,
devicetree, netdev, Krzysztof Kozlowski, Rob Herring,
linux-kernel, Eric Dumazet, Conor Dooley
In-Reply-To: <20260417024523.107786-4-minda.chen@starfivetech.com>
On Fri, 17 Apr 2026 10:45:21 +0800, Minda Chen wrote:
> JHB100 SGMII interface tx/rx mac clock is split and require to
> set clock rate in 10M/100M/1000M speed. So dts need to add a
> new rx clock in code, dts and dt binding doc.
>
> Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
> ---
> .../bindings/net/starfive,jh7110-dwmac.yaml | 42 ++++++++++++++++---
> 1 file changed, 36 insertions(+), 6 deletions(-)
>
My bot found errors running 'make dt_binding_check' on your patch:
yamllint warnings/errors:
./Documentation/devicetree/bindings/net/starfive,jh7110-dwmac.yaml:56:8: [warning] wrong indentation: expected 8 but found 7 (indentation)
dtschema/dtc warnings/errors:
doc reference errors (make refcheckdocs):
See https://patchwork.kernel.org/project/devicetree/patch/20260417024523.107786-4-minda.chen@starfivetech.com
The base for the series is generally the latest rc1. A different dependency
should be noted in *this* patch.
If you already ran 'make dt_binding_check' and didn't see the above
error(s), then make sure 'yamllint' is installed and dt-schema is up to
date:
pip3 install dtschema --upgrade
Please check and re-submit after running the above command yourself. Note
that DT_SCHEMA_FILES can be set to your schema file to speed up checking
your schema. However, it must be unset to test all examples with your schema.
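For reference, the check cycle described above could look like the sketch below. It assumes the current directory is a kernel source tree; installing yamllint through pip is an assumption, a distro package works as well.

```shell
# Update the schema tooling (yamllint via pip is an assumption).
pip3 install dtschema yamllint --upgrade

# Fast path: validate only the binding touched by this patch.
make dt_binding_check DT_SCHEMA_FILES=starfive,jh7110-dwmac.yaml

# Full run: leave DT_SCHEMA_FILES unset so all examples are tested.
make dt_binding_check
```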
^ permalink raw reply
* Re: [PATCH net] net/sched: act_ct: fix skb leak on fragment check failure
From: phx @ 2026-04-17 3:56 UTC (permalink / raw)
To: Jamal Hadi Salim; +Cc: netdev, jiri, horms
In-Reply-To: <CAM0EoMmi0tdhB9ECmcpPea7iSFm5AiLme71cw5zXK+WVUZGEMw@mail.gmail.com>
[-- Attachment #1.1: Type: text/plain, Size: 2826 bytes --]
Found it through code review. Reproduced it on a 7.0-rc6 kernel
using a veth pair with act_ct on ingress:
ip netns add ns_ct
ip link add veth0 type veth peer name veth1
ip link set veth1 netns ns_ct
ip link set veth0 up
ip netns exec ns_ct ip link set veth1 up
tc qdisc add dev veth0 clsact
tc filter add dev veth0 ingress protocol ip flower action ct zone 1
Then send truncated IP packets (10 bytes IP, need 20 minimum) from
the namespace via raw AF_PACKET socket on veth1. This hits
pskb_may_pull failure in tcf_ct_ipv4_is_fragment -> -EINVAL ->
out_frag -> TC_ACT_CONSUMED. net/core/dev.c handles TC_ACT_CONSUMED
by returning NULL without freeing the skb.
Result on unpatched kernel:
Sent: 222 packets
skbuff_head_cache: before=6439 after=6663 growth=224
FAIL: skb leak detected (224 objects leaked)
Attached a test script that automates this. With the fix applied
(TC_ACT_SHOT for non-EINPROGRESS errors), the skbs get freed and
the test passes. The test script was generated by AI.
On Thu, Apr 16, 2026 at 11:32 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On Thu, Apr 16, 2026 at 9:01 AM Dudu Lu <phx0fer@gmail.com> wrote:
> >
> > When tcf_ct_handle_fragments() returns an error other than -EINPROGRESS
> > (e.g. -EINVAL from malformed fragments), tcf_ct_act() jumps to out_frag
> > which unconditionally returns TC_ACT_CONSUMED. This tells the caller the
> > skb was consumed, but it was not freed, leaking one skb per malformed
> > fragment.
> >
> > TC_ACT_CONSUMED is only correct for -EINPROGRESS, where defragmentation
> > is genuinely in progress and the skb has been queued. For all other
> > errors the skb is still owned by the caller and must be freed via
> > TC_ACT_SHOT.
> >
> > Fixes: 3f14b377d01d ("net/sched: act_ct: fix skb leak and crash on ooo frags")
> > Signed-off-by: Dudu Lu <phx0fer@gmail.com>
>
> Do you have a reproducer? Always helps adding at least a tdc test.
> Also: How did you find this issue? was it AI? If yes, please add the
> tag "Assisted-by:<AI name here>"
>
> cheers,
> jamal
>
> > ---
> > net/sched/act_ct.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> > index 7d5e50c921a0..870655f682bd 100644
> > --- a/net/sched/act_ct.c
> > +++ b/net/sched/act_ct.c
> > @@ -1107,8 +1107,10 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
> > return retval;
> >
> > out_frag:
> > - if (err != -EINPROGRESS)
> > + if (err != -EINPROGRESS) {
> > tcf_action_inc_drop_qstats(&c->common);
> > + return TC_ACT_SHOT;
> > + }
> > return TC_ACT_CONSUMED;
> >
> > drop:
> > --
> > 2.39.3 (Apple Git-145)
> >
>
[-- Attachment #1.2: Type: text/html, Size: 3592 bytes --]
[-- Attachment #2: act_ct_skb_leak.sh --]
[-- Type: text/x-sh, Size: 3349 bytes --]
#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
#
# Test for skb leak in act_ct when tcf_ct_handle_fragments() fails.
#
# When a truncated IP packet hits act_ct, tcf_ct_ipv4_is_fragment()
# returns -EINVAL. The out_frag path returns TC_ACT_CONSUMED, but the
# skb was never actually consumed — it leaks.
#
# This test sends truncated IP packets through a veth pair with act_ct
# on ingress and checks skbuff_head_cache slab growth.
set -e
readonly NS="ns-act-ct-leak-$(mktemp -u XXXXXX)"
readonly DEV="veth-ct0"
readonly DEV_PEER="veth-ct1"
readonly NUM_PKTS=500
# Minimum slab growth to consider a leak (allow some noise)
readonly LEAK_THRESHOLD=200
cleanup() {
ip link del $DEV 2>/dev/null || true
ip netns del $NS 2>/dev/null || true
}
trap cleanup EXIT
get_skb_active() {
awk '/skbuff_head_cache/ {print $2}' /proc/slabinfo
}
# Build the packet sender
build_sender() {
local prog="$1"
local src="$prog.c"
cat > "$src" << 'EOF'
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <net/ethernet.h>
#include <arpa/inet.h>
#include <fcntl.h>
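int main(int argc, char **argv) {
const char *ifname;
int count, fd, i, sent = 0;
/* Both arguments are required: <ifname> <count> */
if (argc < 3) { fprintf(stderr, "usage: %s <ifname> <count>\n", argv[0]); return 1; }
ifname = argv[1];
count = atoi(argv[2]);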
fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (fd < 0) { perror("socket"); return 1; }
fcntl(fd, F_SETFL, O_NONBLOCK);
struct sockaddr_ll sa = {};
sa.sll_family = AF_PACKET;
sa.sll_protocol = htons(ETH_P_IP);
sa.sll_ifindex = if_nametoindex(ifname);
sa.sll_halen = 6;
memset(sa.sll_addr, 0xff, 6);
/* Ethernet(14) + truncated IP(10) = 24 bytes.
* IP header needs 20 bytes minimum, so pskb_may_pull fails
* in tcf_ct_ipv4_is_fragment() -> -EINVAL.
*/
unsigned char pkt[24] = {};
memset(pkt, 0xff, 6); /* dst = broadcast */
pkt[12] = 0x08; pkt[13] = 0x00;/* ethertype = IPv4 */
pkt[14] = 0x45; /* ver=4, ihl=5 */
pkt[16] = 0x00; pkt[17] = 0x0a;/* total_len=10 */
pkt[23] = 0x06; /* proto=TCP */
for (i = 0; i < count; i++) {
if (sendto(fd, pkt, sizeof(pkt), 0,
(struct sockaddr *)&sa, sizeof(sa)) >= 0)
sent++;
}
printf("%d\n", sent);
close(fd);
return 0;
}
EOF
gcc -Wall -o "$prog" "$src"
}
echo "=== act_ct skb leak test ==="
# Setup veth pair with namespace
ip netns add $NS
ip link add $DEV type veth peer name $DEV_PEER
ip link set $DEV_PEER netns $NS
ip link set $DEV up
ip netns exec $NS ip link set $DEV_PEER up
# Add act_ct filter on ingress
tc qdisc add dev $DEV clsact
tc filter add dev $DEV ingress protocol ip flower action ct zone 1
# Build sender
SENDER=$(mktemp)
build_sender "$SENDER"
# Record slab state
BEFORE=$(get_skb_active)
# Send truncated packets from namespace
SENT=$(ip netns exec $NS "$SENDER" $DEV_PEER $NUM_PKTS)
sleep 1
# Record slab state again
AFTER=$(get_skb_active)
# Check tc stats
DROPPED=$(tc -s filter show dev $DEV ingress | \
awk '/dropped/ {for(i=1;i<=NF;i++) if($i=="dropped") print $(i+1)}' | \
tr -d ',')
GROWTH=$((AFTER - BEFORE))
echo "Sent: $SENT packets"
echo "TC dropped: $DROPPED"
echo "skbuff_head_cache: before=$BEFORE after=$AFTER growth=$GROWTH"
rm -f "$SENDER" "${SENDER}.c"
if [ "$GROWTH" -ge "$LEAK_THRESHOLD" ]; then
echo "FAIL: skb leak detected ($GROWTH objects leaked)"
exit 1
else
echo "PASS: no significant skb leak"
exit 0
fi
^ permalink raw reply
* [PATCH net] net: dsa: mt7530: fix .get_stats64 sleeping in atomic context
From: Daniel Golle @ 2026-04-17 3:55 UTC (permalink / raw)
To: Chester A. Unal, Daniel Golle, Andrew Lunn, Vladimir Oltean,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Matthias Brugger, AngeloGioacchino Del Regno, Russell King,
Christian Marangi, netdev, linux-kernel, linux-arm-kernel,
linux-mediatek
Cc: Frank Wunderlich, John Crispin
The .get_stats64 callback runs in atomic context, but on
MDIO-connected switches every register read acquires the MDIO bus
mutex, which can sleep:
[ 12.645973] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:609
[ 12.654442] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 759, name: grep
[ 12.663377] preempt_count: 0, expected: 0
[ 12.667410] RCU nest depth: 1, expected: 0
[ 12.671511] INFO: lockdep is turned off.
[ 12.675441] CPU: 0 UID: 0 PID: 759 Comm: grep Tainted: G S W 7.0.0+ #0 PREEMPT
[ 12.675453] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[ 12.675456] Hardware name: Bananapi BPI-R64 (DT)
[ 12.675459] Call trace:
[ 12.675462] show_stack+0x14/0x1c (C)
[ 12.675477] dump_stack_lvl+0x68/0x8c
[ 12.675487] dump_stack+0x14/0x1c
[ 12.675495] __might_resched+0x14c/0x220
[ 12.675504] __might_sleep+0x44/0x80
[ 12.675511] __mutex_lock+0x50/0xb10
[ 12.675523] mutex_lock_nested+0x20/0x30
[ 12.675532] mt7530_get_stats64+0x40/0x2ac
[ 12.675542] dsa_user_get_stats64+0x2c/0x40
[ 12.675553] dev_get_stats+0x44/0x1e0
[ 12.675564] dev_seq_printf_stats+0x24/0xe0
[ 12.675575] dev_seq_show+0x14/0x3c
[ 12.675583] seq_read_iter+0x37c/0x480
[ 12.675595] seq_read+0xd0/0xec
[ 12.675605] proc_reg_read+0x94/0xe4
[ 12.675615] vfs_read+0x98/0x29c
[ 12.675625] ksys_read+0x54/0xdc
[ 12.675633] __arm64_sys_read+0x18/0x20
[ 12.675642] invoke_syscall.constprop.0+0x54/0xec
[ 12.675653] do_el0_svc+0x3c/0xb4
[ 12.675662] el0_svc+0x38/0x200
[ 12.675670] el0t_64_sync_handler+0x98/0xdc
[ 12.675679] el0t_64_sync+0x158/0x15c
For MDIO-connected switches, poll MIB counters asynchronously using a
delayed workqueue every second and let .get_stats64 return the cached
values under a per-port spinlock. A mod_delayed_work() call on each
read triggers an immediate refresh so counters stay responsive when
queried more frequently.
MMIO-connected switches (MT7988, EN7581, AN7583) are not affected
because their regmap does not sleep, so they continue to read MIB
counters directly in .get_stats64.
Fixes: 88c810f35ed5 ("net: dsa: mt7530: implement .get_stats64")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
---
This bug highlights a bigger problem, which is also the actual cause:
locking in the mt7530 driver deserves a cleanup and a refactoring
towards using the regmap API cleanly and directly.
I've prepared this already and am going to submit a series doing
most of that using Coccinelle semantic patches once net-next opens
again.
drivers/net/dsa/mt7530.c | 54 +++++++++++++++++++++++++++++++++++++---
drivers/net/dsa/mt7530.h | 6 +++++
2 files changed, 57 insertions(+), 3 deletions(-)
diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index b9423389c2ef0..786d3a8492bcb 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -25,6 +25,8 @@
#include "mt7530.h"
+#define MT7530_STATS_POLL_INTERVAL (1 * HZ)
+
static struct mt753x_pcs *pcs_to_mt753x_pcs(struct phylink_pcs *pcs)
{
return container_of(pcs, struct mt753x_pcs, pcs);
@@ -906,10 +908,9 @@ static void mt7530_get_rmon_stats(struct dsa_switch *ds, int port,
*ranges = mt7530_rmon_ranges;
}
-static void mt7530_get_stats64(struct dsa_switch *ds, int port,
- struct rtnl_link_stats64 *storage)
+static void mt7530_read_port_stats64(struct mt7530_priv *priv, int port,
+ struct rtnl_link_stats64 *storage)
{
- struct mt7530_priv *priv = ds->priv;
uint64_t data;
/* MIB counter doesn't provide a FramesTransmittedOK but instead
@@ -951,6 +952,43 @@ static void mt7530_get_stats64(struct dsa_switch *ds, int port,
&storage->rx_crc_errors);
}
+static void mt7530_stats_poll(struct work_struct *work)
+{
+ struct mt7530_priv *priv = container_of(work, struct mt7530_priv,
+ stats_work.work);
+ struct rtnl_link_stats64 stats = {};
+ struct dsa_port *dp;
+ int port;
+
+ dsa_switch_for_each_user_port(dp, priv->ds) {
+ port = dp->index;
+
+ mt7530_read_port_stats64(priv, port, &stats);
+
+ spin_lock(&priv->stats_lock);
+ priv->ports[port].stats = stats;
+ spin_unlock(&priv->stats_lock);
+ }
+
+ schedule_delayed_work(&priv->stats_work,
+ MT7530_STATS_POLL_INTERVAL);
+}
+
+static void mt7530_get_stats64(struct dsa_switch *ds, int port,
+ struct rtnl_link_stats64 *storage)
+{
+ struct mt7530_priv *priv = ds->priv;
+
+ if (priv->bus) {
+ spin_lock(&priv->stats_lock);
+ *storage = priv->ports[port].stats;
+ spin_unlock(&priv->stats_lock);
+ mod_delayed_work(system_wq, &priv->stats_work, 0);
+ } else {
+ mt7530_read_port_stats64(priv, port, storage);
+ }
+}
+
static void mt7530_get_eth_ctrl_stats(struct dsa_switch *ds, int port,
struct ethtool_eth_ctrl_stats *ctrl_stats)
{
@@ -3137,6 +3175,13 @@ mt753x_setup(struct dsa_switch *ds)
if (ret && priv->irq_domain)
mt7530_free_mdio_irq(priv);
+ if (!ret && priv->bus) {
+ spin_lock_init(&priv->stats_lock);
+ INIT_DELAYED_WORK(&priv->stats_work, mt7530_stats_poll);
+ schedule_delayed_work(&priv->stats_work,
+ MT7530_STATS_POLL_INTERVAL);
+ }
+
return ret;
}
@@ -3404,6 +3449,9 @@ EXPORT_SYMBOL_GPL(mt7530_probe_common);
void
mt7530_remove_common(struct mt7530_priv *priv)
{
+ if (priv->bus)
+ cancel_delayed_work_sync(&priv->stats_work);
+
if (priv->irq_domain)
mt7530_free_mdio_irq(priv);
diff --git a/drivers/net/dsa/mt7530.h b/drivers/net/dsa/mt7530.h
index 3e0090bed298d..44c1dc75baea8 100644
--- a/drivers/net/dsa/mt7530.h
+++ b/drivers/net/dsa/mt7530.h
@@ -796,6 +796,7 @@ struct mt7530_fdb {
* @pvid: The VLAN specified is to be considered a PVID at ingress. Any
* untagged frames will be assigned to the related VLAN.
* @sgmii_pcs: Pointer to PCS instance for SerDes ports
+ * @stats: Cached port statistics for MDIO-connected switches
*/
struct mt7530_port {
bool enable;
@@ -803,6 +804,7 @@ struct mt7530_port {
u32 pm;
u16 pvid;
struct phylink_pcs *sgmii_pcs;
+ struct rtnl_link_stats64 stats;
};
/* Port 5 mode definitions of the MT7530 switch */
@@ -875,6 +877,8 @@ struct mt753x_info {
* @create_sgmii: Pointer to function creating SGMII PCS instance(s)
* @active_cpu_ports: Holding the active CPU ports
* @mdiodev: The pointer to the MDIO device structure
+ * @stats_lock: Protects cached per-port stats from concurrent access
+ * @stats_work: Delayed work for polling MIB counters on MDIO switches
*/
struct mt7530_priv {
struct device *dev;
@@ -900,6 +904,8 @@ struct mt7530_priv {
int (*create_sgmii)(struct mt7530_priv *priv);
u8 active_cpu_ports;
struct mdio_device *mdiodev;
+ spinlock_t stats_lock; /* protects cached stats counters */
+ struct delayed_work stats_work;
};
struct mt7530_hw_vlan_entry {
--
2.53.0
* [net-next v2 1/5] dt-bindings: net: starfive,jh7110-dwmac: Remove JH8100
From: Minda Chen @ 2026-04-17 2:45 UTC (permalink / raw)
To: Alexandre Torgue, Andrew Lunn, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Maxime Coquelin,
Emil Renner Berthing, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, netdev
Cc: linux-kernel, linux-stm32, devicetree, Minda Chen
In-Reply-To: <20260417024523.107786-1-minda.chen@starfivetech.com>
Remove the JH8100 dt-bindings because the SoC is no longer supported.
StarFive has stopped JH8100 development and will release it
outside.
Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
---
.../bindings/net/starfive,jh7110-dwmac.yaml | 28 ++++---------------
1 file changed, 5 insertions(+), 23 deletions(-)
diff --git a/Documentation/devicetree/bindings/net/starfive,jh7110-dwmac.yaml b/Documentation/devicetree/bindings/net/starfive,jh7110-dwmac.yaml
index 313a15331661..0d1962980f57 100644
--- a/Documentation/devicetree/bindings/net/starfive,jh7110-dwmac.yaml
+++ b/Documentation/devicetree/bindings/net/starfive,jh7110-dwmac.yaml
@@ -30,10 +30,6 @@ properties:
- items:
- const: starfive,jh7110-dwmac
- const: snps,dwmac-5.20
- - items:
- - const: starfive,jh8100-dwmac
- - const: starfive,jh7110-dwmac
- - const: snps,dwmac-5.20
reg:
maxItems: 1
@@ -120,25 +116,11 @@ allOf:
minItems: 3
maxItems: 3
- if:
- properties:
- compatible:
- contains:
- const: starfive,jh8100-dwmac
- then:
- properties:
- resets:
- maxItems: 1
-
- reset-names:
- const: stmmaceth
- else:
- properties:
- resets:
- minItems: 2
-
- reset-names:
- minItems: 2
+ resets:
+ minItems: 2
+
+ reset-names:
+ minItems: 2
unevaluatedProperties: false
--
2.17.1
* Re: [PATCH bpf v2 2/2] selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks
From: KaFai Wan @ 2026-04-17 3:07 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: daniel, john.fastabend, sdf, ast, andrii, eddyz87, memxor, song,
yonghong.song, jolsa, davem, edumazet, kuba, pabeni, horms, shuah,
jiayuan.chen, bpf, netdev, linux-kernel, linux-kselftest
In-Reply-To: <2026416184330.-HAW.martin.lau@linux.dev>
On Thu, 2026-04-16 at 12:06 -0700, Martin KaFai Lau wrote:
> On Thu, Apr 16, 2026 at 07:23:08PM +0800, KaFai Wan wrote:
> > index 56685fc03c7e..2d738c0c4259 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
> > @@ -513,6 +513,59 @@ static void misc(void)
> > bpf_link__destroy(link);
> > }
> >
> > +static void hdr_sockopt(void)
> > +{
> > + const char send_msg[] = "MISC!!!";
> > + char recv_msg[sizeof(send_msg)];
> > + const unsigned int nr_data = 2;
> > + struct bpf_link *link;
> > + struct sk_fds sk_fds;
> > + int i, ret, true_val = 1;
> > +
> > + lport_linum_map_fd = bpf_map__fd(misc_skel->maps.lport_linum_map);
> > +
> > + link = bpf_program__attach_cgroup(misc_skel->progs.misc_hdr_sockopt, cg_fd);
> > + if (!ASSERT_OK_PTR(link, "attach_cgroup(misc_hdr_sockopt)"))
> > + return;
> > +
> > + if (sk_fds_connect(&sk_fds, false)) {
> > + bpf_link__destroy(link);
> > + return;
> > + }
> > +
> > + ret = setsockopt(sk_fds.active_fd, SOL_TCP, TCP_NODELAY, &true_val, sizeof(true_val));
> > + if (!ASSERT_OK(ret, "setsockopt(TCP_NODELAY) active"))
> > + goto check_linum;
> > +
> > + ret = setsockopt(sk_fds.passive_fd, SOL_TCP, TCP_NODELAY, &true_val, sizeof(true_val));
>
> Why are these two setsockopt(TCP_NODELAY) calls needed?
>
> Instead of creating a new "void hdr_sockopt(void)", can the test be done in the
> existing "void misc(void)" by doing bpf_setsockopt(TCP_NODELAY) in the
> misc_estab() bpf prog?
Oh, I see. I meant to test on both the active and passive sides. We can only test the active side in
the existing "void misc(void)".
>
> The PASSIVE_ESTABLISHED_CB can do the bpf_setsockopt(TCP_NODELAY, 0)
> if it wants to keep the same expectation on Nagle. The
> BPF_SOCK_OPS_HDR_OPT_LEN_CB and BPF_SOCK_OPS_WRITE_HDR_OPT_CB
> can do bpf_setsockopt(TCP_NODELAY, 1) to test recursion and
> the error return value.
>
> > void test_tcp_hdr_options(void)
> > diff --git a/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c
> > b/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c
> > index d487153a839d..a8cf7c4e7ed2 100644
> > --- a/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c
> > +++ b/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c
> > @@ -28,6 +28,12 @@ unsigned int nr_data = 0;
> > unsigned int nr_syn = 0;
> > unsigned int nr_fin = 0;
> > unsigned int nr_hwtstamp = 0;
> > +unsigned int nr_hdr_sockopt_estab = 0;
> > +unsigned int nr_hdr_sockopt_estab_err = 0;
> > +unsigned int nr_hdr_sockopt_len = 0;
> > +unsigned int nr_hdr_sockopt_len_err = 0;
> > +unsigned int nr_hdr_sockopt_write = 0;
> > +unsigned int nr_hdr_sockopt_write_err = 0;
>
> nr_hdr_sockopt_estab, nr_hdr_sockopt_len, and nr_hdr_sockopt_write
> are unnecessary. These tests have already been covered in some ways.
Yes, they are unnecessary in the existing misc_estab().
>
> Mostly a nit. The new counters are used in both connections. Note the
> existing nr_xxx is exclusively used in either active or passive,
> so there is no parallel counting in practice.
>
> Instead of counting, just use a bool nodelay_est_ok,
> nodelay_hdr_len_err, nodelay_write_err and assert them
> to be true in userspace.
Indeed. I will fix these in the next version.
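A minimal user-space sketch of the flag-based check suggested above, only to make the idea concrete. The three flag names come from the review; the struct, the enum, and both helper functions are hypothetical, and the sketch assumes that "error" simply means a non-zero return from the setsockopt attempt. It is not the selftest code itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the globals the BPF prog would export. */
struct nodelay_results {
	bool nodelay_est_ok;      /* setsockopt succeeded in the established callback */
	bool nodelay_hdr_len_err; /* setsockopt failed in the HDR_OPT_LEN callback */
	bool nodelay_write_err;   /* setsockopt failed in the WRITE_HDR_OPT callback */
};

/* Hypothetical labels for the three callbacks discussed in the review. */
enum cb_kind { CB_ESTABLISHED, CB_HDR_OPT_LEN, CB_WRITE_HDR_OPT };

/* Record one outcome. A flag is only ever set, never counted. */
static void record_nodelay_result(struct nodelay_results *r,
				  enum cb_kind cb, int ret)
{
	switch (cb) {
	case CB_ESTABLISHED:
		if (ret == 0)
			r->nodelay_est_ok = true;
		break;
	case CB_HDR_OPT_LEN:
		if (ret != 0)
			r->nodelay_hdr_len_err = true;
		break;
	case CB_WRITE_HDR_OPT:
		if (ret != 0)
			r->nodelay_write_err = true;
		break;
	}
}

/* What userspace would assert: all three flags must be true. */
static bool nodelay_results_ok(const struct nodelay_results *r)
{
	return r->nodelay_est_ok && r->nodelay_hdr_len_err &&
	       r->nodelay_write_err;
}
```

Because a flag can only go from false to true, it does not matter that both connections run the same callbacks, which is the concern raised about the shared counters.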
--
Thanks,
KaFai
* Re: [PATCH net v2] NFC: digital: bound SENSF response copy into nfc_target
From: Pengpeng Hou @ 2026-04-17 3:06 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Kees Cook, linux-kernel, pengpeng
In-Reply-To: <20260407120004.4-nfc-sensf-v2-pengpeng@iscas.ac.cn>
Hi Jakub,
Thanks, that makes sense.
I won't resend another bounds-only version as-is. I'll first dig into
why the digital path uses a 19-byte `struct digital_sensf_res` while
the core/UAPI path only carries 18 bytes in `sensf_res`, and follow up
once I have a clearer explanation for what the driver/core boundary
should be here. I'll also shorten the `Fixes:` hash in the next
revision.
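To make the size mismatch concrete, here is a small user-space sketch. It is not the proposed fix and not the kernel code: the two sizes are the ones quoted above, while `target_sketch`, `bounded_sensf_copy()` and the macro names are hypothetical. It only shows what a bounds-only clamp does with a 19-byte response.

```c
#include <stddef.h>
#include <string.h>

/* Sizes as quoted in this message; the macro names are made up. */
#define DIGITAL_SENSF_RES_LEN 19 /* digital-path response struct */
#define CORE_SENSF_RES_LEN    18 /* core/UAPI sensf_res buffer */

/* Hypothetical stand-in for the destination fields in the target. */
struct target_sketch {
	unsigned char sensf_res_len;
	unsigned char sensf_res[CORE_SENSF_RES_LEN];
};

/* Copy at most sizeof(t->sensf_res) bytes; return the number copied. */
static size_t bounded_sensf_copy(struct target_sketch *t,
				 const unsigned char *resp, size_t resp_len)
{
	size_t n = resp_len;

	if (n > sizeof(t->sensf_res))
		n = sizeof(t->sensf_res);

	memcpy(t->sensf_res, resp, n);
	t->sensf_res_len = (unsigned char)n;
	return n;
}
```

With a full 19-byte response the clamp silently drops the last byte, which is why the boundary question needs an answer before a bounds-only patch makes sense.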
Thanks,
Pengpeng