* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Breno Leitao @ 2026-06-18 14:57 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Peter Zijlstra, Petr Mladek, Sebastian Andrzej Siewior,
John Ogness, Sergey Senozhatsky, Vlad Poenaru, Thomas Gleixner,
netdev, David S . Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260617132127.645534d1@kernel.org>
On Wed, Jun 17, 2026 at 01:21:27PM -0700, Jakub Kicinski wrote:
> On Wed, 17 Jun 2026 07:56:50 -0700 Breno Leitao wrote:
> > As far as I can tell, there isn't a network driver today whose transmit
> > path is completely lockless, so, even if we make netpoll lockless.
> >
> > It's unlikely any NIC will ever achieve this, given that NIC TX
> > fundamentally relies on a shared DMA ring and doorbell register, which
> > inherently cannot be made lockless.
>
> The lock which protects the queue is maintained by the stack,
> and we trylock it. Maybe I lost the thread but if you're saying
> that writes to netconsole are impossible from arbitrary context,
> that is _not_ true, AFAIU. We can queue a packet and kick off
> the transfer on well-behaved drivers.
>
> Main problem is the opportunistic freeing up of the queue space.
> If we could avoid that in atomic context I think we'd be good.
Thanks for the clarification, this is quite valuable.
Let me verify my understanding: if we switched to __raise_softirq_irqoff()
in dev_kfree_skb_irq_reason(), the issue would be resolved since we'd
avoid waking ksoftirqd and therefore wouldn't touch the runqueue lock in this
code path.
However, while that would eliminate the nested lock problem, it could
increase memory pressure by delaying SKB garbage collection, which may
not be acceptable.
Naive question: What if we deferred SKB cleanup only during netpoll operations?
Such as tracking in_netpoll per cpu:
struct softnet_data {
....
+ bool in_netpoll;
}
and then choosing between __raise_softirq_irqoff() and raise_softirq_irqoff()?
@@ -3456,7 +3456,13 @@ void dev_kfree_skb_irq_reason(struct sk_buff *skb, enum skb_drop_reason reason)
local_irq_save(flags);
skb->next = __this_cpu_read(softnet_data.completion_queue);
__this_cpu_write(softnet_data.completion_queue, skb);
- raise_softirq_irqoff(NET_TX_SOFTIRQ);
+ if (__this_cpu_read(softnet_data.in_netpoll))
+ __raise_softirq_irqoff(NET_TX_SOFTIRQ);
+ else
+ raise_softirq_irqoff(NET_TX_SOFTIRQ);
local_irq_restore(flags);
}
Is it too hacky!?
Thanks,
--breno
^ permalink raw reply
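The idea floated in the message above can be modelled in user space. The sketch below is only an illustration: struct cpu_model, the model_* helpers and MODEL_NET_TX_PENDING are hypothetical names, and the "wake the softirq thread only when not in interrupt context" rule is a simplifying assumption about raise_softirq_irqoff(), not kernel code.

```c
#include <stdbool.h>

/* Hypothetical pending bit standing in for NET_TX_SOFTIRQ. */
#define MODEL_NET_TX_PENDING (1u << 2)

/* Hypothetical per-CPU state; in_netpoll mirrors the proposed flag. */
struct cpu_model {
	unsigned int pending;   /* pending softirq bits */
	bool in_netpoll;        /* set while netpoll is transmitting */
	bool in_interrupt;      /* assumed context indicator */
	unsigned int wakeups;   /* how often the softirq thread was woken */
};

/* Models __raise_softirq_irqoff(): mark pending, never wake a thread. */
static void model_raise_nowake(struct cpu_model *c)
{
	c->pending |= MODEL_NET_TX_PENDING;
}

/* Models raise_softirq_irqoff(): mark pending and, outside interrupt
 * context (assumption), wake the softirq thread, which is the step that
 * would take the runqueue lock.
 */
static void model_raise(struct cpu_model *c)
{
	model_raise_nowake(c);
	if (!c->in_interrupt)
		c->wakeups++;
}

/* Models the proposed branch in dev_kfree_skb_irq_reason(). */
static void model_kfree_skb_irq(struct cpu_model *c)
{
	if (c->in_netpoll)
		model_raise_nowake(c);
	else
		model_raise(c);
}
```

In this model the completion is still marked pending in both branches; only the wakeup is skipped while in_netpoll is set, so cleanup is deferred until something else runs the softirq.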
* [PATCH net v3] net/mlx5e: macsec: fix use-after-free of metadata_dst on RX SC delete
From: Doruk Tan Ozturk @ 2026-06-18 14:55 UTC (permalink / raw)
To: saeedm, leon, tariqt, mbloch, sd, andrew+netdev, davem, edumazet,
kuba, pabeni
Cc: borisp, raeds, ehakim, netdev, linux-rdma, linux-kernel, stable,
horms
When an offloaded MACsec RX SC is deleted, macsec_del_rxsc_ctx() released
the per-SC metadata_dst with metadata_dst_free(), which calls kfree()
unconditionally and ignores the dst reference count. The RX datapath in
mlx5e_macsec_offload_handle_rx_skb() looks up the SC under rcu_read_lock()
via xa_load() and, while still holding only the RCU read lock, takes a
reference with dst_hold() and attaches the dst to the skb with
skb_dst_set().
A reader that has already obtained the rx_sc pointer can therefore race
with the delete path:
CPU0 (del_rxsc) CPU1 (rx datapath)
-------------- ------------------
rcu_read_lock();
rx_sc = xa_load(...)->rx_sc;
xa_erase(...);
metadata_dst_free(rx_sc->md_dst); /* kfree(), ignores refcount */
dst_hold(&rx_sc->md_dst->dst); /* UAF */
skb_dst_set(skb, &rx_sc->md_dst->dst);
metadata_dst_free() frees the object even though the datapath still holds
(or is about to take) a reference, so the subsequent dst_hold() /
skb_dst_set() and the later skb free operate on freed memory.
Fix the owner side by dropping the reference with dst_release() instead of
freeing unconditionally. dst_release() only schedules the RCU-deferred
dst_destroy() once the reference count reaches zero, so a concurrent reader
that still holds a reference keeps the object alive.
Dropping the owner reference is not sufficient on its own: once the owner
reference is the last one, dst_release() drops the count to zero and the
destroy is merely RCU-deferred. A racing reader that runs plain dst_hold()
on that already-dead dst gets rcuref_get() == false but dst_hold() only
WARNs and attaches the dying dst to the skb anyway; the later skb free then
calls dst_release() on an object whose destroy is already scheduled, again
a use-after-free.
Convert the RX datapath to dst_hold_safe(), which returns false (without
warning) when the dst is already dead, and only attach it to the skb when a
reference was successfully taken. When the SC is being deleted the in-flight
packet simply proceeds without the offload metadata_dst: skb_metadata_dst()
returns NULL, the MACsec core sees !is_macsec_md_dst and skips this secy
(rx_uses_md_dst path), which is the correct behaviour for a packet whose SC
is going away.
While reworking the datapath lookup, also guard the two NULL dereferences
on the same path that an automated review (forwarded by Simon Horman)
flagged: xa_load() can return NULL when the fs_id has just been erased, and
mlx5e_macsec_add_rxsc() publishes sc_xarray_element via xa_alloc() before
rx_sc->md_dst is allocated, so a packet carrying a freshly recycled fs_id
can observe a non-NULL rx_sc whose md_dst is still NULL. Check both before
dereferencing.
Note: macsec_del_rxsc_ctx() also kfree()s rx_sc->sc_xarray_element without
an RCU grace period while the same datapath reads it under rcu_read_lock();
that is a separate pre-existing issue and is left to a follow-up patch.
Fixes: b7c9400cbc48 ("net/mlx5e: Implement MACsec Rx data path using MACsec skb_metadata_dst")
Cc: stable@vger.kernel.org
Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
---
v3:
- Also guard the RX-datapath NULL dereferences flagged by the automated
review: NULL-check the xa_load() result and rx_sc->md_dst before use.
- Note the unrelated non-RCU kfree(sc_xarray_element) in the delete path
as a separate follow-up rather than folding it in here.
v2:
- Convert the RX datapath dst_hold() to dst_hold_safe() so a reader racing
the SC delete cannot attach a dst whose last reference was just dropped
(per the automated review forwarded by Simon Horman).
v1: https://lore.kernel.org/netdev/20260615140534.52691-1-doruk@0sec.ai/
.../net/ethernet/mellanox/mlx5/core/en_accel/macsec.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
index 71b3a05..fb2c64d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
@@ -829,7 +829,7 @@ static void macsec_del_rxsc_ctx(struct mlx5e_macsec *macsec, struct mlx5e_macsec
*/
list_del_rcu(&rx_sc->rx_sc_list_element);
xa_erase(&macsec->sc_xarray, rx_sc->sc_xarray_element->fs_id);
- metadata_dst_free(rx_sc->md_dst);
+ dst_release(&rx_sc->md_dst->dst);
kfree(rx_sc->sc_xarray_element);
kfree_rcu_mightsleep(rx_sc);
}
@@ -1695,10 +1695,10 @@ void mlx5e_macsec_offload_handle_rx_skb(struct net_device *netdev,
rcu_read_lock();
sc_xarray_element = xa_load(&macsec->sc_xarray, fs_id);
- rx_sc = sc_xarray_element->rx_sc;
- if (rx_sc) {
- dst_hold(&rx_sc->md_dst->dst);
- skb_dst_set(skb, &rx_sc->md_dst->dst);
+ rx_sc = sc_xarray_element ? sc_xarray_element->rx_sc : NULL;
+ if (rx_sc && rx_sc->md_dst) {
+ if (dst_hold_safe(&rx_sc->md_dst->dst))
+ skb_dst_set(skb, &rx_sc->md_dst->dst);
}
rcu_read_unlock();
--
2.53.0
^ permalink raw reply related
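The difference between a plain hold and a "safe" hold described in the patch above can be shown with a small user-space reference counter. This is a minimal sketch using C11 atomics; struct obj_ref and the obj_ref_* helpers are hypothetical names and are not the kernel's rcuref or dst implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter; the owner starts with one reference. */
struct obj_ref {
	atomic_int count;
};

static void obj_ref_init(struct obj_ref *r)
{
	atomic_init(&r->count, 1);
}

/* Take a reference only if the object is still alive (count > 0).
 * Returns false, without touching the count, once it has reached zero.
 * This mirrors the role dst_hold_safe() plays in the datapath.
 */
static bool obj_ref_get_unless_zero(struct obj_ref *r)
{
	int old = atomic_load(&r->count);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference. Returns true when this was the last one, i.e. the
 * caller is now responsible for (deferred) destruction.
 */
static bool obj_ref_put(struct obj_ref *r)
{
	return atomic_fetch_sub(&r->count, 1) == 1;
}
```

A reader that gets false from obj_ref_get_unless_zero() must not attach the object anywhere, which corresponds to the patched datapath skipping skb_dst_set() when the hold fails.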
* [PATCH net v2] ice: eswitch: fix use-after-free of metadata_dst in repr release
From: Doruk Tan Ozturk @ 2026-06-18 14:50 UTC (permalink / raw)
To: anthony.l.nguyen, przemyslaw.kitszel, andrew+netdev, davem,
edumazet, kuba, pabeni
Cc: michal.swiatkowski, wojciech.drewek, intel-wired-lan, netdev,
linux-kernel, stable, horms
ice_eswitch_release_repr() frees the port representor metadata_dst via
metadata_dst_free(), which directly kfree()s the object and ignores the
dst_entry refcount. The eswitch slow-path TX routine
ice_eswitch_port_start_xmit() takes a reference on this dst with
dst_hold() and attaches it to the skb via skb_dst_set(). If such an skb
is still in flight (e.g. queued in a qdisc) when the representor is torn
down, the metadata_dst is freed while the skb still points at it. When
the skb is later freed, dst_release() operates on already-freed memory.
Replace metadata_dst_free() with dst_release() so the metadata_dst is
freed only after the last reference is dropped. The dst subsystem frees
metadata_dst objects from dst_destroy() once the refcount reaches zero
(DST_METADATA is set by metadata_dst_alloc()).
Same class of bug and fix as commit c32b26aaa2f9 ("netfilter:
nft_tunnel: fix use-after-free on object destroy").
Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment")
Cc: stable@vger.kernel.org
Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
Reviewed-by: Simon Horman <horms@kernel.org>
---
v2:
- Correct the Fixes: tag to 1a1c40df2e80 ("ice: set and release
switchdev environment"); the previously cited fff292b47ac1 only moved
the affected code rather than introducing the unbalanced free, and the
bug dates back to when switchdev support was added (Simon Horman).
- Add Simon Horman's Reviewed-by. No functional change.
drivers/net/ethernet/intel/ice/ice_eswitch.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
index 2e4f0969035f..41b30a7ca4a9 100644
--- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
+++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
@@ -95,7 +95,7 @@ ice_eswitch_release_repr(struct ice_pf *pf, struct ice_repr *repr)
return;
ice_vsi_update_security(vsi, ice_vsi_ctx_set_antispoof);
- metadata_dst_free(repr->dst);
+ dst_release(&repr->dst->dst);
repr->dst = NULL;
ice_fltr_add_mac_and_broadcast(vsi, repr->parent_mac,
ICE_FWD_TO_VSI);
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net] eth: fbnic: take netif_addr_lock_bh() around rx mode address programming
From: Simon Horman @ 2026-06-18 14:50 UTC (permalink / raw)
To: Daniel Zahka
Cc: Alexander Duyck, Jakub Kicinski, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Sanman Pradhan,
netdev, linux-kernel
In-Reply-To: <20260617-linux-fbnic-hwaddr-v1-1-3f9f5dee7f99@gmail.com>
On Wed, Jun 17, 2026 at 03:39:49AM -0700, Daniel Zahka wrote:
> When __fbnic_set_rx_mode() is called from contexts other than
> .ndo_set_rx_mode_async(), the uc and mc addr lists are accessed
> without the addr lock that __hw_addr_sync_dev() and
> __hw_addr_unsync_dev() require. Wrap these unprotected accesses with
> netif_addr_lock_bh(). fbnic_clear_rx_mode() has similar issues.
>
> Fixes: eb690ef8d1c2 ("eth: fbnic: Add L2 address programming")
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
^ permalink raw reply
* [RFC v3 net-next] net: airoha: add HW GRO offload support
From: Lorenzo Bianconi @ 2026-06-18 14:42 UTC (permalink / raw)
To: andrew+netdev, davem, edumazet, kuba, pabeni
Cc: lorenzo, aleksander.lobakin, linux-arm-kernel, linux-mediatek,
netdev
Add hardware GRO offload support to the airoha_eth driver, leveraging
the EN7581/AN7583 SoC's 8 dedicated LRO hardware queues mapped to RX
queues 24-31. HW GRO offloading does not support Scatter-Gather (SG) so
it is required to increase the page_pool allocation order to 2 for RX
queues 24-31 (LRO queues).
Since HW GRO is configured per-QDMA and shared across all devices using
it, HW GRO is mutually exclusive with multiple active devices on the
same QDMA block. Call netdev_update_features() on sibling devices in
ndo_open/ndo_stop so that NETIF_F_GRO_HW availability is re-evaluated
when the QDMA user count changes.
Set CHECKSUM_PARTIAL with pseudo-header checksum on aggregated packets
so that L3-forwarded traffic is correctly handled by the GSO/TSO path
on the egress device.
Performance comparison between GRO and HW GRO has been carried out using
a 10Gbps NIC:
GRO: ~2.7 Gbps
HW GRO: ~8.1 Gbps
Tested-by: Madhur Agrawal <madhur.agrawal@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
Changes in RFC v3:
- Add missing TCP header length check.
- Fix TCP checksum calculation.
- Disable LRO when running the ndo_stop callback.
- Implement packet header split in order to support HW-GRO
- Link to v2: https://lore.kernel.org/r/20260610-airoha-eth-lro-v2-1-54be99b9a2d5@kernel.org
Changes in v2:
- Rebase on top of net-next main branch.
- Link to v1: https://lore.kernel.org/r/20260606-airoha-eth-lro-v1-1-0ebceb0eafc3@kernel.org
Changes in v1:
- Please note this patch depends on the following patch not applied yet
to net-next
https://lore.kernel.org/netdev/20260606-airoha_qdma_users-no-atomic-v1-1-86e2d6a1bfaf@kernel.org/T/#u
- Restrict LRO to single user QDMA.
- Introduce some more sanity checks.
- Disable scatter-gather for LRO queues.
- Run netif_receive_skb() for LRO packets.
- Link to v3: https://lore.kernel.org/r/20260528-airoha-eth-lro-v3-1-dd09c1fb000e@kernel.org
Changes in RFC v3:
- Fix double-free of the page_pool if airoha_qdma_lro_rx_process()
fails.
- Set AIROHA_LRO_PAGE_ORDER according to PAGE_SIZE.
- Add missing gso metadata for the LRO packet.
- Link to v2: https://lore.kernel.org/r/20260526-airoha-eth-lro-v2-1-24e2a9e7a397@kernel.org
Changes in RFC v2:
- Improve performance by fixing buf_size computation.
- Fix possible overflow in REG_CDM_LRO_LIMIT() register configuration.
- Require the device to be not running before configuring LRO.
- Fix configuration order in airoha_fe_lro_is_enabled().
- Check skb header length in airoha_qdma_lro_rx_process().
- Do not check net_device feature in airoha_qdma_rx_process() before
executing airoha_qdma_lro_rx_process() but rely on
airoha_qdma_lro_rx_process() logic.
- Fix possible double recycle in airoha_qdma_rx_process() for LRO
packets.
- Always use AIROHA_RXQ_LRO_MAX_AGG_COUNT macro for max LRO aggregated
fragments in airoha_fe_lro_init_rx_queue().
- Link to v1: https://lore.kernel.org/r/20260520-airoha-eth-lro-v1-1-129cc33766e9@kernel.org
---
drivers/net/ethernet/airoha/airoha_eth.c | 364 ++++++++++++++++++++--
drivers/net/ethernet/airoha/airoha_eth.h | 24 ++
drivers/net/ethernet/airoha/airoha_regs.h | 22 +-
3 files changed, 386 insertions(+), 24 deletions(-)
diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 64dde6464f3f..2aa6915d424e 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -10,8 +10,10 @@
#include <linux/tcp.h>
#include <linux/u64_stats_sync.h>
#include <net/dst_metadata.h>
+#include <net/ip6_checksum.h>
#include <net/page_pool/helpers.h>
#include <net/pkt_cls.h>
+#include <net/tcp.h>
#include <uapi/linux/ppp_defs.h>
#include "airoha_regs.h"
@@ -486,6 +488,73 @@ static void airoha_fe_crsn_qsel_init(struct airoha_eth *eth)
CDM_CRSN_QSEL_Q1));
}
+static void airoha_fe_lro_rxq_enable(struct airoha_eth *eth, int qdma_id,
+ int lro_queue_index, int qid,
+ int buf_size)
+{
+ int id = qdma_id + 1;
+
+ airoha_fe_rmw(eth, REG_CDM_LRO_LIMIT(id),
+ CDM_LRO_AGG_NUM_MASK | CDM_LRO_AGG_SIZE_MASK,
+ FIELD_PREP(CDM_LRO_AGG_SIZE_MASK, buf_size) |
+ FIELD_PREP(CDM_LRO_AGG_NUM_MASK,
+ AIROHA_RXQ_LRO_MAX_AGG_COUNT));
+ airoha_fe_rmw(eth, REG_CDM_LRO_AGE_TIME(id),
+ CDM_LRO_AGE_TIME_MASK | CDM_LRO_AGG_TIME_MASK,
+ FIELD_PREP(CDM_LRO_AGE_TIME_MASK,
+ AIROHA_RXQ_LRO_MAX_AGE_TIME) |
+ FIELD_PREP(CDM_LRO_AGG_TIME_MASK,
+ AIROHA_RXQ_LRO_MAX_AGG_TIME));
+ airoha_fe_rmw(eth, REG_CDM_LRO_RXQ(id, lro_queue_index),
+ LRO_RXQ_MASK(lro_queue_index),
+ __field_prep(LRO_RXQ_MASK(lro_queue_index), qid));
+ airoha_fe_set(eth, REG_CDM_LRO_EN(id), BIT(lro_queue_index));
+}
+
+static void airoha_fe_lro_disable(struct airoha_eth *eth, int qdma_id)
+{
+ int i, id = qdma_id + 1;
+
+ airoha_fe_clear(eth, REG_CDM_LRO_EN(id), LRO_RXQ_EN_MASK);
+ airoha_fe_clear(eth, REG_CDM_LRO_LIMIT(id),
+ CDM_LRO_AGG_NUM_MASK | CDM_LRO_AGG_SIZE_MASK);
+ airoha_fe_clear(eth, REG_CDM_LRO_AGE_TIME(id),
+ CDM_LRO_AGE_TIME_MASK | CDM_LRO_AGG_TIME_MASK);
+ for (i = 0; i < AIROHA_MAX_NUM_LRO_QUEUES; i++)
+ airoha_fe_clear(eth, REG_CDM_LRO_RXQ(id, i), LRO_RXQ_MASK(i));
+}
+
+static bool airoha_fe_lro_is_enabled(struct airoha_eth *eth, int qdma_id)
+{
+ return airoha_fe_get(eth, REG_CDM_LRO_EN(qdma_id + 1),
+ LRO_RXQ_EN_MASK);
+}
+
+static void airoha_dev_lro_enable(struct airoha_gdm_dev *dev)
+{
+ struct airoha_qdma *qdma = dev->qdma;
+ struct airoha_eth *eth = qdma->eth;
+ int qdma_id = qdma - &eth->qdma[0];
+ int i, lro_queue_index = 0;
+
+ for (i = 0; i < ARRAY_SIZE(qdma->q_rx); i++) {
+ struct airoha_queue *q = &qdma->q_rx[i];
+ u32 size;
+
+ if (!q->ndesc)
+ continue;
+
+ if (!airoha_qdma_is_lro_queue(q))
+ continue;
+
+ size = SKB_WITH_OVERHEAD(AIROHA_RX_LEN(q->buf_size));
+ size = min_t(u32, size, CDM_LRO_AGG_SIZE_MASK);
+ airoha_fe_lro_rxq_enable(eth, qdma_id, lro_queue_index, i,
+ size);
+ lro_queue_index++;
+ }
+}
+
static int airoha_fe_init(struct airoha_eth *eth)
{
airoha_fe_maccr_init(eth);
@@ -611,6 +680,7 @@ static int airoha_qdma_fill_rx_queue(struct airoha_queue *q)
e->dma_addr = page_pool_get_dma_addr(page) + offset;
e->dma_len = SKB_WITH_OVERHEAD(AIROHA_RX_LEN(q->buf_size));
+ WRITE_ONCE(desc->tcp_ts_reply, 0);
val = FIELD_PREP(QDMA_DESC_LEN_MASK, e->dma_len);
WRITE_ONCE(desc->ctrl, cpu_to_le32(val));
WRITE_ONCE(desc->addr, cpu_to_le32(e->dma_addr));
@@ -652,12 +722,173 @@ airoha_qdma_get_gdm_dev(struct airoha_eth *eth, struct airoha_qdma_desc *desc)
return port->devs[d] ? port->devs[d] : ERR_PTR(-ENODEV);
}
+static struct sk_buff *airoha_qdma_lro_rx_skb(struct airoha_queue *q,
+ struct airoha_qdma_desc *desc,
+ struct airoha_queue_entry *e)
+{
+ u32 len, th_off, tcp_ack_seq, agg_count, data_off, data_len;
+ u32 desc_ctrl = le32_to_cpu(READ_ONCE(desc->ctrl));
+ u32 msg1 = le32_to_cpu(READ_ONCE(desc->msg1));
+ u32 msg2 = le32_to_cpu(READ_ONCE(desc->msg2));
+ u32 msg3 = le32_to_cpu(READ_ONCE(desc->msg3));
+ struct skb_shared_info *shinfo;
+ u16 tcp_win, l2_len;
+ struct sk_buff *skb;
+ struct tcphdr *th;
+ struct page *page;
+ bool ipv4, ipv6;
+
+ ipv4 = FIELD_GET(QDMA_ETH_RXMSG_IP4_MASK, msg1);
+ ipv6 = FIELD_GET(QDMA_ETH_RXMSG_IP6_MASK, msg1);
+ if (!ipv4 && !ipv6)
+ return NULL;
+
+ l2_len = FIELD_GET(QDMA_ETH_RXMSG_L2_LEN_MASK, msg2);
+ len = FIELD_GET(QDMA_DESC_LEN_MASK, desc_ctrl);
+
+ if (ipv4) {
+ struct iphdr *iph;
+
+ if (len < l2_len + sizeof(*iph))
+ return NULL;
+
+ iph = (struct iphdr *)(e->buf + l2_len);
+ if (iph->protocol != IPPROTO_TCP)
+ return NULL;
+
+ if (iph->ihl < 5)
+ return NULL;
+
+ th_off = l2_len + (iph->ihl << 2);
+ if (len < th_off)
+ return NULL;
+
+ iph->tot_len = cpu_to_be16(len - l2_len);
+ iph->check = 0;
+ iph->check = ip_fast_csum((void *)iph, iph->ihl);
+ } else {
+ struct ipv6hdr *ip6h;
+
+ th_off = l2_len + sizeof(*ip6h);
+ if (len < th_off)
+ return NULL;
+
+ ip6h = (struct ipv6hdr *)(e->buf + l2_len);
+ if (ip6h->nexthdr != NEXTHDR_TCP)
+ return NULL;
+
+ ip6h->payload_len = cpu_to_be16(len - th_off);
+ }
+
+ if (len < th_off + sizeof(*th))
+ return NULL;
+
+ th = (struct tcphdr *)(e->buf + th_off);
+ if (th->doff < 5)
+ return NULL;
+
+ data_off = th_off + (th->doff << 2);
+ if (len < data_off)
+ return NULL;
+
+ tcp_win = FIELD_GET(QDMA_ETH_RXMSG_TCP_WIN_MASK, msg3);
+ tcp_ack_seq = le32_to_cpu(READ_ONCE(desc->data));
+ th->ack_seq = cpu_to_be32(tcp_ack_seq);
+ th->window = cpu_to_be16(tcp_win);
+
+ /* Check tcp timestamp option */
+ if (th->doff == (sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) / 4) {
+ u32 topt = get_unaligned_be32(th + 1);
+
+ if (topt == ((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
+ (TCPOPT_TIMESTAMP << 8) | TCPOLEN_TIMESTAMP)) {
+ u8 *ptr = (u8 *)th + sizeof(*th) + 2 * sizeof(__be32);
+ __le32 tcp_ts_reply = READ_ONCE(desc->tcp_ts_reply);
+
+ put_unaligned_be32(le32_to_cpu(tcp_ts_reply), ptr);
+ }
+ }
+
+ if (ipv4) {
+ struct iphdr *iph = (struct iphdr *)(e->buf + l2_len);
+
+ th->check = ~tcp_v4_check(len - th_off, iph->saddr,
+ iph->daddr, 0);
+ } else {
+ struct ipv6hdr *ip6h = (struct ipv6hdr *)(e->buf + l2_len);
+
+ th->check = ~tcp_v6_check(len - th_off, &ip6h->saddr,
+ &ip6h->daddr, 0);
+ }
+
+ /* Split network headers and payload to rely on GRO.
+ * We need to do it in the driver since the NIC does
+ * not support it.
+ */
+ skb = napi_alloc_skb(&q->napi, data_off);
+ if (!skb)
+ return NULL;
+
+ __skb_put(skb, data_off);
+ memcpy(skb->data, e->buf, data_off);
+
+ page = virt_to_head_page(e->buf);
+ data_len = len - data_off;
+ shinfo = skb_shinfo(skb);
+ skb_add_rx_frag(skb, shinfo->nr_frags, page,
+ e->buf + data_off - page_address(page), data_len,
+ q->buf_size);
+
+ shinfo->gso_type = ipv4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
+ agg_count = FIELD_GET(QDMA_ETH_RXMSG_AGG_COUNT_MASK, msg2);
+ shinfo->gso_size = DIV_ROUND_UP(data_len, agg_count);
+ shinfo->gso_segs = agg_count;
+
+ skb->csum_start = skb_headroom(skb) + th_off;
+ skb->csum_offset = offsetof(struct tcphdr, check);
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ return skb;
+}
+
+static struct sk_buff *airoha_qdma_build_rx_skb(struct airoha_queue *q,
+ struct airoha_qdma_desc *desc,
+ struct airoha_queue_entry *e,
+ struct net_device *dev)
+{
+ u32 msg2 = le32_to_cpu(READ_ONCE(desc->msg2));
+ int qid = q - &q->qdma->q_rx[0];
+ struct sk_buff *skb;
+
+ if (FIELD_GET(QDMA_ETH_RXMSG_AGG_COUNT_MASK, msg2) > 1) { /* LRO */
+ skb = airoha_qdma_lro_rx_skb(q, desc, e);
+ if (!skb)
+ return NULL;
+ } else {
+ u32 desc_ctrl = le32_to_cpu(READ_ONCE(desc->ctrl));
+ u32 len = FIELD_GET(QDMA_DESC_LEN_MASK, desc_ctrl);
+
+ skb = napi_build_skb(e->buf - AIROHA_RX_HEADROOM, q->buf_size);
+ if (!skb)
+ return NULL;
+
+ skb_reserve(skb, AIROHA_RX_HEADROOM);
+ __skb_put(skb, len);
+ skb->ip_summed = CHECKSUM_UNNECESSARY;
+ }
+
+ skb_mark_for_recycle(skb);
+ skb->dev = dev;
+ skb_record_rx_queue(skb, qid);
+ skb->protocol = eth_type_trans(skb, dev);
+
+ return skb;
+}
+
static int airoha_qdma_rx_process(struct airoha_queue *q, int budget)
{
enum dma_data_direction dir = page_pool_get_dma_dir(q->page_pool);
- struct airoha_qdma *qdma = q->qdma;
- struct airoha_eth *eth = qdma->eth;
- int qid = q - &qdma->q_rx[0];
+ struct airoha_eth *eth = q->qdma->eth;
int done = 0;
while (done < budget) {
@@ -693,18 +924,9 @@ static int airoha_qdma_rx_process(struct airoha_queue *q, int budget)
netdev = netdev_from_priv(dev);
if (!q->skb) { /* first buffer */
- q->skb = napi_build_skb(e->buf - AIROHA_RX_HEADROOM,
- q->buf_size);
+ q->skb = airoha_qdma_build_rx_skb(q, desc, e, netdev);
if (!q->skb)
goto free_frag;
-
- skb_reserve(q->skb, AIROHA_RX_HEADROOM);
- __skb_put(q->skb, len);
- skb_mark_for_recycle(q->skb);
- q->skb->dev = netdev;
- q->skb->protocol = eth_type_trans(q->skb, netdev);
- q->skb->ip_summed = CHECKSUM_UNNECESSARY;
- skb_record_rx_queue(q->skb, qid);
} else { /* scattered frame */
struct skb_shared_info *shinfo = skb_shinfo(q->skb);
int nr_frags = shinfo->nr_frags;
@@ -795,12 +1017,10 @@ static int airoha_qdma_rx_napi_poll(struct napi_struct *napi, int budget)
static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
struct airoha_qdma *qdma, int ndesc)
{
- const struct page_pool_params pp_params = {
- .order = 0,
+ struct page_pool_params pp_params = {
.pool_size = 256,
.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
.dma_dir = DMA_FROM_DEVICE,
- .max_len = PAGE_SIZE,
.nid = NUMA_NO_NODE,
.dev = qdma->eth->dev,
.napi = &q->napi,
@@ -808,9 +1028,10 @@ static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
struct airoha_eth *eth = qdma->eth;
int qid = q - &qdma->q_rx[0], thr;
dma_addr_t dma_addr;
+ bool lro_q;
- q->buf_size = PAGE_SIZE / 2;
q->qdma = qdma;
+ lro_q = airoha_qdma_is_lro_queue(q);
q->entry = devm_kzalloc(eth->dev, ndesc * sizeof(*q->entry),
GFP_KERNEL);
@@ -822,6 +1043,9 @@ static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
if (!q->desc)
return -ENOMEM;
+ pp_params.order = lro_q ? AIROHA_LRO_PAGE_ORDER : 0;
+ pp_params.max_len = PAGE_SIZE << pp_params.order;
+
q->page_pool = page_pool_create(&pp_params);
if (IS_ERR(q->page_pool)) {
int err = PTR_ERR(q->page_pool);
@@ -830,6 +1054,7 @@ static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
return err;
}
+ q->buf_size = lro_q ? pp_params.max_len : pp_params.max_len / 2;
q->ndesc = ndesc;
netif_napi_add(eth->napi_dev, &q->napi, airoha_qdma_rx_napi_poll);
@@ -843,7 +1068,12 @@ static int airoha_qdma_init_rx_queue(struct airoha_queue *q,
FIELD_PREP(RX_RING_THR_MASK, thr));
airoha_qdma_rmw(qdma, REG_RX_DMA_IDX(qid), RX_RING_DMA_IDX_MASK,
FIELD_PREP(RX_RING_DMA_IDX_MASK, q->head));
- airoha_qdma_set(qdma, REG_RX_SCATTER_CFG(qid), RX_RING_SG_EN_MASK);
+ if (lro_q)
+ airoha_qdma_clear(qdma, REG_RX_SCATTER_CFG(qid),
+ RX_RING_SG_EN_MASK);
+ else
+ airoha_qdma_set(qdma, REG_RX_SCATTER_CFG(qid),
+ RX_RING_SG_EN_MASK);
airoha_qdma_fill_rx_queue(q);
@@ -865,6 +1095,7 @@ static void airoha_qdma_cleanup_rx_queue(struct airoha_queue *q)
page_pool_get_dma_dir(q->page_pool));
page_pool_put_full_page(q->page_pool, page, false);
/* Reset DMA descriptor */
+ WRITE_ONCE(desc->tcp_ts_reply, 0);
WRITE_ONCE(desc->ctrl, 0);
WRITE_ONCE(desc->addr, 0);
WRITE_ONCE(desc->data, 0);
@@ -1802,6 +2033,37 @@ static void airoha_update_hw_stats(struct airoha_gdm_dev *dev)
spin_unlock(&port->stats_lock);
}
+static void airoha_update_netdev_features(struct airoha_gdm_dev *dev)
+{
+ struct airoha_eth *eth = dev->eth;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
+ struct airoha_gdm_port *port = eth->ports[i];
+ int j;
+
+ if (!port)
+ continue;
+
+ for (j = 0; j < ARRAY_SIZE(port->devs); j++) {
+ struct airoha_gdm_dev *iter_dev = port->devs[j];
+ struct net_device *netdev;
+
+ if (!iter_dev || iter_dev == dev)
+ continue;
+
+ if (iter_dev->qdma != dev->qdma)
+ continue;
+
+ netdev = netdev_from_priv(iter_dev);
+ if (netdev->reg_state != NETREG_REGISTERED)
+ continue;
+
+ netdev_update_features(netdev);
+ }
+ }
+}
+
static int airoha_dev_open(struct net_device *netdev)
{
int err, len = ETH_HLEN + netdev->mtu + ETH_FCS_LEN;
@@ -1809,6 +2071,17 @@ static int airoha_dev_open(struct net_device *netdev)
struct airoha_gdm_port *port = dev->port;
u32 cur_len, pse_port = FE_PSE_PORT_PPE1;
struct airoha_qdma *qdma = dev->qdma;
+ int qdma_id = qdma - &qdma->eth->qdma[0];
+
+ /* HW GRO is configured on the QDMA and it is shared between
+ * all the devices using it. Refuse to open a second device on
+ * the same QDMA if HW GRO is enabled on any device sharing it.
+ */
+ if (qdma->users && airoha_fe_lro_is_enabled(qdma->eth, qdma_id)) {
+ netdev_warn(netdev, "required to disable HW GRO on QDMA%d\n",
+ qdma_id);
+ return -EBUSY;
+ }
netif_tx_start_all_queues(netdev);
err = airoha_set_vip_for_gdm_port(dev, true);
@@ -1848,6 +2121,11 @@ static int airoha_dev_open(struct net_device *netdev)
airoha_set_gdm_port_fwd_cfg(qdma->eth, REG_GDM_FWD_CFG(port->id),
pse_port);
+ if (netdev->features & NETIF_F_GRO_HW)
+ airoha_dev_lro_enable(dev);
+
+ airoha_update_netdev_features(dev);
+
return 0;
}
@@ -1895,6 +2173,9 @@ static int airoha_dev_stop(struct net_device *netdev)
FE_PSE_PORT_DROP);
if (!--qdma->users) {
+ int qdma_id = qdma - &qdma->eth->qdma[0];
+
+ airoha_fe_lro_disable(qdma->eth, qdma_id);
airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
GLOBAL_CFG_TX_DMA_EN_MASK |
GLOBAL_CFG_RX_DMA_EN_MASK);
@@ -1907,6 +2188,8 @@ static int airoha_dev_stop(struct net_device *netdev)
}
}
+ airoha_update_netdev_features(dev);
+
return 0;
}
@@ -2176,6 +2459,41 @@ int airoha_get_fe_port(struct airoha_gdm_dev *dev)
}
}
+static netdev_features_t airoha_dev_fix_features(struct net_device *netdev,
+ netdev_features_t features)
+{
+ struct airoha_gdm_dev *dev = netdev_priv(netdev);
+ struct airoha_qdma *qdma = dev->qdma;
+
+ if (qdma->users > 1)
+ features &= ~NETIF_F_GRO_HW;
+
+ return features;
+}
+
+static int airoha_dev_set_features(struct net_device *netdev,
+ netdev_features_t features)
+{
+ netdev_features_t diff = netdev->features ^ features;
+ struct airoha_gdm_dev *dev = netdev_priv(netdev);
+
+ if (!(diff & NETIF_F_GRO_HW))
+ return 0;
+
+ if (!netif_running(netdev))
+ return 0;
+
+ if (features & NETIF_F_GRO_HW) {
+ airoha_dev_lro_enable(dev);
+ } else {
+ int qdma_id = dev->qdma - &dev->eth->qdma[0];
+
+ airoha_fe_lro_disable(dev->eth, qdma_id);
+ }
+
+ return 0;
+}
+
static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
struct net_device *netdev)
{
@@ -3102,6 +3420,8 @@ static const struct net_device_ops airoha_netdev_ops = {
.ndo_stop = airoha_dev_stop,
.ndo_change_mtu = airoha_dev_change_mtu,
.ndo_select_queue = airoha_dev_select_queue,
+ .ndo_fix_features = airoha_dev_fix_features,
+ .ndo_set_features = airoha_dev_set_features,
.ndo_start_xmit = airoha_dev_xmit,
.ndo_get_stats64 = airoha_dev_get_stats64,
.ndo_set_mac_address = airoha_dev_set_macaddr,
@@ -3189,11 +3509,9 @@ static int airoha_alloc_gdm_device(struct airoha_eth *eth,
netdev->ethtool_ops = &airoha_ethtool_ops;
netdev->max_mtu = AIROHA_MAX_MTU;
netdev->watchdog_timeo = 5 * HZ;
- netdev->hw_features = NETIF_F_IP_CSUM | NETIF_F_RXCSUM | NETIF_F_TSO6 |
- NETIF_F_IPV6_CSUM | NETIF_F_SG | NETIF_F_TSO |
- NETIF_F_HW_TC;
- netdev->features |= netdev->hw_features;
- netdev->vlan_features = netdev->hw_features;
+ netdev->hw_features = AIROHA_HW_FEATURES | NETIF_F_GRO_HW;
+ netdev->features |= AIROHA_HW_FEATURES;
+ netdev->vlan_features = AIROHA_HW_FEATURES;
SET_NETDEV_DEV(netdev, eth->dev);
/* reserve hw queues for HTB offloading */
diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h
index 41d2e7a1f9fb..c13757a88aba 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.h
+++ b/drivers/net/ethernet/airoha/airoha_eth.h
@@ -44,6 +44,18 @@
(_n) == 15 ? 128 : \
(_n) == 0 ? 1024 : 16)
+#define AIROHA_LRO_PAGE_ORDER order_base_2(SZ_16K / PAGE_SIZE)
+#define AIROHA_MAX_NUM_LRO_QUEUES 8
+#define AIROHA_RXQ_LRO_EN_MASK GENMASK(31, 24)
+#define AIROHA_RXQ_LRO_MAX_AGG_COUNT 64
+#define AIROHA_RXQ_LRO_MAX_AGG_TIME 100
+#define AIROHA_RXQ_LRO_MAX_AGE_TIME 2000
+
+#define AIROHA_HW_FEATURES \
+ (NETIF_F_IP_CSUM | NETIF_F_RXCSUM | \
+ NETIF_F_TSO6 | NETIF_F_IPV6_CSUM | \
+ NETIF_F_SG | NETIF_F_TSO | NETIF_F_HW_TC)
+
#define PSE_RSV_PAGES 128
#define PSE_QUEUE_RSV_PAGES 64
@@ -673,6 +685,18 @@ static inline bool airoha_is_7583(struct airoha_eth *eth)
return eth->soc->version == 0x7583;
}
+static inline bool airoha_qdma_is_lro_queue(struct airoha_queue *q)
+{
+ struct airoha_qdma *qdma = q->qdma;
+ int qid = q - &qdma->q_rx[0];
+
+ /* EN7581 SoC supports at most 8 LRO rx queues */
+ BUILD_BUG_ON(hweight32(AIROHA_RXQ_LRO_EN_MASK) >
+ AIROHA_MAX_NUM_LRO_QUEUES);
+
+ return !!(AIROHA_RXQ_LRO_EN_MASK & BIT(qid));
+}
+
int airoha_get_fe_port(struct airoha_gdm_dev *dev);
bool airoha_is_valid_gdm_dev(struct airoha_eth *eth,
struct airoha_gdm_dev *dev);
diff --git a/drivers/net/ethernet/airoha/airoha_regs.h b/drivers/net/ethernet/airoha/airoha_regs.h
index 436f3c8779c1..dfc786583774 100644
--- a/drivers/net/ethernet/airoha/airoha_regs.h
+++ b/drivers/net/ethernet/airoha/airoha_regs.h
@@ -122,6 +122,20 @@
#define CDM_CRSN_QSEL_REASON_MASK(_n) \
GENMASK(4 + (((_n) % 4) << 3), (((_n) % 4) << 3))
+#define REG_CDM_LRO_RXQ(_n, _m) (CDM_BASE(_n) + 0x78 + ((_m) & 0x4))
+#define LRO_RXQ_MASK(_n) GENMASK(4 + (((_n) & 0x3) << 3), ((_n) & 0x3) << 3)
+
+#define REG_CDM_LRO_EN(_n) (CDM_BASE(_n) + 0x80)
+#define LRO_RXQ_EN_MASK GENMASK(7, 0)
+
+#define REG_CDM_LRO_LIMIT(_n) (CDM_BASE(_n) + 0x84)
+#define CDM_LRO_AGG_NUM_MASK GENMASK(23, 16)
+#define CDM_LRO_AGG_SIZE_MASK GENMASK(15, 0)
+
+#define REG_CDM_LRO_AGE_TIME(_n) (CDM_BASE(_n) + 0x88)
+#define CDM_LRO_AGE_TIME_MASK GENMASK(31, 16)
+#define CDM_LRO_AGG_TIME_MASK GENMASK(15, 0)
+
#define REG_GDM_FWD_CFG(_n) GDM_BASE(_n)
#define GDM_PAD_EN_MASK BIT(28)
#define GDM_DROP_CRC_ERR_MASK BIT(23)
@@ -883,9 +897,15 @@
#define QDMA_ETH_RXMSG_SPORT_MASK GENMASK(25, 21)
#define QDMA_ETH_RXMSG_CRSN_MASK GENMASK(20, 16)
#define QDMA_ETH_RXMSG_PPE_ENTRY_MASK GENMASK(15, 0)
+/* RX MSG2 */
+#define QDMA_ETH_RXMSG_AGG_COUNT_MASK GENMASK(31, 24)
+#define QDMA_ETH_RXMSG_L2_LEN_MASK GENMASK(6, 0)
+/* RX MSG3 */
+#define QDMA_ETH_RXMSG_AGG_LEN_MASK GENMASK(31, 16)
+#define QDMA_ETH_RXMSG_TCP_WIN_MASK GENMASK(15, 0)
struct airoha_qdma_desc {
- __le32 rsv;
+ __le32 tcp_ts_reply;
__le32 ctrl;
__le32 addr;
__le32 data;
--
2.54.0
* [PATCH v4] net: mvneta_bm: add suspend/resume support to prevent crash after resume
From: Yun Zhou @ 2026-06-18 14:35 UTC (permalink / raw)
To: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni
Cc: netdev, linux-kernel, yun.zhou
The mvneta driver uses the hardware Buffer Manager (BM) for RX buffer
allocation. During suspend, mvneta disables its clock, causing BM to
lose all buffer address state. On resume, mvneta_bm_port_init() re-
attaches the BM pool to the NIC, but BM hardware returns stale/garbage
buffer addresses. When NAPI poll processes these buffers, DMA cache
sync hits an invalid virtual address causing a kernel panic:
Unable to handle kernel paging request at virtual address b0000080
PC is at v7_dma_inv_range
Call trace:
v7_dma_inv_range from arch_sync_dma_for_cpu+0x94/0x158
arch_sync_dma_for_cpu from __dma_sync_single_for_cpu+0xc4/0x15c
__dma_sync_single_for_cpu from mvneta_rx_swbm+0x6c8/0xf48
mvneta_rx_swbm from mvneta_poll+0x6fc/0x70c
mvneta_poll from __napi_poll.constprop.0+0x2c/0x1e0
__napi_poll.constprop.0 from net_rx_action+0x160/0x2c4
net_rx_action from handle_softirqs+0xd8/0x2b8
handle_softirqs from run_ksoftirqd+0x30/0x94
run_ksoftirqd from smpboot_thread_fn+0x100/0x204
smpboot_thread_fn from kthread+0xf4/0x110
kthread from ret_from_fork+0x14/0x28
Fix by adding suspend/resume callbacks to the BM driver:
- suspend: drain all buffers (with DMA unmapping), free the BPPE
regions, and reset pool state to FREE before stopping BM and gating
the clock.
- resume: enable the clock, reinitialize BM defaults, and restore pool
read/write pointers and size registers. Pool allocation and buffer
refill are handled by mvneta_resume() through the normal
mvneta_bm_port_init() path, which sees pools as FREE and performs
full initialization identical to probe.
Add a device_link (DL_FLAG_AUTOREMOVE_CONSUMER) in mvneta_probe to
guarantee BM resumes before mvneta and suspends after mvneta. If the
link cannot be created, fall back to SW buffer management to avoid a
potential crash on resume due to unordered PM transitions.
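The pool lifecycle described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `pool_model`, `pool_use()` and `pool_suspend()` are hypothetical names, and `calloc()` stands in for the BPPE DMA-coherent allocation. The point is that resetting the pool to FREE in suspend lets resume go through the same full-initialization path as probe.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are the driver's. */
enum pool_type { POOL_FREE, POOL_LONG };

struct pool_model {
	enum pool_type type;
	int size;	/* capacity in buffers */
	int buf_num;	/* buffers currently owned by the pool */
	void *ring;	/* stands in for the BPPE DMA-coherent region */
};

/* Models the probe-time path: only a FREE pool is allocated and filled. */
static bool pool_use(struct pool_model *p, int size)
{
	if (p->type != POOL_FREE)
		return true;	/* already set up, nothing to do */
	p->ring = calloc(size, sizeof(unsigned int));
	if (!p->ring)
		return false;
	p->size = size;
	p->buf_num = size;	/* "refill" */
	p->type = POOL_LONG;
	return true;
}

/* Models suspend: drain, free the ring, and mark the pool FREE. */
static void pool_suspend(struct pool_model *p)
{
	if (p->type == POOL_FREE)
		return;
	p->buf_num = 0;
	free(p->ring);
	p->ring = NULL;
	p->type = POOL_FREE;
}
```

Calling `pool_use()` again after `pool_suspend()` re-allocates and refills the pool, which is the behaviour the resume path relies on in this model.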
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v4:
- On device_link_add() failure, fall back to SW buffer management
(destroy pools, put BM reference, clear bm_priv) instead of merely
emitting a warning. Without the link, suspend/resume ordering is
not guaranteed and the original crash can still occur.
v3:
- Restore per-pool POOL_SIZE_REG, POOL_READ_PTR_REG, and
POOL_WRITE_PTR_REG in resume, since clock gating loses all BM
register state.
- Check device_link_add() return value and emit dev_warn on failure.
- Replace SIMPLE_DEV_PM_OPS (deprecated) with
DEFINE_SIMPLE_DEV_PM_OPS and pm_sleep_ptr(), removing the
#ifdef CONFIG_PM_SLEEP guard.
- Add dev_warn in suspend if not all buffers could be freed.
v2:
- Drain buffers via mvneta_bm_bufs_free() in suspend instead of only
stopping BM and gating the clock. This ensures proper DMA unmapping
and avoids buffer leaks.
- Free the BPPE DMA-coherent region in suspend so that resume takes
the full probe-time initialization path (alloc + fill), eliminating
the need to modify mvneta_bm_pool_create().
- Reset pool type to MVNETA_BM_FREE in suspend so mvneta_bm_pool_use()
correctly re-creates and refills pools on resume.
- Check clk_prepare_enable() return value in resume.
- Add device_link between mvneta (consumer) and mvneta_bm (supplier)
to guarantee correct suspend/resume ordering.
drivers/net/ethernet/marvell/mvneta.c | 18 ++++++++
drivers/net/ethernet/marvell/mvneta_bm.c | 58 ++++++++++++++++++++++++
2 files changed, 76 insertions(+)
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 744d6585a949..543e566425c1 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -5678,6 +5678,24 @@ static int mvneta_probe(struct platform_device *pdev)
"use SW buffer management\n");
mvneta_bm_put(pp->bm_priv);
pp->bm_priv = NULL;
+ } else if (!device_link_add(&pdev->dev,
+ &pp->bm_priv->pdev->dev,
+ DL_FLAG_AUTOREMOVE_CONSUMER)) {
+ /*
+ * Link guarantees BM resumes before mvneta.
+ * Without it, BM may not be ready when
+ * mvneta_bm_port_init() runs on resume,
+ * causing stale buffer addresses and a crash.
+ * Fall back to SW management to be safe.
+ */
+ dev_warn(&pdev->dev,
+ "failed to link to BM, use SW buffer management\n");
+ mvneta_bm_pool_destroy(pp->bm_priv,
+ pp->pool_long, 1 << pp->id);
+ mvneta_bm_pool_destroy(pp->bm_priv,
+ pp->pool_short, 1 << pp->id);
+ mvneta_bm_put(pp->bm_priv);
+ pp->bm_priv = NULL;
}
}
/* Set RX packet offset correction for platforms, whose
diff --git a/drivers/net/ethernet/marvell/mvneta_bm.c b/drivers/net/ethernet/marvell/mvneta_bm.c
index 6bb380494919..85162a43eaf6 100644
--- a/drivers/net/ethernet/marvell/mvneta_bm.c
+++ b/drivers/net/ethernet/marvell/mvneta_bm.c
@@ -477,6 +477,63 @@ static void mvneta_bm_remove(struct platform_device *pdev)
clk_disable_unprepare(priv->clk);
}
+static int mvneta_bm_suspend(struct device *dev)
+{
+ struct mvneta_bm *priv = dev_get_drvdata(dev);
+ int i;
+
+ /* Drain buffers and free pool resources while BM is still clocked */
+ for (i = 0; i < MVNETA_BM_POOLS_NUM; i++) {
+ struct mvneta_bm_pool *bm_pool = &priv->bm_pools[i];
+ int size_bytes;
+
+ if (bm_pool->type == MVNETA_BM_FREE)
+ continue;
+
+ mvneta_bm_bufs_free(priv, bm_pool, bm_pool->port_map);
+ if (bm_pool->hwbm_pool.buf_num)
+ dev_warn(&priv->pdev->dev,
+ "pool %d: %d buffers not freed\n",
+ bm_pool->id, bm_pool->hwbm_pool.buf_num);
+
+ size_bytes = sizeof(u32) * bm_pool->hwbm_pool.size;
+ dma_free_coherent(&priv->pdev->dev, size_bytes,
+ bm_pool->virt_addr, bm_pool->phys_addr);
+ bm_pool->virt_addr = NULL;
+ bm_pool->type = MVNETA_BM_FREE;
+ }
+
+ mvneta_bm_write(priv, MVNETA_BM_COMMAND_REG, MVNETA_BM_STOP_MASK);
+ clk_disable_unprepare(priv->clk);
+ return 0;
+}
+
+static int mvneta_bm_resume(struct device *dev)
+{
+ struct mvneta_bm *priv = dev_get_drvdata(dev);
+ int i, err;
+
+ err = clk_prepare_enable(priv->clk);
+ if (err)
+ return err;
+
+ /* Reinitialize BM hardware; pools are refilled by mvneta_resume() */
+ mvneta_bm_default_set(priv);
+
+ /* Restore pool registers lost during clock gating */
+ for (i = 0; i < MVNETA_BM_POOLS_NUM; i++) {
+ mvneta_bm_write(priv, MVNETA_BM_POOL_READ_PTR_REG(i), 0);
+ mvneta_bm_write(priv, MVNETA_BM_POOL_WRITE_PTR_REG(i), 0);
+ mvneta_bm_write(priv, MVNETA_BM_POOL_SIZE_REG(i),
+ priv->bm_pools[i].hwbm_pool.size);
+ }
+
+ mvneta_bm_write(priv, MVNETA_BM_COMMAND_REG, MVNETA_BM_START_MASK);
+ return 0;
+}
+
+static DEFINE_SIMPLE_DEV_PM_OPS(mvneta_bm_pm_ops, mvneta_bm_suspend, mvneta_bm_resume);
+
static const struct of_device_id mvneta_bm_match[] = {
{ .compatible = "marvell,armada-380-neta-bm" },
{ }
@@ -489,6 +546,7 @@ static struct platform_driver mvneta_bm_driver = {
.driver = {
.name = MVNETA_BM_DRIVER_NAME,
.of_match_table = mvneta_bm_match,
+ .pm = pm_sleep_ptr(&mvneta_bm_pm_ops),
},
};
--
2.43.0
* Re: [PATCH v3 1/3] net/smc: bound the wire-controlled producer cursor to the RMB
From: Dust Li @ 2026-06-18 14:29 UTC (permalink / raw)
To: hexlabsecurity, Wenjia Zhang, D. Wythe, Sidraya Jayagond
Cc: Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
Paolo Abeni, linux-kernel, linux-rdma, Jakub Kicinski, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-1-551fa514257e@proton.me>
On 2026-06-14 03:23:30, Bryam Vargas via B4 Relay wrote:
>From: Bryam Vargas <hexlabsecurity@proton.me>
>
>smc_cdc_cursor_to_host() (SMC-R) and smcd_cdc_msg_to_host() (SMC-D)
>import the peer's producer cursor from the wire into the local
>connection cursor with no upper bound against the receive buffer (RMB).
>The urgent path then uses that count as a raw index:
>
> base = conn->rmb_desc->cpu_addr + conn->rx_off;
> conn->urg_rx_byte = *(base + conn->urg_curs.count - 1);
>
>so a peer that advertises a producer cursor past rmb_desc->len reads
>out of bounds of the RMB allocation in the receive tasklet (softirq).
>
>Bound the producer cursor count to rmb_desc->len at the conversion
>boundary, for both SMC-R and SMC-D. Apply the bound to the producer
>cursor only: the consumer cursor indexes the peer's RMB and is bounded
>by peer_rmbe_size, so clamping it to our rmb_desc->len would
>under-credit peer_rmbe_space and stall transmit to a peer whose RMB is
>larger than ours.
>
>Fixes: de8474eb9d50 ("net/smc: urgent data support")
>Cc: stable@vger.kernel.org
>Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>---
> net/smc/smc_cdc.h | 27 ++++++++++++++++++++++++---
> 1 file changed, 24 insertions(+), 3 deletions(-)
>
>diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
>index 696cc11f2303..ca76ef630356 100644
>--- a/net/smc/smc_cdc.h
>+++ b/net/smc/smc_cdc.h
>@@ -221,7 +221,8 @@ static inline void smc_host_msg_to_cdc(struct smc_cdc_msg *peer,
>
> static inline void smc_cdc_cursor_to_host(union smc_host_cursor *local,
> union smc_cdc_cursor *peer,
>- struct smc_connection *conn)
>+ struct smc_connection *conn,
>+ int max_count)
> {
> union smc_host_cursor temp, old;
> union smc_cdc_cursor net;
>@@ -235,6 +236,15 @@ static inline void smc_cdc_cursor_to_host(union smc_host_cursor *local,
> if ((old.wrap == temp.wrap) &&
> (old.count > temp.count))
> return;
>+ /* The peer producer cursor is wire-controlled and is later used as a
>+ * raw index into our RMB by the urgent path; bound its count to the
>+ * RMB. max_count == 0 leaves the consumer cursor unbounded here: it
>+ * indexes the peer's RMB (bounded by peer_rmbe_size, not our
>+ * rmb_desc->len), so clamping it to rmb_desc->len would under-credit
>+ * peer_rmbe_space and stall transmit to peers with a larger RMB.
>+ */
>+ if (max_count && temp.count > max_count)
>+ temp.count = max_count;
> smc_curs_copy(local, &temp, conn);
> }
>
>@@ -246,8 +256,13 @@ static inline void smcr_cdc_msg_to_host(struct smc_host_cdc_msg *local,
> local->len = peer->len;
> local->seqno = ntohs(peer->seqno);
> local->token = ntohl(peer->token);
>- smc_cdc_cursor_to_host(&local->prod, &peer->prod, conn);
>- smc_cdc_cursor_to_host(&local->cons, &peer->cons, conn);
>+ /* bound the wire-controlled producer cursor to our RMB (used as a raw
>+ * index by the urgent path); leave the consumer cursor unbounded -- it
>+ * indexes the peer's RMB and is bounded by peer_rmbe_size.
>+ */
>+ smc_cdc_cursor_to_host(&local->prod, &peer->prod, conn,
>+ conn->rmb_desc->len);
>+ smc_cdc_cursor_to_host(&local->cons, &peer->cons, conn, 0);
> local->prod_flags = peer->prod_flags;
> local->conn_state_flags = peer->conn_state_flags;
> }
>@@ -260,6 +275,12 @@ static inline void smcd_cdc_msg_to_host(struct smc_host_cdc_msg *local,
>
> temp.wrap = peer->prod.wrap;
> temp.count = peer->prod.count;
>+ /* the peer producer cursor is wire-controlled and is used as a raw
>+ * index into our RMB by the urgent path; bound it to the RMB. The
>+ * consumer cursor below indexes the peer's RMB and is left unbounded.
>+ */
>+ if (temp.count > conn->rmb_desc->len)
>+ temp.count = conn->rmb_desc->len;
> smc_curs_copy(&local->prod, &temp, conn);
>
> temp.wrap = peer->cons.wrap;
Hi Bryam,
I agree the issue is real. SMC-R's original design didn't fully
account for misbehaving peers, which is the root cause behind a
number of similar issues we've seen. The good news is that this
class of problem isn't easy to hit in practice, so it isn't
particularly urgent.
On the approach itself: once we detect that the peer is misbehaving,
I think the right action is to abort the connection and record the
event, rather than silently clamp. An invalid CDC means the whole
communication state can no longer be trusted, so continuing on a
clamped value just papers over a peer bug.
I'd suggest we add a dedicated CDC message check, and route any
failure through the existing abort path, maybe something like below:
static bool smc_cdc_msg_check(struct smc_connection *conn,
			      struct smc_cdc_msg *cdc)
{
	/* cursor counts are 32-bit big-endian fields on the wire */
	u32 prod_count = ntohl(cdc->prod.count);
	u32 cons_count = ntohl(cdc->cons.count);

	if (prod_count > conn->rmb_desc->len ||
	    cons_count > conn->peer_rmbe_size ||
	    cdc->prod.wrap > 1 || cdc->cons.wrap > 1) {
		this_cpu_inc(net->smc.smc_stats->...cdc_inval);
		net_ratelimited_function(pr_warn,
			"smc: invalid CDC from peer (token=%u)\n",
			ntohl(cdc->token));
		return false;
	}
	return true;
}
For -stable, your current minimal patch is fine. For net-next, though, I'd prefer
the approach above: validate at the wire boundary, abort on violation, and
make the event observable via smc_stats and a ratelimited warning.
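To make the "validate at the wire boundary" idea concrete, here is a self-contained userspace sketch. It is not SMC code: `wire_cursor`, `wire_cdc` and `cdc_msg_valid()` are hypothetical, and the field layout (16-bit wrap, 32-bit big-endian count) is an assumption made for the example. The function rejects the message instead of clamping, so the caller can decide to abort.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wire cursor: 16-bit wrap and 32-bit count, big-endian. */
struct wire_cursor {
	uint16_t reserved;
	uint16_t wrap;
	uint32_t count;
};

struct wire_cdc {
	struct wire_cursor prod;
	struct wire_cursor cons;
};

/* Returns true only if both cursors fit the buffers they index. */
static bool cdc_msg_valid(const struct wire_cdc *cdc,
			  uint32_t rmb_len, uint32_t peer_rmbe_size)
{
	uint32_t prod_count = ntohl(cdc->prod.count);
	uint32_t cons_count = ntohl(cdc->cons.count);

	return prod_count <= rmb_len && cons_count <= peer_rmbe_size;
}
```

A caller in this model would treat a `false` return as "peer state can no longer be trusted" and take its abort path, rather than continuing on a clamped value.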
Best regards,
Dust
* [PATCH net] net: au1000: move free_irq out of the close-time spinlocked section
From: Runyu Xiao @ 2026-06-18 14:19 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, jianhao.xu, runyu.xiao, stable
au1000_close() calls free_irq() while aup->lock is still held with
spin_lock_irqsave(). free_irq() can sleep because it takes the IRQ
descriptor request mutex, so it does not belong inside the close-time
spinlocked section.
This issue was found by our static analysis tool and then manually
reviewed against the current tree.
The grounded PoC kept the ndo_stop carrier and the au1000_close() ->
free_irq(dev->irq, dev) path while the driver lock was held. Lockdep
reported:
BUG: sleeping function called from invalid context
1 lock held by exploit/192:
#0: (&aup->lock){....}-{2:2}, at: au1000_close+0x23/0x83 [vuln_msv]
[ BUG: Invalid wait context ]
exploit/192 is trying to lock:
(&desc->request_mutex){+.+.}-{3:3}, at: free_irq+0x63/0x360
free_irq+0x63/0x360
au1000_close+0x65/0x83 [vuln_msv]
Drop aup->lock before freeing the IRQ. The protected close-time work
still stops the device and queue before IRQ teardown, but the sleepable
IRQ core path now runs outside the spinlocked section.
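The ordering change can be illustrated with a tiny userspace model. Everything here is hypothetical (`dev_model`, `model_free_irq()`, `model_close()`); a boolean stands in for aup->lock, and the model simply records whether the sleepable call ever ran while the lock was held.

```c
#include <stdbool.h>

/* Hypothetical model: a flag stands in for aup->lock being held. */
struct dev_model {
	bool lock_held;
	bool slept_under_lock;	/* set if the sleepable call ran while locked */
	bool irq_freed;
};

/* Stands in for free_irq(), which may sleep. */
static void model_free_irq(struct dev_model *d)
{
	if (d->lock_held)
		d->slept_under_lock = true;
	d->irq_freed = true;
}

/* Close path with the ordering from the patch: unlock, then free the IRQ. */
static void model_close(struct dev_model *d)
{
	d->lock_held = true;	/* spin_lock_irqsave() */
	/* ... stop the device and the queue under the lock ... */
	d->lock_held = false;	/* spin_unlock_irqrestore() */
	model_free_irq(d);
}
```

With this ordering `slept_under_lock` stays false; swapping the last two statements of `model_close()` would set it, which is the situation the lockdep report describes.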
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
drivers/net/ethernet/amd/au1000_eth.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/amd/au1000_eth.c b/drivers/net/ethernet/amd/au1000_eth.c
index 9d35ac348ebe..5a04056e38fa 100644
--- a/drivers/net/ethernet/amd/au1000_eth.c
+++ b/drivers/net/ethernet/amd/au1000_eth.c
@@ -943,9 +943,10 @@ static int au1000_close(struct net_device *dev)
/* stop the device */
netif_stop_queue(dev);
+ spin_unlock_irqrestore(&aup->lock, flags);
+
/* disable the interrupt */
free_irq(dev->irq, dev);
- spin_unlock_irqrestore(&aup->lock, flags);
return 0;
}
--
2.34.1
* [PATCH iwl-next v2] ixgbe: Implement PCI reset handler
From: Sergey Temerkhanov @ 2026-06-18 14:22 UTC (permalink / raw)
To: intel-wired-lan; +Cc: netdev, pmenzel
Implement PCI device reset handlers so that the network device is
re-initialized and keeps functioning after a PCI-level reset.
This avoids the TX queue timeouts that occur when a PCI reset is
initiated via sysfs while the adapter is in operation.
Signed-off-by: Sergey Temerkhanov <sergey.temerkhanov@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
Previous version: https://lore.kernel.org/netdev/MW4PR11MB6864BC9CA84F060AF7E0248480E42@MW4PR11MB6864.namprd11.prod.outlook.com/
v1->v2 changes: Rearranged the order of operations, switched to poll_timeout_us() macro
drivers/net/ethernet/intel/ixgbe/ixgbe.h | 1 +
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 82 +++++++++++++++++++
2 files changed, 83 insertions(+)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 594ccb28da20..c4b0c5bb89c6 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -912,6 +912,7 @@ enum ixgbe_state_t {
__IXGBE_PTP_TX_IN_PROGRESS,
__IXGBE_RESET_REQUESTED,
__IXGBE_PHY_INIT_COMPLETE,
+ __IXGBE_PCIE_RESET_IN_PROGRESS,
};
struct ixgbe_cb {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2ac274c73d61..0fb64aef223e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -12352,6 +12352,86 @@ static pci_ers_result_t ixgbe_io_slot_reset(struct pci_dev *pdev)
return result;
}
+/* 1500 us poll interval */
+#define IXGBE_RESET_PREP_POLL_INTERVAL_US 1500
+/* 2 second timeout to acquire reset lock before proceeding */
+#define IXGBE_RESET_PREP_TIMEOUT_US 2000000
+
+/**
+ * ixgbe_reset_prep - called before the pci bus is reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Prepare the card for a reset, preventing the service task from running.
+ */
+static void ixgbe_reset_prep(struct pci_dev *pdev)
+{
+ struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
+
+ if (!adapter)
+ return;
+
+ if (poll_timeout_us(test_and_set_bit(__IXGBE_RESETTING, &adapter->state),
+ test_bit(__IXGBE_RESETTING, &adapter->state),
+ IXGBE_RESET_PREP_POLL_INTERVAL_US,
+ IXGBE_RESET_PREP_TIMEOUT_US, false)) {
+ /* ixgbe_reset_done() will exit early if this happens.
+ * A retry will be needed
+ */
+ e_err(drv, "Timed out waiting for __IXGBE_RESETTING to be released. Reset is needed\n");
+ return;
+ }
+
+ /* Sync __IXGBE_RESETTING */
+ smp_mb__after_atomic();
+
+ if (test_bit(__IXGBE_SERVICE_INITED, &adapter->state)) {
+ /* Prevent the service task from being requeued in the timer callback */
+ timer_delete_sync(&adapter->service_timer);
+ /* Cancel any possibly queued service task */
+ cancel_work_sync(&adapter->service_task);
+ }
+
+ pci_clear_master(pdev);
+
+ set_bit(__IXGBE_PCIE_RESET_IN_PROGRESS, &adapter->state);
+}
+
+/**
+ * ixgbe_reset_done - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Allow the service task to run and schedule re-initialization.
+ */
+static void ixgbe_reset_done(struct pci_dev *pdev)
+{
+ struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
+
+ if (!adapter)
+ return;
+
+ if (!test_and_clear_bit(__IXGBE_PCIE_RESET_IN_PROGRESS, &adapter->state)) {
+ /* Should never get here */
+ e_err(drv, "Reset done called without PCIe reset in progress\n");
+ return;
+ }
+
+ pci_set_master(pdev);
+
+ /* Allow the service task to run */
+ if (!test_bit(__IXGBE_REMOVING, &adapter->state)) {
+ clear_bit(__IXGBE_RESETTING, &adapter->state);
+ /* Sync __IXGBE_RESETTING */
+ smp_mb__after_atomic();
+ }
+
+ /* Schedule re-initialization */
+ if (!test_bit(__IXGBE_DOWN, &adapter->state)) {
+ set_bit(__IXGBE_RESET_REQUESTED, &adapter->state);
+ if (test_bit(__IXGBE_SERVICE_INITED, &adapter->state))
+ mod_timer(&adapter->service_timer, jiffies + 1);
+ }
+}
+
/**
* ixgbe_io_resume - called when traffic can start flowing again.
* @pdev: Pointer to PCI device
@@ -12384,6 +12464,8 @@ static const struct pci_error_handlers ixgbe_err_handler = {
.error_detected = ixgbe_io_error_detected,
.slot_reset = ixgbe_io_slot_reset,
.resume = ixgbe_io_resume,
+ .reset_prepare = ixgbe_reset_prep,
+ .reset_done = ixgbe_reset_done,
};
static DEFINE_SIMPLE_DEV_PM_OPS(ixgbe_pm_ops, ixgbe_suspend, ixgbe_resume);
--
2.53.0
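As a reading aid for the prepare/done handshake, the following userspace sketch models the intent stated in the patch comments: take the "resetting" bit with a bounded number of attempts, mark the PCIe reset as in progress, and only undo that state in the done step if prepare succeeded. It is not a transcription of the poll_timeout_us() call; the bit names, `adapter_model` and the retry count are assumptions for illustration, and C11 atomics stand in for the kernel bitops.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit names; they mirror the roles of the driver's state bits. */
enum { ST_RESETTING = 1u << 0, ST_PCIE_RESET = 1u << 1 };

struct adapter_model {
	atomic_uint state;
};

/* Try to take the RESETTING bit, giving up after max_tries attempts. */
static bool model_reset_prep(struct adapter_model *a, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		unsigned int old = atomic_fetch_or(&a->state, ST_RESETTING);

		if (!(old & ST_RESETTING)) {
			/* we own the bit now: mark the PCIe reset as in progress */
			atomic_fetch_or(&a->state, ST_PCIE_RESET);
			return true;
		}
	}
	return false;	/* timed out: PCIE_RESET stays clear */
}

/* Only undo the reset state if the prepare step actually succeeded. */
static bool model_reset_done(struct adapter_model *a)
{
	unsigned int old = atomic_fetch_and(&a->state,
					    ~(unsigned int)ST_PCIE_RESET);

	if (!(old & ST_PCIE_RESET))
		return false;
	atomic_fetch_and(&a->state, ~(unsigned int)ST_RESETTING);
	return true;
}
```

In this model a prepare that times out leaves the in-progress bit clear, so the matching done call exits early, as the comment in ixgbe_reset_prep() says it should.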
* Re: [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Runyu Xiao @ 2026-06-18 14:16 UTC (permalink / raw)
To: Mahanta Jambigi, D. Wythe, Dust Li, Sidraya Jayagond,
Wenjia Zhang, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Tony Lu, Wen Gu, Simon Horman, Karsten Graul, linux-rdma,
linux-s390, netdev, linux-kernel, jianhao.xu, runyu.xiao
In-Reply-To: <f7e36176-d00a-4471-94ed-d385e579b43d@linux.ibm.com>
Hi,
Thanks for taking a look.
The exact Lockdep stack I have is from the grounded reproducer, not from
a production SMC setup. The reproducer keeps the same callback shape:
the close/flush side holds sk_callback_lock and invokes the installed
sk_data_ready callback, which re-enters smc_clcsock_data_ready() and tries
to take sk_callback_lock again.
The relevant Lockdep report is:
WARNING: possible recursive locking detected
kworker/u4:3/39 is trying to acquire lock:
(sk_callback_lock) at smc_clcsock_data_ready+0xa/0x4d
but task is already holding lock:
(sk_callback_lock) at smc_close_flush_work+0xc/0x30
Possible unsafe locking scenario:
CPU0
----
lock(sk_callback_lock);
lock(sk_callback_lock);
*** DEADLOCK ***
Workqueue: smc_close_wq smc_close_flush_work
Call Trace:
dump_stack_lvl
__lock_acquire
lock_acquire
_raw_read_lock_bh
smc_clcsock_data_ready+0xa/0x4d
smc_close_flush_work+0x1f/0x30
process_one_work
worker_thread
kthread
ret_from_fork
The nvmet change I referred to is:
2fa8961d3a6a ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")
The stable/backport patch I originally used as the reference is:
1c90f930e7b4 ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")
Its commit message says that when the socket is closed while in
TCP_LISTEN, the flush callback can call nvmet_tcp_listen_data_ready()
with sk_callback_lock already held, so nvmet moved the TCP_LISTEN check
before taking sk_callback_lock.
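The "check the state before taking the lock" pattern can be sketched in plain C. This is a hypothetical model, not the SMC or nvmet code: `sock_model` tracks a lock depth counter in place of sk_callback_lock, and a depth above one represents the recursive acquisition that lockdep complains about.

```c
#include <stdbool.h>

enum sock_state { ST_LISTEN, ST_CLOSE };	/* hypothetical subset */

struct sock_model {
	enum sock_state state;
	int lock_depth;		/* >1 would mean a recursive acquisition */
	bool recursed;
	int work_queued;
};

/* Listen callback: bail out on non-LISTEN sockets before taking the lock. */
static void model_data_ready(struct sock_model *sk)
{
	if (sk->state != ST_LISTEN)
		return;
	if (++sk->lock_depth > 1)
		sk->recursed = true;
	sk->work_queued++;
	sk->lock_depth--;
}

/* Close/flush side: invokes the callback with the lock already held. */
static void model_close_flush(struct sock_model *sk)
{
	sk->state = ST_CLOSE;
	sk->lock_depth++;
	model_data_ready(sk);
	sk->lock_depth--;
}
```

In the model, the flush side never triggers a second acquisition because the callback returns before touching the lock once the socket has left the listen state.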
For the TCP_LISTEN check: my reasoning was that smc_clcsock_data_ready()
is installed by smc_listen() on the underlying TCP listen socket and only
queues smc_tcp_listen_work() for the SMC listen/accept path. Once that
underlying socket is no longer in TCP_LISTEN, there should be no SMC
listen accept work to queue from this callback. TCP_SYN_RECV and
TCP_ESTABLISHED are not listen-socket states for this callback path, so I
did not intend the callback to queue listen work for those states.
That said, if SMC expects smc_clcsock_data_ready() to handle a non-LISTEN
state during fallback or another transition, then the proposed check is
too strict and I should rework the fix.
Thanks,
Runyu
* [PATCH net] net: dsa: realtek: fix memory leak in rtl8366rb_setup_led()
From: David Yang @ 2026-06-18 14:01 UTC (permalink / raw)
To: netdev
Cc: David Yang, Linus Walleij, Alvin Šipraga, Andrew Lunn,
Vladimir Oltean, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Luiz Angelo Daros de Luca, linux-kernel
led_classdev_register_ext() only reads init_data.devicename; it never
stores the pointer. The caller, however, allocates devicename with
kasprintf() and never frees it, leaking the string.
Fix it with a stack buffer to avoid dynamic buffers completely.
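A userspace sketch of the stack-buffer approach, assuming (as the description states) that the registration step copies the name rather than keeping the pointer. `led_model`, `model_register()` and `model_setup_led()` are hypothetical names; only the format string is taken from the patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical registrar: copies the name and keeps no pointer to it. */
struct led_model {
	char name[64];
};

static void model_register(struct led_model *led, const char *devicename)
{
	snprintf(led->name, sizeof(led->name), "%s", devicename);
}

/* Builds the name on the stack, so there is nothing to free afterwards. */
static void model_setup_led(struct led_model *led, int sw, int port, int group)
{
	char name[64];

	snprintf(name, sizeof(name), "Realtek-%d:0%d:%d", sw, port, group);
	model_register(led, name);
}
```

Because the temporary lives on the stack, every return path of `model_setup_led()` is leak-free without extra cleanup code.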
Fixes: 32d617005475 ("net: dsa: realtek: add LED drivers for rtl8366rb")
Signed-off-by: David Yang <mmyangfl@gmail.com>
---
drivers/net/dsa/realtek/rtl8366rb-leds.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/dsa/realtek/rtl8366rb-leds.c b/drivers/net/dsa/realtek/rtl8366rb-leds.c
index 509ffd3f8db5..ba50d311cb15 100644
--- a/drivers/net/dsa/realtek/rtl8366rb-leds.c
+++ b/drivers/net/dsa/realtek/rtl8366rb-leds.c
@@ -89,6 +89,7 @@ static int rtl8366rb_setup_led(struct realtek_priv *priv, struct dsa_port *dp,
struct led_init_data init_data = { };
enum led_default_state state;
struct rtl8366rb_led *led;
+ char name[64];
u32 led_group;
int ret;
@@ -129,10 +130,9 @@ static int rtl8366rb_setup_led(struct realtek_priv *priv, struct dsa_port *dp,
init_data.fwnode = led_fwnode;
init_data.devname_mandatory = true;
- init_data.devicename = kasprintf(GFP_KERNEL, "Realtek-%d:0%d:%d",
- dp->ds->index, dp->index, led_group);
- if (!init_data.devicename)
- return -ENOMEM;
+ snprintf(name, sizeof(name), "Realtek-%d:0%d:%d",
+ dp->ds->index, dp->index, led_group);
+ init_data.devicename = name;
ret = devm_led_classdev_register_ext(priv->dev, &led->cdev, &init_data);
if (ret) {
--
2.53.0
* Re: [PATCH net] octeontx2-af: npc: cn20k: fix NPC defrag
From: Simon Horman @ 2026-06-18 14:00 UTC (permalink / raw)
To: Ratheesh Kannoth
Cc: kuba, linux-kernel, netdev, andrew+netdev, davem, edumazet,
pabeni, sgoutham
In-Reply-To: <20260617102149.1309913-1-rkannoth@marvell.com>
On Wed, Jun 17, 2026 at 03:51:49PM +0530, Ratheesh Kannoth wrote:
> npc_defrag_alloc_free_slots() always passed NPC_MCAM_KEY_X2 into
> __npc_subbank_alloc(), which must match sb->key_type, so defrag never
> allocated replacement slots on X4 banks. Pass the subbank key type for
> bank 0, and only extend the search into bank 1 for X2 (X4 MCAM indices
> are confined to b0b..b0t).
>
> Fixes: 645c6e3c1999 ("octeontx2-af: npc: cn20k: virtual index support")
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
* Re: [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Zhou, Yun @ 2026-06-18 13:56 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
maxime.chevallier, netdev, linux-kernel, yun.zhou
In-Reply-To: <20260618125128.h5g-StPH@linutronix.de>
On 6/18/2026 8:51 PM, Sebastian Andrzej Siewior wrote:
>
> On 2026-06-18 18:43:51 [+0800], Yun Zhou wrote:
>> --- a/drivers/net/ethernet/marvell/mvneta.c
>> +++ b/drivers/net/ethernet/marvell/mvneta.c
>> @@ -5907,6 +5907,9 @@ static int mvneta_resume(struct device *device)
>> rtnl_unlock();
>> mvneta_set_rx_mode(dev);
>>
>> + if (!pp->neta_armada3700)
>> + on_each_cpu(mvneta_percpu_enable, pp, true);
>> +
>> return 0;
>> }
>> #endif
>
> This does not look symmetrical. I wouldn't mind if mvneta_suspend()
> would have the matching disable but this isn't the case.
> But if the thread is idle then you have one enable too many, don't you?
> Well you have the NAPI callback which does disable on the local CPU and
> this resume which enables it on every CPU. So this does not look right.
>
The enable in resume is intentionally unconditional and idempotent
(writing MPIC_INT_CLEAR_MASK on an already unmasked IRQ is a no-op).
> The interesting question is what happens to the enable_percpu_irq() from
> the mvneta_poll(). Is it lost? And if so, how/ why?
>
The enable_percpu_irq() from mvneta_poll is not "lost" — it never
gets a chance to execute. The sequence is:
1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
2. PM freezes kthreads (on PREEMPT_RT, softirq runs in kthread)
3. NAPI poll cannot run → enable_percpu_irq() is never called
4. mvneta_stop_dev → napi_disable(): cancels the scheduled poll
but does NOT execute the completion path (no enable_percpu_irq)
5. Resume → napi_enable(): resets NAPI state but MPIC stays masked
The unconditional enable in resume covers this case. When NAPI was
idle at suspend time, the extra enable is harmless.
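The sequence above can be replayed with a small single-CPU model. The names (`irq_model`, `model_isr()` and so on) are hypothetical and the model ignores everything except the mask bit and the "poll scheduled" flag; it is only meant to show why the mask survives suspend and why an unconditional unmask in resume is safe to repeat.

```c
#include <stdbool.h>

/* Hypothetical single-CPU model of the mask state and NAPI scheduling. */
struct irq_model {
	bool masked;
	bool napi_scheduled;
};

static void model_isr(struct irq_model *m)
{
	m->masked = true;		/* disable_percpu_irq() */
	m->napi_scheduled = true;	/* napi_schedule() */
}

/* Normal completion: the poll runs and unmasks the interrupt. */
static void model_poll(struct irq_model *m)
{
	if (!m->napi_scheduled)
		return;
	m->napi_scheduled = false;
	m->masked = false;		/* enable_percpu_irq() */
}

/* Suspend path: the pending poll is cancelled and never unmasks. */
static void model_napi_disable(struct irq_model *m)
{
	m->napi_scheduled = false;
}

/* Resume with the unconditional unmask; harmless if already unmasked. */
static void model_resume(struct irq_model *m)
{
	m->masked = false;
}
```

Running `model_isr()`, then `model_napi_disable()`, leaves `masked` set; `model_resume()` clears it, and calling it a second time changes nothing.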
BR,
Yun
* [PATCH net v2 10/10] rxrpc: Fix leak of released call in recvmsg(MSG_PEEK)
From: David Howells @ 2026-06-18 13:48 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
Jeffrey Altman, stable
In-Reply-To: <20260618134802.2477777-1-dhowells@redhat.com>
Fix rxrpc_recvmsg() to also drop the ref it holds on an already-released
call if MSG_PEEK is in force (the function holds a ref on the call
irrespective of whether MSG_PEEK is specified or not).
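A minimal reference-count model of the fix, with hypothetical names (`call_model`, `model_recvmsg_once()`). It simplifies the real function to one pass and assumes, as stated above, that a reference is held whether or not MSG_PEEK is set; the released path therefore drops it unconditionally.

```c
#include <stdbool.h>

struct call_model {
	int refs;
	bool released;
};

/*
 * One pass over a call picked from the queue.  The reference is held
 * whether or not peeking, so the released path must always drop it.
 */
static void model_recvmsg_once(struct call_model *call, bool peek)
{
	(void)peek;		/* peek no longer influences the put */
	call->refs++;		/* simplification: ref held for this pass */
	if (call->released) {
		call->refs--;	/* dropped irrespective of peek */
		return;		/* the real code retries from the top */
	}
	/* ... deliver data ... */
	call->refs--;
}
```

With the put made unconditional, the count returns to its starting value for both the peek and the non-peek case in this model.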
Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Fixes: 962fb1f651c2 ("rxrpc: Fix recv-recv race of completed call")
---
net/rxrpc/recvmsg.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 9962e135cb73..efcba4b2e74f 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -529,8 +529,7 @@ int rxrpc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
if (test_bit(RXRPC_CALL_RELEASED, &call->flags)) {
rxrpc_see_call(call, rxrpc_call_see_already_released);
mutex_unlock(&call->user_mutex);
- if (!(flags & MSG_PEEK))
- rxrpc_put_call(call, rxrpc_call_put_recvmsg);
+ rxrpc_put_call(call, rxrpc_call_put_recvmsg);
goto try_again;
}
* [PATCH net v2 09/10] rxrpc: Fix socket notification race
From: David Howells @ 2026-06-18 13:48 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
Jeffrey Altman, stable
In-Reply-To: <20260618134802.2477777-1-dhowells@redhat.com>
There's a race between rxrpc_recvmsg() and rxrpc_notify_socket(), whereby
the latter's attempt to avoid disabling interrupts and taking the socket's
recvmsg_lock if the call is already queued may happen simultaneously with
the former's discarding of a call that has nothing queued.
Fix this by removing the shortcut. Note that this only affects userspace's
use of AF_RXRPC; the AFS filesystem driver doesn't use the socket queue.
Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
net/rxrpc/recvmsg.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index f382a47c6eb0..9962e135cb73 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -27,8 +27,6 @@ void rxrpc_notify_socket(struct rxrpc_call *call)
_enter("%d", call->debug_id);
- if (!list_empty(&call->recvmsg_link))
- return;
if (test_bit(RXRPC_CALL_RELEASED, &call->flags)) {
rxrpc_see_call(call, rxrpc_call_see_notify_released);
return;
* [PATCH net v2 08/10] rxrpc: Fix potential infinite loop in rxrpc_recvmsg()
From: David Howells @ 2026-06-18 13:47 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
Jeffrey Altman, stable
In-Reply-To: <20260618134802.2477777-1-dhowells@redhat.com>
Fix the wait in rxrpc_recvmsg() to also check the oob queue.
Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
net/rxrpc/recvmsg.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 39a03684432d..f382a47c6eb0 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -438,7 +438,8 @@ int rxrpc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
return -EAGAIN;
}
- if (list_empty(&rx->recvmsg_q)) {
+ if (list_empty(&rx->recvmsg_q) &&
+ skb_queue_empty_lockless(&rx->recvmsg_oobq)) {
ret = -EWOULDBLOCK;
if (timeo == 0) {
call = NULL;
* [PATCH net v2 07/10] rxrpc: Fix oob challenge leak in cleanup after notification failure
From: David Howells @ 2026-06-18 13:47 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
Jeffrey Altman, stable
In-Reply-To: <20260618134802.2477777-1-dhowells@redhat.com>
Fix rxrpc_notify_socket_oob() to return an indication of failure in the
event that it failed to queue a packet and fix rxrpc_post_challenge() to
clean up the connection ref in such an event.
Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
net/rxrpc/ar-internal.h | 4 ++--
net/rxrpc/conn_event.c | 9 +++++++--
net/rxrpc/oob.c | 7 +++++--
3 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 98f2165159d7..ead3419f08b7 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -1355,9 +1355,9 @@ static inline struct rxrpc_net *rxrpc_net(struct net *net)
}
/*
- * out_of_band.c
+ * oob.c
*/
-void rxrpc_notify_socket_oob(struct rxrpc_call *call, struct sk_buff *skb);
+bool rxrpc_notify_socket_oob(struct rxrpc_call *call, struct sk_buff *skb);
void rxrpc_add_pending_oob(struct rxrpc_sock *rx, struct sk_buff *skb);
int rxrpc_sendmsg_oob(struct rxrpc_sock *rx, struct msghdr *msg, size_t len);
diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
index c96ca615b787..611c790bc6d0 100644
--- a/net/rxrpc/conn_event.c
+++ b/net/rxrpc/conn_event.c
@@ -436,7 +436,7 @@ static bool rxrpc_post_challenge(struct rxrpc_connection *conn,
struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
struct rxrpc_call *call = NULL;
struct rxrpc_sock *rx;
- bool respond = false;
+ bool respond = false, queued = false;
sp->chall.conn =
rxrpc_get_connection(conn, rxrpc_conn_get_challenge_input);
@@ -472,8 +472,13 @@ static bool rxrpc_post_challenge(struct rxrpc_connection *conn,
}
if (call)
- rxrpc_notify_socket_oob(call, skb);
+ queued = rxrpc_notify_socket_oob(call, skb);
rcu_read_unlock();
+ if (call && !queued) {
+ rxrpc_put_connection(conn, rxrpc_conn_put_challenge_input);
+ sp->chall.conn = NULL;
+ return false;
+ }
if (!call)
rxrpc_post_packet_to_conn(conn, skb);
diff --git a/net/rxrpc/oob.c b/net/rxrpc/oob.c
index 3318c8bd82ad..c80ee2487d09 100644
--- a/net/rxrpc/oob.c
+++ b/net/rxrpc/oob.c
@@ -32,11 +32,12 @@ struct rxrpc_oob_params {
* Post an out-of-band message for attention by the socket or kernel service
* associated with a reference call.
*/
-void rxrpc_notify_socket_oob(struct rxrpc_call *call, struct sk_buff *skb)
+bool rxrpc_notify_socket_oob(struct rxrpc_call *call, struct sk_buff *skb)
{
struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
struct rxrpc_sock *rx;
struct sock *sk;
+ bool queued = false;
rcu_read_lock();
@@ -49,6 +50,7 @@ void rxrpc_notify_socket_oob(struct rxrpc_call *call, struct sk_buff *skb)
skb->skb_mstamp_ns = rx->oob_id_counter++;
rxrpc_get_skb(skb, rxrpc_skb_get_post_oob);
skb_queue_tail(&rx->recvmsg_oobq, skb);
+ queued = true;
trace_rxrpc_notify_socket(call->debug_id, sp->hdr.serial);
if (rx->app_ops)
@@ -56,11 +58,12 @@ void rxrpc_notify_socket_oob(struct rxrpc_call *call, struct sk_buff *skb)
}
spin_unlock_irq(&rx->recvmsg_lock);
- if (!rx->app_ops && !sock_flag(sk, SOCK_DEAD))
+ if (queued && !rx->app_ops && !sock_flag(sk, SOCK_DEAD))
sk->sk_data_ready(sk);
}
rcu_read_unlock();
+ return queued;
}
/*
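The ownership rule this patch describes (take a ref before handing the packet over, and drop it again if the hand-over reports failure) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct conn`, `struct msg`, `struct queue`, `conn_get()`, `conn_put()`, `notify_oob()` and `post_challenge()` are hypothetical stand-ins, not the rxrpc structures or functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a refcounted connection, a packet and a queue. */
struct conn  { int refs; };
struct msg   { struct conn *conn; };
struct queue { bool open; size_t len; };

static struct conn *conn_get(struct conn *c) { c->refs++; return c; }
static void conn_put(struct conn *c) { c->refs--; }

/* Returns true only if the message was actually queued. */
static bool notify_oob(struct queue *q, struct msg *m)
{
	(void)m;
	if (!q->open)
		return false;	/* receiver gone: nothing was queued */
	q->len++;
	return true;
}

/* The caller owns the ref it took until the queue accepts the message. */
static bool post_challenge(struct queue *q, struct conn *c, struct msg *m)
{
	m->conn = conn_get(c);
	if (!notify_oob(q, m)) {
		conn_put(m->conn);	/* undo the ref taken above */
		m->conn = NULL;
		return false;
	}
	return true;
}
```

With a void-returning notify function the caller cannot tell the two outcomes apart, so the ref taken before the call would be leaked whenever the queue refuses the message.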
* [PATCH net v2 06/10] rxrpc: Fix the reception of a reply packet before data transmission
From: David Howells @ 2026-06-18 13:47 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
Jeffrey Altman, stable
In-Reply-To: <20260618134802.2477777-1-dhowells@redhat.com>
Fix rxrpc_receiving_reply() to handle the reception of an apparent reply
DATA packet before rxrpc has had a chance to send any request DATA packets
on a client call by checking to see if the call has been exposed yet by
sending the first packet.
Without this, rxrpc_rotate_tx_window() might oops.
Also fix rxrpc_rotate_tx_window() to handle the Tx queue being empty by
changing the do...while loop into a while loop, just in case a call is
abnormally terminated by an early reply before the last request packet is
transmitted.
Fixes: b341a0263b1b ("rxrpc: Implement progressive transmission queue struct")
Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
net/rxrpc/input.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 37881dffa898..01ccd2d2fe92 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -247,7 +247,7 @@ static bool rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to,
tq = call->tx_queue;
}
- do {
+ while (before_eq(seq, to)) {
unsigned int ix = seq - call->tx_qbase;
_debug("tq=%x seq=%x i=%d f=%x", tq->qbase, seq, ix, tq->bufs[ix]->flags);
@@ -317,8 +317,7 @@ static bool rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to,
break;
}
}
-
- } while (before_eq(seq, to));
+ }
if (trace)
trace_rxrpc_rack_update(call, summary);
@@ -392,6 +391,14 @@ static bool rxrpc_receiving_reply(struct rxrpc_call *call)
trace_rxrpc_timer_can(call, rxrpc_timer_trace_delayed_ack);
}
+ /* Deal with an apparent reply coming in before we've got the request
+ * queued or transmitted.
+ */
+ if (!test_bit(RXRPC_CALL_EXPOSED, &call->flags)) {
+ rxrpc_proto_abort(call, top, rxrpc_eproto_early_reply);
+ return false;
+ }
+
if (!test_bit(RXRPC_CALL_TX_LAST, &call->flags)) {
if (!rxrpc_rotate_tx_window(call, top, &summary)) {
rxrpc_proto_abort(call, top, rxrpc_eproto_early_reply);
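The difference between the two loop forms can be seen in a small standalone sketch. It assumes a wrap-safe "less than or equal" comparison on 32-bit sequence numbers written in the usual serial-number style; `seq_before_eq()`, `rotate_do_while()` and `rotate_while()` are hypothetical names for illustration, not the rxrpc code.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t seq_t;

/*
 * Wrap-safe "a <= b" for 32-bit sequence numbers. Relies on the usual
 * two's-complement conversion of the difference to a signed type.
 */
static bool seq_before_eq(seq_t a, seq_t b)
{
	return (int32_t)(a - b) <= 0;
}

/* do...while form: the body runs once even when the range is empty. */
static unsigned int rotate_do_while(seq_t seq, seq_t to)
{
	unsigned int visited = 0;

	do {
		visited++;	/* stands in for touching a queue slot */
		seq++;
	} while (seq_before_eq(seq, to));
	return visited;
}

/* while form: an empty range (seq already past 'to') visits nothing. */
static unsigned int rotate_while(seq_t seq, seq_t to)
{
	unsigned int visited = 0;

	while (seq_before_eq(seq, to)) {
		visited++;
		seq++;
	}
	return visited;
}
```

For a non-empty range both forms visit the same number of slots. They only differ when the starting sequence number is already past the target, which is the empty-queue case the commit message is concerned with.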
* [PATCH net v2 04/10] afs: Fix further netns teardown to cancel the preallocation charger
From: David Howells @ 2026-06-18 13:47 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
Li Daming, Ren Wei, Jeffrey Altman, stable
In-Reply-To: <20260618134802.2477777-1-dhowells@redhat.com>
When an afs network namespace is torn down, it cancels and waits for the
work item that keeps the preallocated rxrpc call/conn/peer queue charged
before disabling incoming (i.e. listen 0), but there's a small window in
which it can be requeued by an incoming call wending through the I/O
thread.
Fix this by flushing the workqueue on which the charger runs after reducing
the listen backlog to zero.
Fixes: 47694fbc9d24 ("afs: Fix netns teardown to cancel the preallocation charger")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
cc: Li Daming <d4n.for.sec@gmail.com>
cc: Ren Wei <n05ec@lzu.edu.cn>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
fs/afs/rxrpc.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index d5cfd24e815b..6714a189d58f 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -128,8 +128,13 @@ void afs_close_socket(struct afs_net *net)
_enter("");
cancel_work_sync(&net->charge_preallocation_work);
+ /* Future work items should now see ->live is false. */
+
kernel_listen(net->socket, 0);
+
+ /* Make sure work items are no longer running. */
flush_workqueue(afs_async_calls);
+ cancel_work_sync(&net->charge_preallocation_work);
if (net->spare_incoming_call) {
afs_put_call(net->spare_incoming_call);
* [PATCH net v2 05/10] afs: Fix uncancelled rxrpc OOB message handler
From: David Howells @ 2026-06-18 13:47 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
Li Daming, Ren Wei, Jeffrey Altman, stable
In-Reply-To: <20260618134802.2477777-1-dhowells@redhat.com>
Fix AFS to cancel its OOB message processing (typically to respond to
security challenges). Also move OOB message processing to afs_wq so that
it's also waited for and make the OOB handler just return if the net
namespace is no longer live.
Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Li Daming <d4n.for.sec@gmail.com>
cc: Ren Wei <n05ec@lzu.edu.cn>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
fs/afs/cm_security.c | 3 ++-
fs/afs/rxrpc.c | 5 ++++-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/afs/cm_security.c b/fs/afs/cm_security.c
index edcbd249d202..103168c70dd4 100644
--- a/fs/afs/cm_security.c
+++ b/fs/afs/cm_security.c
@@ -101,7 +101,8 @@ void afs_process_oob_queue(struct work_struct *work)
struct sk_buff *oob;
enum rxrpc_oob_type type;
- while ((oob = rxrpc_kernel_dequeue_oob(net->socket, &type))) {
+ while (READ_ONCE(net->live) &&
+ (oob = rxrpc_kernel_dequeue_oob(net->socket, &type))) {
switch (type) {
case RXRPC_OOB_CHALLENGE:
afs_respond_to_challenge(oob);
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 6714a189d58f..e8af2a661440 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -128,6 +128,7 @@ void afs_close_socket(struct afs_net *net)
_enter("");
cancel_work_sync(&net->charge_preallocation_work);
+ cancel_work_sync(&net->rx_oob_work);
/* Future work items should now see ->live is false. */
kernel_listen(net->socket, 0);
@@ -148,6 +149,7 @@ void afs_close_socket(struct afs_net *net)
kernel_sock_shutdown(net->socket, SHUT_RDWR);
flush_workqueue(afs_async_calls);
+ cancel_work_sync(&net->rx_oob_work);
net->socket->sk->sk_user_data = NULL;
sock_release(net->socket);
key_put(net->fs_cm_token_key);
@@ -989,5 +991,6 @@ static void afs_rx_notify_oob(struct sock *sk, struct sk_buff *oob)
{
struct afs_net *net = sk->sk_user_data;
- schedule_work(&net->rx_oob_work);
+ if (net->live)
+ queue_work(afs_wq, &net->rx_oob_work);
}
* [PATCH net v2 03/10] rxrpc: Fix double unlock in rxrpc_recvmsg()
From: David Howells @ 2026-06-18 13:47 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
stable
In-Reply-To: <20260618134802.2477777-1-dhowells@redhat.com>
Fix a double unlock in rxrpc_recvmsg() when dealing with OOB messages.
Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
net/rxrpc/recvmsg.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 82614cbdb60f..39a03684432d 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -471,7 +471,7 @@ int rxrpc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
release_sock(&rx->sk);
if (ret == -EAGAIN)
goto try_again;
- goto error_no_call;
+ goto error_trace;
}
/* Find the next call and dequeue it if we're not just peeking. If we
* [PATCH net v2 02/10] rxrpc: Fix leak of connection from OOB challenge
From: David Howells @ 2026-06-18 13:47 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
stable
In-Reply-To: <20260618134802.2477777-1-dhowells@redhat.com>
Fix leak of connection object from OOB challenge queue when response is
provided by userspace.
Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
net/rxrpc/oob.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/rxrpc/oob.c b/net/rxrpc/oob.c
index 05ca9c1faa57..3318c8bd82ad 100644
--- a/net/rxrpc/oob.c
+++ b/net/rxrpc/oob.c
@@ -210,6 +210,11 @@ static int rxrpc_respond_to_oob(struct rxrpc_sock *rx,
break;
}
+ switch (skb->mark) {
+ case RXRPC_OOB_CHALLENGE:
+ rxrpc_put_connection(sp->chall.conn, rxrpc_conn_put_oob);
+ break;
+ }
rxrpc_free_skb(skb, rxrpc_skb_put_oob);
return ret;
}
* [PATCH net v2 01/10] rxrpc: input: reject ACKALL outside transmit phase
From: David Howells @ 2026-06-18 13:47 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
Wyatt Feng, stable, Yuan Tan, Yifan Wu, Juefei Pu,
Zhengchuan Liang, Xin Liu, Ren Wei
In-Reply-To: <20260618134802.2477777-1-dhowells@redhat.com>
From: Wyatt Feng <bronzed_45_vested@icloud.com>
rxrpc_input_ackall() accepts ACKALL packets without checking whether
the call is in a state that can legitimately have outstanding transmit
buffers. A forged ACKALL can therefore reach a new service call in
RXRPC_CALL_SERVER_RECV_REQUEST before any reply packets have been
queued.
In that state call->tx_top is zero and call->tx_queue is NULL, so
rxrpc_rotate_tx_window() dereferences a NULL txqueue and triggers a
null-pointer dereference.
Fix rxrpc_input_ackall() to mirror the transmit-state gating already
used for normal ACK processing, and ignore ACKALL when there is no
outstanding transmit window to rotate.
Fixes: b341a0263b1b ("rxrpc: Implement progressive transmission queue struct")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:GPT-5.4
Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
---
net/rxrpc/input.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index ce761466b02d..37881dffa898 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -1214,8 +1214,22 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb)
static void rxrpc_input_ackall(struct rxrpc_call *call, struct sk_buff *skb)
{
struct rxrpc_ack_summary summary = { 0 };
+ rxrpc_seq_t top = READ_ONCE(call->tx_top);
+
+ switch (__rxrpc_call_state(call)) {
+ case RXRPC_CALL_CLIENT_SEND_REQUEST:
+ case RXRPC_CALL_CLIENT_AWAIT_REPLY:
+ case RXRPC_CALL_SERVER_SEND_REPLY:
+ case RXRPC_CALL_SERVER_AWAIT_ACK:
+ break;
+ default:
+ return;
+ }
+
+ if (call->tx_bottom == top)
+ return;
- if (rxrpc_rotate_tx_window(call, call->tx_top, &summary))
+ if (rxrpc_rotate_tx_window(call, top, &summary))
rxrpc_end_tx_phase(call, false, rxrpc_eproto_unexpected_ackall);
}
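The gating described in the commit message can be sketched as a standalone predicate: only states that can have transmit buffers outstanding are accepted, and an empty window is ignored. The enum, `struct call` and `ackall_may_rotate()` below are simplified, hypothetical stand-ins and do not reproduce the rxrpc definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical call states for illustration only. */
enum call_state {
	CLIENT_SEND_REQUEST,
	CLIENT_AWAIT_REPLY,
	CLIENT_RECV_REPLY,
	SERVER_RECV_REQUEST,
	SERVER_SEND_REPLY,
	SERVER_AWAIT_ACK,
	CALL_COMPLETE,
};

struct call {
	enum call_state state;
	uint32_t tx_bottom;	/* last sequence number already rotated out */
	uint32_t tx_top;	/* highest sequence number queued for Tx */
};

/* True if an ACKALL may be allowed to rotate the transmit window. */
static bool ackall_may_rotate(const struct call *call)
{
	switch (call->state) {
	case CLIENT_SEND_REQUEST:
	case CLIENT_AWAIT_REPLY:
	case SERVER_SEND_REPLY:
	case SERVER_AWAIT_ACK:
		break;		/* transmit or immediate post-transmit phase */
	default:
		return false;	/* no transmit queue can exist here */
	}

	/* Nothing outstanding means there is nothing to rotate. */
	return call->tx_bottom != call->tx_top;
}
```

A new service call that is still receiving its request fails the state check, so a forged ACKALL never reaches the code that walks the transmit queue.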
* [PATCH net v2 00/10] rxrpc: Miscellaneous fixes
From: David Howells @ 2026-06-18 13:47 UTC (permalink / raw)
To: netdev
Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel
Here are some miscellaneous AF_RXRPC fixes for more stuff found by Sashiko[1][2]:
(1) Reject ACKALL packets for calls not in Tx or immediate post-Tx state.
(2) Fix connection leak from AF_RXRPC recvmsg userspace OOB handling.
(3) Fix double unlock in AF_RXRPC recvmsg userspace OOB handling.
(4) Fix AFS preallocate charge to flush the workqueue after unlistening
the socket so that any charging thread that does manage to get started
will be waited for before socket destruction.
(5) Fix AFS OOB notify handling to cancel in-progress OOB notification
handling and then to flush the workqueue it's on.
(6) Fix handling of apparent reply reception before initial transmission
starts in client call.
(7) Fix OOB challenge leak in cleanup on notification failure.
(8) Fix infinite loop in recvmsg if OOB packet available, but no calls.
(9) Fix notify vs recvmsg race where notify thinks the call is already
queued.
(10) Fix MSG_PEEK call leak for calls with no content.
David
The patches can be found here also:
http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-fixes
[1] https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
[2] https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com
Changes
=======
ver #2)
- Addressed the Sashiko review[2] of ver #1.
- Added patches to fix more bugs that it found.
- Adjusted AFS preallocate charge cleanup to only cancel the preallocate
work item after unlistening rather than flushing the entire workqueue
(which may be waiting on DNS lookup).
David Howells (9):
rxrpc: Fix leak of connection from OOB challenge
rxrpc: Fix double unlock in rxrpc_recvmsg()
afs: Fix further netns teardown to cancel the preallocation charger
afs: Fix uncancelled rxrpc OOB message handler
rxrpc: Fix the reception of a reply packet before data transmission
rxrpc: Fix oob challenge leak in cleanup after notification failure
rxrpc: Fix potential infinite loop in rxrpc_recvmsg()
rxrpc: Fix socket notification race
rxrpc: Fix leak of released call in recvmsg(MSG_PEEK)
Wyatt Feng (1):
rxrpc: input: reject ACKALL outside transmit phase
fs/afs/cm_security.c | 3 ++-
fs/afs/rxrpc.c | 10 +++++++++-
net/rxrpc/ar-internal.h | 4 ++--
net/rxrpc/conn_event.c | 9 +++++++--
net/rxrpc/input.c | 29 +++++++++++++++++++++++++----
net/rxrpc/oob.c | 12 ++++++++++--
net/rxrpc/recvmsg.c | 10 ++++------
7 files changed, 59 insertions(+), 18 deletions(-)
* building ynl afaics requires updating the UAPI headers first
From: Thorsten Leemhuis @ 2026-06-18 13:39 UTC (permalink / raw)
To: Jakub Kicinski, Donald Hunter; +Cc: netdev, Riana Tauro
Hi Jakub, Donald! During the past few weeks I ran into errors two or
three times when building ynl as part of my daily -mainline/-next builds,
which are based on an srpm pretty close to the kernel srpm used in Fedora
rawhide. Today, for example, I ran into this:
"""
> [...]
> GEN_RST team.rst
> GEN_RST wireguard.rst
> CC binder-user.o
> CC dev-energymodel-user.o
> CC devlink-user.o
> CC dpll-user.o
> CC drm_ras-user.o
> drm_ras-user.c:19:10: error: ‘DRM_RAS_CMD_CLEAR_ERROR_COUNTER’ undeclared here (not in a function); did you mean ‘DRM_RAS_CMD_GET_ERROR_COUNTER’?
> 19 | [DRM_RAS_CMD_CLEAR_ERROR_COUNTER] = "clear-error-counter",
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> | DRM_RAS_CMD_GET_ERROR_COUNTER
> drm_ras-user.c:19:10: error: array index in initializer not of integer type
> drm_ras-user.c:19:10: note: (near initialization for ‘drm_ras_op_strmap’)
> make[1]: *** [Makefile:52: drm_ras-user.o] Error 1
> make[1]: *** Waiting for unfinished jobs....
> make[1]: Leaving directory '/home/kbuilder/ark-vanilla/linux-knurd42/tools/net/ynl/generated'
> make: *** [Makefile:28: generated] Error 2
> make: Leaving directory '/home/kbuilder/ark-vanilla/linux-knurd42/tools/net/ynl
"""
DRM_RAS_CMD_CLEAR_ERROR_COUNTER was introduced to mainline yesterday as
ee18d39a087792 ("drm/drm_ras: Add clear-error-counter netlink command to
drm_ras") [v7.1-post].
I finally looked closer today and noticed how to prevent this: update
the kernel's UAPI headers (e.g. the stuff that lives in /usr/include/) on
the builder. Thing is: that's basically impossible to do from an srpm, as
an srpm build should not change the build environment and cannot do so
anyway when run as non-root.
Not sure if this is relevant, and it is just a shot in the dark, so maybe
ignore the following:
While investigating this I noticed this comment in
tools/net/ynl/Makefile.deps:
"""
> # Try to include uAPI headers from the kernel uapi/ path.
> # Most code under tools/ requires the respective kernel uAPI headers
> # to be copied to tools/include. The duplication is annoying.
> # All the family headers should be self-contained. We avoid the copying
> # by selectively including just the uAPI header of the family directly
> # from the kernel sources.
"""
Is that maybe not the case anymore with the recent changes to ynl?
Ciao, Thorsten