* [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC
To: netdev
Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
kerneljasonxing, bjorn, Maciej Fijalkowski
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>
The too-many-frags test describes a packet that is valid from the Tx
ring ownership point of view, but invalid for transmission because it
exceeds the supported number of fragments.
Keep the generated Tx descriptors valid so that __send_pkts() accounts
them as outstanding descriptors that must be reclaimed through the CQ.
Then mark the corresponding Rx packet invalid so the test still does
not expect the oversized packet to appear on the receive side.
Add a valid synchronization packet after the oversized packet so the
test can verify that the Tx path drains the bad packet and resumes at
the next packet boundary.
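
The tail of the descriptor layout described above can be sketched in plain C. This is only an illustration, not the selftest code: `struct pkt_model`, `build_oversized_tail()` and the `MODEL_*` constants are hypothetical stand-ins, and only the index arithmetic is taken from the diff in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PKT_CONTD    (1u << 0)	/* stand-in for XDP_PKT_CONTD */
#define MODEL_MIN_PKT_SIZE 64u		/* stand-in for MIN_PKT_SIZE */

struct pkt_model {
	uint32_t len;
	uint32_t options;
	bool valid;
};

/* Fill the tail of a zeroed array of 2 * max_frags + 3 entries: an oversized
 * packet of max_frags + 1 fragments, then one single-buffer packet used for
 * synchronization. Entries before the tail are left untouched. Returns the
 * total number of entries the stream uses.
 */
static uint32_t build_oversized_tail(struct pkt_model *pkts, uint32_t max_frags)
{
	uint32_t i;

	for (i = max_frags + 1; i < 2 * max_frags + 2; i++) {
		pkts[i].len = MODEL_MIN_PKT_SIZE;
		pkts[i].options = MODEL_PKT_CONTD;
		pkts[i].valid = true;	/* valid for Tx ring accounting */
	}
	/* The last fragment ends the packet, so it has no continuation flag. */
	pkts[2 * max_frags + 1].options = 0;

	/* Valid packet for synchronization. */
	pkts[2 * max_frags + 2].len = MODEL_MIN_PKT_SIZE;
	pkts[2 * max_frags + 2].valid = true;

	return 2 * max_frags + 3;
}
```

With a hypothetical max_frags of 4, entries 5 through 9 form the five-fragment oversized packet and entry 10 is the synchronization packet, for 11 entries in total.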
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
.../selftests/bpf/prog_tests/test_xsk.c | 20 +++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 72875071d4f1..de1e63c3fdf6 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -2258,7 +2258,7 @@ int testapp_too_many_frags(struct test_spec *test)
max_frags += 1;
}
- pkts = calloc(2 * max_frags + 2, sizeof(struct pkt));
+ pkts = calloc(2 * max_frags + 3, sizeof(struct pkt));
if (!pkts)
return TEST_FAILURE;
@@ -2279,21 +2279,29 @@ int testapp_too_many_frags(struct test_spec *test)
/* An invalid packet with the max amount of frags but signals packet
* continues on the last frag
*/
- for (i = max_frags + 1; i < 2 * max_frags + 1; i++) {
+ for (i = max_frags + 1; i < 2 * max_frags + 2; i++) {
pkts[i].len = MIN_PKT_SIZE;
pkts[i].options = XDP_PKT_CONTD;
- pkts[i].valid = false;
+ pkts[i].valid = true;
}
+ pkts[2 * max_frags + 1].options = 0;
/* Valid packet for synch */
- pkts[2 * max_frags + 1].len = MIN_PKT_SIZE;
- pkts[2 * max_frags + 1].valid = true;
+ pkts[2 * max_frags + 2].len = MIN_PKT_SIZE;
+ pkts[2 * max_frags + 2].valid = true;
- if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 2)) {
+ if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 3)) {
free(pkts);
return TEST_FAILURE;
}
+ /* The generated Tx stream must keep the too-big packet valid so that
+ * __send_pkts() accounts its descriptors in outstanding_tx. The Rx
+ * stream, however, must not expect this packet on the wire.
+ */
+ test->ifobj_rx->xsk->pkt_stream->pkts[2].valid = false;
+ test->ifobj_rx->xsk->pkt_stream->nb_valid_entries--;
+
ret = testapp_validate_traffic(test);
free(pkts);
return ret;
--
2.43.0
* [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC
To: netdev
Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
kerneljasonxing, bjorn, Maciej Fijalkowski
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>
Currently, the zero-copy Tx batching path stops when it encounters an
invalid descriptor. For multi-buffer packets this can leave descriptors
consumed from the Tx ring without returning their buffers to userspace
through the completion ring.
Handle invalid multi-buffer packets as a packet-sized unit. Keep
descriptors that are valid for transmission separate from descriptors
that are consumed only because they belong to an invalid multi-buffer
packet. The former are returned to the driver as Tx work, while the
latter are written to the CQ address area so they can be reclaimed by
userspace.
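
The split between Tx work and reclaim-only descriptors can be modelled in userspace C. The sketch below follows the control flow of the reworked xskq_cons_read_desc_batch() in this patch, but it is a simplified model under stated assumptions: `struct desc_model`, `struct batch_model` and `parse_batch_model()` are hypothetical names, validation is reduced to a boolean, and no ring indices or memory barriers are involved.

```c
#include <stdbool.h>
#include <stdint.h>

struct desc_model {
	bool valid;	/* descriptor passed validation */
	bool mb;	/* continuation flag: more fragments follow */
};

struct batch_model {
	uint32_t tx_descs;	/* handed to the driver as Tx work */
	uint32_t reclaim_descs;	/* consumed only to be returned via the CQ */
	uint32_t consumed;	/* descriptors released from the Tx ring */
	bool drain_cont;	/* invalid packet still lacks its last fragment */
};

static struct batch_model parse_batch_model(const struct desc_model *descs,
					    uint32_t n, uint32_t max_segs,
					    bool drain)
{
	struct batch_model b = { .drain_cont = drain };
	uint32_t nr_frags = 0, i = 0;

	while (i < n) {
		const struct desc_model *d = &descs[i++];

		if (!d->valid) {
			/* A standalone invalid descriptor just stops the batch. */
			if (!drain && !nr_frags && !d->mb)
				break;
			drain = true;
		}
		nr_frags++;

		if (!d->mb) {
			if (drain) {
				/* End of an invalid packet: reclaim all of it. */
				b.reclaim_descs = nr_frags;
				b.drain_cont = false;
				nr_frags = 0;
				break;
			}
			b.tx_descs += nr_frags;
			nr_frags = 0;
			continue;
		}
		if (nr_frags == max_segs)
			drain = true;	/* too many fragments */
	}

	if (nr_frags) {
		if (drain) {
			/* Packet end not supplied yet: reclaim what is here. */
			b.reclaim_descs = nr_frags;
			b.drain_cont = true;
		} else {
			i -= nr_frags;	/* rewind an incomplete valid packet */
		}
	}
	b.consumed = i;
	return b;
}
```

In this model, a single-buffer packet followed by a four-fragment packet with max_segs set to 3 yields one Tx descriptor and four reclaim-only descriptors, and parsing stops at the packet boundary so the next batch starts on a fresh packet.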
The batched path can retain drain state when the producer has not yet
supplied the end of an invalid packet. Do not allow a second Tx socket to
join the pool while such state exists. Gate the batched data path while a
same-pool bind waits for pre-existing readers, then either add the new
socket or fail the bind with -EAGAIN. This guarantees that drain state is
handled only by the singular batched path and avoids teaching the shared
UMEM fallback path about multi-buffer packet draining.
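
The gating protocol can be illustrated with C11 atomics. This is a single-threaded sketch under stated assumptions, not the kernel code: `struct pool_model` and the `*_model` functions are hypothetical, the spinlock is omitted, `wait_for_readers_model()` is an empty stand-in for synchronize_net(), and the finish step also adds the socket, which the kernel does separately in xp_add_xsk().

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pool_model {
	atomic_bool tx_share_pending;	/* gate for the batched Tx path */
	bool drain_cont;		/* drain state of the existing Tx socket */
	int nr_tx_socks;
};

static void pool_model_init(struct pool_model *pool, bool drain_cont)
{
	atomic_init(&pool->tx_share_pending, false);
	pool->drain_cont = drain_cont;
	pool->nr_tx_socks = 1;
}

/* Data path: may the singular batched path run right now? */
static bool batched_path_allowed(struct pool_model *pool)
{
	if (atomic_load_explicit(&pool->tx_share_pending, memory_order_acquire))
		return false;
	return pool->nr_tx_socks == 1;
}

/* Stand-in for waiting until pre-existing readers have finished. */
static void wait_for_readers_model(void)
{
}

/* Bind path, step 1: arm the gate, wait, then check for drain state. */
static int prepare_tx_share_model(struct pool_model *pool, bool *pending)
{
	*pending = false;
	if (pool->nr_tx_socks != 1)
		return 0;
	if (atomic_load_explicit(&pool->tx_share_pending, memory_order_relaxed))
		return -EAGAIN;

	atomic_store_explicit(&pool->tx_share_pending, true,
			      memory_order_release);
	*pending = true;
	wait_for_readers_model();

	if (pool->drain_cont) {
		/* No socket was added; reopen the data path. */
		atomic_store_explicit(&pool->tx_share_pending, false,
				      memory_order_release);
		*pending = false;
		return -EAGAIN;
	}
	return 0;
}

/* Bind path, step 2: add the socket, then reopen the data path. */
static void finish_tx_share_model(struct pool_model *pool)
{
	pool->nr_tx_socks++;
	atomic_store_explicit(&pool->tx_share_pending, false,
			      memory_order_release);
}
```

The point of the ordering is that the gate is armed before the wait, so no new batched reader can create drain state while the bind path decides whether the pool may become shared.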
The reclaim-only descriptors must not be submitted to the completion
ring immediately when they follow real Tx descriptors in the same batch.
Drivers may complete only part of the Tx work returned by
xsk_tx_peek_release_desc_batch(), and publishing the reclaim descriptors
too early would also publish earlier real Tx descriptors that the driver
has not completed yet.
Track the number of driver-visible Tx descriptors that precede pending
reclaim descriptors. xsk_tx_completed() first advances through the real
Tx completions and submits the reclaim descriptors only after all earlier
Tx descriptors in the CQ address order have been completed. If a batch
contains only reclaim descriptors, complete them immediately because
there is no driver-visible Tx work in front of them.
This preserves CQ ordering while ensuring that every descriptor consumed
as part of an invalid multi-buffer packet is eventually returned to
userspace.
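
The deferred publication can be checked with a small counter model. It mirrors the arithmetic added to xsk_tx_completed() in this patch, with the completion ring reduced to a single counter; `struct cq_model` and `model_tx_completed()` are hypothetical names used only for this sketch.

```c
#include <stdint.h>

struct cq_model {
	uint32_t submitted;		/* entries published to userspace */
	uint32_t tx_zc_pending_descs;	/* real Tx descs ahead of the reclaim run */
	uint32_t reclaim_descs;		/* reclaim-only descs waiting behind them */
};

static void model_tx_completed(struct cq_model *cq, uint32_t nb_entries)
{
	if (cq->reclaim_descs) {
		if (nb_entries < cq->tx_zc_pending_descs) {
			/* Earlier Tx work is still in flight: publish only
			 * what the driver completed.
			 */
			cq->tx_zc_pending_descs -= nb_entries;
			cq->submitted += nb_entries;
			return;
		}
		/* All earlier Tx descs are done: publish the reclaim run too. */
		cq->tx_zc_pending_descs = 0;
		nb_entries += cq->reclaim_descs;
		cq->reclaim_descs = 0;
	}
	cq->submitted += nb_entries;
}
```

For example, with four pending Tx descriptors and three reclaim-only descriptors, a first completion of two entries publishes two, and a second completion of two publishes the remaining two plus the three reclaim-only entries. A batch holding only reclaim-only descriptors is published by a call with zero entries.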
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
include/net/xsk_buff_pool.h | 6 ++++
net/xdp/xsk.c | 62 +++++++++++++++++++++++++++++++---
net/xdp/xsk_buff_pool.c | 66 +++++++++++++++++++++++++++++++++++++
net/xdp/xsk_queue.h | 66 +++++++++++++++++++++++++++----------
4 files changed, 177 insertions(+), 23 deletions(-)
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index ccb3b350001f..4e5abacfcbb7 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -78,9 +78,12 @@ struct xsk_buff_pool {
u32 chunk_size;
u32 chunk_shift;
u32 frame_len;
+ u32 reclaim_descs;
+ u32 tx_zc_pending_descs;
u32 xdp_zc_max_segs;
u8 tx_metadata_len; /* inherited from umem */
u8 cached_need_wakeup;
+ bool tx_share_pending;
bool uses_need_wakeup;
bool unaligned;
bool tx_sw_csum;
@@ -113,6 +116,9 @@ void xp_get_pool(struct xsk_buff_pool *pool);
bool xp_put_pool(struct xsk_buff_pool *pool);
void xp_clear_dev(struct xsk_buff_pool *pool);
void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
+int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
+ bool *pending);
+void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool);
void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
/* AF_XDP, and XDP core. */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 43791647cf18..2dda854c6590 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -499,6 +499,18 @@ void __xsk_map_flush(struct list_head *flush_list)
void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
{
+ if (unlikely(pool->reclaim_descs)) {
+ if (nb_entries < pool->tx_zc_pending_descs) {
+ pool->tx_zc_pending_descs -= nb_entries;
+ xskq_prod_submit_n(pool->cq, nb_entries);
+ return;
+ }
+
+ pool->tx_zc_pending_descs = 0;
+ nb_entries += pool->reclaim_descs;
+ pool->reclaim_descs = 0;
+ }
+
xskq_prod_submit_n(pool->cq, nb_entries);
}
EXPORT_SYMBOL(xsk_tx_completed);
@@ -576,9 +588,20 @@ static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entr
u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
{
+ struct xsk_tx_batch batch = {};
struct xdp_sock *xs;
+ u32 cq_cached_prod;
rcu_read_lock();
+
+ /* Pairs with the release stores in xp_prepare_xsk_tx_share() and
+ * xp_finish_xsk_tx_share(). If bind is converting a singular Tx pool
+ * to shared, do not enter the singular batched path.
+ */
+ if (smp_load_acquire(&pool->tx_share_pending))
+ goto out;
+ if (unlikely(pool->reclaim_descs))
+ goto out;
if (!list_is_singular(&pool->xsk_tx_list)) {
/* Fallback to the non-batched version */
rcu_read_unlock();
@@ -586,10 +609,8 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
}
xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
- if (!xs) {
- nb_pkts = 0;
+ if (!xs)
goto out;
- }
nb_pkts = xskq_cons_nb_entries(xs->tx, nb_pkts);
@@ -603,19 +624,38 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
if (!nb_pkts)
goto out;
- nb_pkts = xskq_cons_read_desc_batch(xs->tx, pool, nb_pkts);
+ batch = xskq_cons_read_desc_batch(xs, pool, nb_pkts);
+ nb_pkts = xsk_tx_batch_cq_descs(&batch);
if (!nb_pkts) {
xs->tx->queue_empty_descs++;
goto out;
}
__xskq_cons_release(xs->tx);
+ cq_cached_prod = pool->cq->cached_prod;
+
xskq_prod_write_addr_batch(pool->cq, pool->tx_descs, nb_pkts);
+
+ if (unlikely(batch.reclaim_descs)) {
+ u32 cq_pending_descs;
+
+ /* CQ is positional. Descriptors already written but not
+ * submitted must complete before any reclaim-only descriptors
+ * appended below.
+ */
+ cq_pending_descs = cq_cached_prod - xskq_get_prod(pool->cq);
+
+ pool->tx_zc_pending_descs = batch.tx_descs + cq_pending_descs;
+ pool->reclaim_descs = batch.reclaim_descs;
+ if (unlikely(!pool->tx_zc_pending_descs))
+ xsk_tx_completed(pool, 0);
+ }
+
xs->sk.sk_write_space(&xs->sk);
out:
rcu_read_unlock();
- return nb_pkts;
+ return batch.tx_descs;
}
EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
@@ -1442,6 +1482,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
struct sock *sk = sock->sk;
struct xdp_sock *xs = xdp_sk(sk);
+ bool tx_share_pending = false;
struct net_device *dev;
int bound_dev_if;
u32 flags, qid;
@@ -1549,6 +1590,13 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
goto out_unlock;
}
+ err = xp_prepare_xsk_tx_share(umem_xs->pool, xs,
+ &tx_share_pending);
+ if (err) {
+ sockfd_put(sock);
+ goto out_unlock;
+ }
+
xp_get_pool(umem_xs->pool);
xs->pool = umem_xs->pool;
@@ -1559,6 +1607,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
if (xs->tx && !xs->pool->tx_descs) {
err = xp_alloc_tx_descs(xs->pool, xs);
if (err) {
+ if (tx_share_pending)
+ xp_finish_xsk_tx_share(xs->pool);
xp_put_pool(xs->pool);
xs->pool = NULL;
sockfd_put(sock);
@@ -1598,6 +1648,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
xs->sg = !!(xs->umem->flags & XDP_UMEM_SG_FLAG);
xs->queue_id = qid;
xp_add_xsk(xs->pool, xs);
+ if (tx_share_pending)
+ xp_finish_xsk_tx_share(xs->pool);
if (qid < dev->real_num_rx_queues) {
struct netdev_rx_queue *rxq;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 1f28a9641571..6fa732a843a9 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -22,6 +22,72 @@ void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
spin_unlock(&pool->xsk_tx_list_lock);
}
+int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
+ bool *pending)
+{
+ struct xdp_sock *tmp;
+ int err = 0;
+
+ *pending = false;
+ if (!xs->tx)
+ return 0;
+
+ spin_lock(&pool->xsk_tx_list_lock);
+ if (!list_is_singular(&pool->xsk_tx_list)) {
+ spin_unlock(&pool->xsk_tx_list_lock);
+ return 0;
+ }
+
+ if (pool->tx_share_pending) {
+ spin_unlock(&pool->xsk_tx_list_lock);
+ return -EAGAIN;
+ }
+
+ /* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+ * Stop new singular batched Tx readers before synchronize_net()
+ * waits for readers that may already have observed a singular list.
+ */
+ smp_store_release(&pool->tx_share_pending, true);
+ *pending = true;
+ spin_unlock(&pool->xsk_tx_list_lock);
+
+ /* A batch that observed a singular Tx socket list before the gate was
+ * armed may set drain_cont. Wait for all such readers before checking
+ * whether the pool can safely become shared.
+ */
+ synchronize_net();
+
+ spin_lock(&pool->xsk_tx_list_lock);
+ list_for_each_entry(tmp, &pool->xsk_tx_list, tx_list) {
+ if (READ_ONCE(tmp->drain_cont)) {
+ err = -EAGAIN;
+ break;
+ }
+ }
+
+ if (err) {
+ /* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+ * No socket was added; clear the gate so Tx can resume.
+ */
+ smp_store_release(&pool->tx_share_pending, false);
+ *pending = false;
+ }
+ spin_unlock(&pool->xsk_tx_list_lock);
+
+ return err;
+}
+
+void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool)
+{
+ spin_lock(&pool->xsk_tx_list_lock);
+ /* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+ * Publish the preceding xp_add_xsk() list update before allowing Tx
+ * to observe that the share transition has finished.
+ */
+ smp_store_release(&pool->tx_share_pending, false);
+ spin_unlock(&pool->xsk_tx_list_lock);
+}
+
void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
{
if (!xs->tx)
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 3e3fbb73d23e..99fa62e0d337 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -58,6 +58,16 @@ struct parsed_desc {
u32 valid;
};
+struct xsk_tx_batch {
+ u32 tx_descs;
+ u32 reclaim_descs;
+};
+
+static inline u32 xsk_tx_batch_cq_descs(const struct xsk_tx_batch *batch)
+{
+ return batch->tx_descs + batch->reclaim_descs;
+}
+
/* The structure of the shared state of the rings are a simple
* circular buffer, as outlined in
* Documentation/core-api/circular-buffers.rst. For the Rx and
@@ -263,17 +273,19 @@ static inline void parse_desc(struct xsk_queue *q, struct xsk_buff_pool *pool,
parsed->mb = xp_mb_desc(desc);
}
-static inline
-u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
- u32 max)
+static inline struct xsk_tx_batch
+xskq_cons_read_desc_batch(struct xdp_sock *xs, struct xsk_buff_pool *pool,
+ u32 max)
{
- u32 cached_cons = q->cached_cons, nb_entries = 0;
struct xdp_desc *descs = pool->tx_descs;
- u32 total_descs = 0, nr_frags = 0;
+ bool drain = READ_ONCE(xs->drain_cont);
+ u32 cached_cons, nb_entries = 0;
+ struct xsk_tx_batch batch = {};
+ struct xsk_queue *q = xs->tx;
+ u32 nr_frags = 0;
+
+ cached_cons = q->cached_cons;
- /* track first entry, if stumble upon *any* invalid descriptor, rewind
- * current packet that consists of frags and stop the processing
- */
while (cached_cons != q->cached_prod && nb_entries < max) {
struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
u32 idx = cached_cons & q->ring_mask;
@@ -282,26 +294,44 @@ u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
descs[nb_entries] = ring->desc[idx];
cached_cons++;
parse_desc(q, pool, &descs[nb_entries], &parsed);
- if (unlikely(!parsed.valid))
- break;
+ if (unlikely(!parsed.valid)) {
+ if (!drain && !nr_frags && !parsed.mb)
+ break;
+
+ drain = true;
+ }
+
+ nr_frags++;
+ nb_entries++;
if (likely(!parsed.mb)) {
- total_descs += (nr_frags + 1);
- nr_frags = 0;
- } else {
- nr_frags++;
- if (nr_frags == pool->xdp_zc_max_segs) {
+ if (unlikely(drain)) {
+ batch.reclaim_descs = nr_frags;
+ WRITE_ONCE(xs->drain_cont, false);
nr_frags = 0;
break;
}
+
+ batch.tx_descs += nr_frags;
+ nr_frags = 0;
+ continue;
}
- nb_entries++;
+
+ if (nr_frags == pool->xdp_zc_max_segs)
+ drain = true;
}
- cached_cons -= nr_frags;
+ if (nr_frags) {
+ if (drain) {
+ batch.reclaim_descs = nr_frags;
+ WRITE_ONCE(xs->drain_cont, true);
+ } else {
+ cached_cons -= nr_frags;
+ }
+ }
/* Release valid plus any invalid entries */
xskq_cons_release_n(q, cached_cons - q->cached_cons);
- return total_descs;
+ return batch;
}
/* Functions for consumers */
--
2.43.0
* [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC
To: netdev
Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
kerneljasonxing, bjorn, Maciej Fijalkowski
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>
After an invalid descriptor is found in __xsk_generic_xmit(),
xskq_cons_peek_desc() returns false and the loop body is not entered.
Jason's drain fixes reclaim descriptors already attached to xs->skb and
later continuation descriptors handled through drain_cont, but the
offending descriptor that made peek fail is only released from the Tx
ring.
This loses one completion for each invalid multi-buffer packet in the
generic path. Userspace then waits forever for a descriptor that has
already been consumed by the kernel.
If the failed descriptor belongs to an already-started or already-draining
multi-buffer packet, publish its address to the completion ring before
releasing it. Standalone invalid descriptors keep the existing behavior.
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
net/xdp/xsk.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index c489fadc3608..43791647cf18 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1125,8 +1125,22 @@ static int __xsk_generic_xmit(struct sock *sk)
}
if (xskq_has_descs(xs->tx)) {
+ bool reclaim_desc = xs->skb || xs->drain_cont;
+
+ if (reclaim_desc) {
+ err = xsk_cq_reserve_locked(xs->pool);
+ if (err) {
+ err = -EAGAIN;
+ goto out;
+ }
+ }
+
if (xs->skb)
xsk_drop_skb(xs->skb);
+
+ if (reclaim_desc)
+ xsk_cq_submit_addr_single_locked(xs->pool, &desc);
+
xskq_cons_release(xs->tx);
xs->drain_cont = xp_mb_desc(&desc);
}
--
2.43.0
* [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit()
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC
To: netdev
Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
kerneljasonxing, bjorn, Jason Xing
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>
From: Jason Xing <kernelxing@tencent.com>
When the TX loop in __xsk_generic_xmit() encounters an invalid
descriptor mid-packet (e.g. an out-of-bounds address), the partial
skb is dropped and the offending descriptor is released. However,
remaining continuation descriptors belonging to the same multi-buffer
packet still sit in the TX ring. Since xs->skb becomes NULL after the
drop, the next iteration treats the leftover continuation fragment as
a brand-new packet, corrupting the packet stream.
Fix this by setting the drain_cont flag when the released descriptor
has XDP_PKT_CONTD set. On the next call to __xsk_generic_xmit(), the
drain logic introduced in the previous patch handles the remaining
fragments with normal CQ backpressure.
There is one subtle case: if a subsequent continuation descriptor also
has an invalid address, xskq_cons_peek_desc() rejects it and the
while loop is never entered, so the in-loop drain path cannot clear
drain_cont. The post-loop code already handles this: it sees
xskq_has_descs() is true (the failed descriptor was read but not
released by peek), releases it, and checks its XDP_PKT_CONTD flag.
Add an else branch so that when the released descriptor is the
last fragment (no XDP_PKT_CONTD), drain_cont is cleared. This
prevents the next valid packet from being incorrectly drained.
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e80c035a7af5..c489fadc3608 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1128,6 +1128,7 @@ static int __xsk_generic_xmit(struct sock *sk)
if (xs->skb)
xsk_drop_skb(xs->skb);
xskq_cons_release(xs->tx);
+ xs->drain_cont = xp_mb_desc(&desc);
}
out:
--
2.43.0
* [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb()
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC
To: netdev
Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
kerneljasonxing, bjorn, Jason Xing, Maciej Fijalkowski
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>
From: Jason Xing <kernelxing@tencent.com>
When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
only the current descriptor is released from the TX ring. The remaining
continuation descriptors of the same packet stay in the ring. Since
xs->skb is set to NULL after the drop, the TX loop picks up these
leftover frags and misinterprets each one as the beginning of a new
packet, corrupting the packet stream.
Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised so
that the remaining descriptors of the overflowed packet can be examined
and handled.
When the last fragment (without XDP_PKT_CONTD) is processed, the flag
is cleared and the loop continues to process subsequent descriptors
with the remaining budget. This behavior follows how the previous xmit
path treated overflow packets.
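
The drain step can be modelled as a short loop. This is a sketch under stated assumptions rather than the kernel code: `struct gen_desc_model` and `drain_cont_model()` are hypothetical, the completion ring is a plain array, and the CQ reservation and budget checks of the real Tx loop are left out.

```c
#include <stdbool.h>
#include <stdint.h>

struct gen_desc_model {
	uint64_t addr;
	bool contd;	/* stand-in for the XDP_PKT_CONTD option */
};

/* Drain leftover fragments of a dropped packet: return each address through
 * the modelled CQ and stop after the fragment that lacks the continuation
 * flag. Returns how many descriptors were consumed from the ring. The caller
 * must provide a cq array with room for n more entries.
 */
static uint32_t drain_cont_model(const struct gen_desc_model *ring, uint32_t n,
				 bool *drain_cont, uint64_t *cq,
				 uint32_t *cq_len)
{
	uint32_t i = 0;

	while (i < n && *drain_cont) {
		cq[(*cq_len)++] = ring[i].addr;
		*drain_cont = ring[i].contd;
		i++;
	}
	return i;
}
```

If the ring holds two continuation fragments, a final fragment and then a new packet, the loop consumes exactly three descriptors and leaves the new packet for normal processing.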
Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # wrapped cq addr submission onto routine
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
include/net/xdp_sock.h | 1 +
net/xdp/xsk.c | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+)
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ebac60a3d8a1..8b51876efbed 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -80,6 +80,7 @@ struct xdp_sock {
* call of __xsk_generic_xmit().
*/
struct sk_buff *skb;
+ bool drain_cont;
struct list_head map_list;
/* Protects map_list */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a7a83dc4546a..e80c035a7af5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -737,6 +737,19 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
spin_unlock_irqrestore(&pool->cq_prod_lock, flags);
}
+static void xsk_cq_submit_addr_single_locked(struct xsk_buff_pool *pool,
+ struct xdp_desc *desc)
+{
+ unsigned long flags;
+ u32 idx;
+
+ spin_lock_irqsave(&pool->cq_prod_lock, flags);
+ idx = xskq_get_prod(pool->cq);
+ xskq_prod_write_addr(pool->cq, idx, desc->addr);
+ xskq_prod_submit_n(pool->cq, 1);
+ spin_unlock_irqrestore(&pool->cq_prod_lock, flags);
+}
+
static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
{
spin_lock(&pool->cq->cq_cached_prod_lock);
@@ -1063,11 +1076,22 @@ static int __xsk_generic_xmit(struct sock *sk)
goto out;
}
+ if (unlikely(xs->drain_cont)) {
+ xsk_cq_submit_addr_single_locked(xs->pool, &desc);
+
+ xs->tx->invalid_descs++;
+ xskq_cons_release(xs->tx);
+ xs->drain_cont = xp_mb_desc(&desc);
+ continue;
+ }
+
skb = xsk_build_skb(xs, &desc);
if (IS_ERR(skb)) {
err = PTR_ERR(skb);
if (err != -EOVERFLOW)
goto out;
+ if (xp_mb_desc(&desc))
+ xs->drain_cont = true;
err = 0;
continue;
}
--
2.43.0
* [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC
To: netdev
Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
kerneljasonxing, bjorn, Jason Xing
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>
From: Jason Xing <kernelxing@tencent.com>
This patch is inspired by the check[1] from sashiko. It says that when
overflow happens, the address of cq to be published is invalid.
Actually, the more severe problem is that the whole process of
publishing the address of cq in this particular case is not right: it
should truly publish the address and advance the cached_prod in cq as
long as it reads descriptors from txq.
The following is the full analysis.
xsk_drop_skb() is called in three places, which all discard a partially
built multi-buffer skb:
1) xsk_build_skb() -EOVERFLOW error path: packet exceeds MAX_SKB_FRAGS
2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
the TX ring prevents the partial packet from completing
3) xsk_release(): socket close while xs->skb holds an incomplete packet
In all three cases, the TX descriptors for the already-processed frags
have been consumed from the TX ring (xskq_cons_release), and CQ slots
have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
the buffer addresses never appear in the completion queue, userspace
permanently loses track of these buffers.
Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
destructor, which already submits buffer addresses to the CQ via
xsk_cq_submit_addr_locked().
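
The difference between cancelling and submitting can be shown with a counter-only model. It is a rough accounting sketch, not the ring implementation: `struct cq_acct_model` and the `*_model` helpers are hypothetical, and the real rings track producer indices rather than plain counts.

```c
#include <stdint.h>

struct cq_acct_model {
	uint32_t cached_prod;	/* CQ slots reserved by the kernel */
	uint32_t prod;		/* CQ slots published to userspace */
};

static void cq_reserve_model(struct cq_acct_model *cq)
{
	cq->cached_prod++;
}

/* Old drop behavior: give the reservations back without publishing. */
static void cq_cancel_model(struct cq_acct_model *cq, uint32_t n)
{
	cq->cached_prod -= n;
}

/* New drop behavior: publish the reserved slots to userspace. */
static void cq_submit_model(struct cq_acct_model *cq, uint32_t n)
{
	cq->prod += n;
}

/* Buffers consumed from the Tx ring that userspace never got back. */
static uint32_t lost_buffers_model(const struct cq_acct_model *cq,
				   uint32_t consumed)
{
	return consumed - cq->prod;
}
```

With three fragments consumed from the Tx ring, cancelling leaves three buffers unaccounted for, while submitting returns all three.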
Note that cancelling the descriptors back to the TX ring (via
xskq_cons_cancel_n) is not an appropriate option because an oversized
packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
which is an obvious deadlock bug in the TX path.
Also move the desc->addr assignment in xsk_build_skb() above the
overflow check so that the current descriptor's address is recorded
before a potential -EOVERFLOW jump to free_err, consistent with the
zerocopy path in xsk_build_skb_zerocopy().
[1]: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b970f30ea9b9..a7a83dc4546a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -794,8 +794,11 @@ static void xsk_consume_skb(struct sk_buff *skb)
static void xsk_drop_skb(struct sk_buff *skb)
{
- xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
- xsk_consume_skb(skb);
+ struct xdp_sock *xs = xdp_sk(skb->sk);
+
+ xs->tx->invalid_descs += xsk_get_num_desc(skb);
+ consume_skb(skb);
+ xs->skb = NULL;
}
static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
@@ -877,7 +880,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
return ERR_PTR(-ENOMEM);
/* in case of -EOVERFLOW that could happen below,
- * xsk_consume_skb() will release this node as whole skb
+ * xsk_drop_skb() will release this node as whole skb
* would be dropped, which implies freeing all list elements
*/
xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
@@ -969,6 +972,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
goto free_err;
}
+ xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
+
if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
err = -EOVERFLOW;
goto free_err;
@@ -986,8 +991,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
-
- xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
}
}
--
2.43.0
* [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC
To: netdev
Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
kerneljasonxing, bjorn, Maciej Fijalkowski
Hi,
This series fixes several AF_XDP multi-buffer Tx paths where descriptors
consumed from the Tx ring are not consistently returned to userspace
through the completion ring when the packet is later dropped as invalid.
The affected cases are invalid or oversized multi-buffer Tx packets in
both the generic and zero-copy paths. In these cases, the kernel can
consume one or more Tx descriptors while building or validating a
multi-buffer packet, then drop the packet before it reaches the device.
Userspace still owns the UMEM buffers only after the corresponding
addresses are returned through the CQ. Missing completions therefore
make userspace lose track of those buffers.
The generic path fixes cover three related cases:
* partially built multi-buffer skbs dropped by xsk_drop_skb();
* continuation descriptors left in the Tx ring after xsk_build_skb()
reports overflow;
* invalid descriptors encountered in the middle of a multi-buffer
packet, including the offending invalid descriptor itself.
The zero-copy path is handled separately. The batched Tx parser now
distinguishes descriptors that can be passed to the driver from
descriptors that are consumed only because they belong to an invalid
multi-buffer packet. Reclaim-only descriptors are written to the CQ
address area and published in completion order, after any earlier
driver-visible Tx descriptors.
The ZC batching path can also retain drain state when userspace has not
yet provided the end of an invalid multi-buffer packet. To keep this
state local to the singular batched path, the series prevents a second
Tx socket from joining the same pool while such drain state exists.
During the singular-to-shared transition, Tx batching is gated,
pre-existing readers are waited out, and bind fails with -EAGAIN if the
existing socket still has pending drain state. This avoids adding
multi-buffer drain handling to the shared-UMEM fallback path.
The last two patches update xskxceiver so the tests account invalid
multi-buffer Tx packets as descriptors that must be reclaimed, while
still not expecting those invalid packets on the Rx side.
This is a follow-up to Jason's changes [0], which addressed generic
xmit only. With this set, the full xskxceiver test suite passes against
the ice driver.
Thanks,
Maciej
[0]: https://lore.kernel.org/netdev/20260520004244.55663-1-kerneljasonxing@gmail.com/
Jason Xing (3):
xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
xsk: drain continuation descs after overflow in xsk_build_skb()
xsk: drain continuation descs on invalid descriptor in
__xsk_generic_xmit()
Maciej Fijalkowski (4):
xsk: reclaim offending invalid desc in generic multi-buffer Tx
xsk: reclaim invalid multi-buffer Tx descs in ZC path
selftests/xsk: fix too-many-frags multi-buffer Tx test
selftests/xsk: account invalid multi-buffer Tx descriptors
include/net/xdp_sock.h | 1 +
include/net/xsk_buff_pool.h | 6 +
net/xdp/xsk.c | 114 ++++++++++++++++--
net/xdp/xsk_buff_pool.c | 66 ++++++++++
net/xdp/xsk_queue.h | 66 +++++++---
.../selftests/bpf/prog_tests/test_xsk.c | 44 ++++---
6 files changed, 254 insertions(+), 43 deletions(-)
--
2.43.0
* Re: [PATCH] net: liquidio: Check soft command allocation in lio_main setup_nic_devices()
From: Breno Leitao @ 2026-06-23 13:31 UTC
To: Haoxiang Li
Cc: andrew+netdev, davem, kuba, pabeni, kory.maincent, zilin, petrm,
u.kleine-koenig, marco.crivellari, vadim.fedorenko,
Aleksey.Makarov, satananda.burla, felix.manlunas, derek.chickles,
rvatsavayi, netdev, linux-kernel, stable
In-Reply-To: <20260623125611.2228149-1-haoxiang_li2024@163.com>
On Tue, Jun 23, 2026 at 08:56:11PM +0800, Haoxiang Li wrote:
> octeon_alloc_soft_command() returns NULL when the soft command buffer
> pool is empty. setup_nic_devices() dereferences the returned pointer
> immediately when preparing the interface configuration command, which
> can lead to a NULL pointer dereference if the pool is exhausted.
>
> Return -ENOMEM when the allocation fails and let the existing NIC init
> failure path handle the error.
>
> Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
> ---
> drivers/net/ethernet/cavium/liquidio/lio_main.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
> index 0db08ac3d098..5077129656e8 100644
> --- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
> +++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
> @@ -3363,6 +3363,9 @@ static int setup_nic_devices(struct octeon_device *octeon_dev)
> sc = (struct octeon_soft_command *)
> octeon_alloc_soft_command(octeon_dev, data_size,
> resp_size, 0);
> + if (!sc)
> + return -ENOMEM;
Is it fine to return here, given that the
octeon_register_reqtype_free_fn() and octeon_register_dispatch_fn()
calls above have already succeeded? Do you need to clean up any of their
side effects?
--breno
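
For readers following the question above, the general concern is that an early return can skip the unwinding of earlier setup steps. The sketch below shows the usual goto-based unwind pattern in plain C. It is not liquidio code: `struct dev_model` and `setup_model()` are hypothetical, and whether setup_nic_devices() actually needs such unwinding is exactly what the question leaves open.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct dev_model {
	bool free_fn_registered;
	bool dispatch_fn_registered;
};

static int setup_model(struct dev_model *dev, bool alloc_fails)
{
	static char cmd_buf[64];	/* stand-in for an allocated command */
	void *sc;
	int err;

	dev->free_fn_registered = true;		/* earlier setup step 1 */
	dev->dispatch_fn_registered = true;	/* earlier setup step 2 */

	sc = alloc_fails ? NULL : cmd_buf;
	if (!sc) {
		err = -ENOMEM;
		goto err_unregister;
	}

	return 0;

err_unregister:
	/* Undo the earlier steps in reverse order before reporting the error. */
	dev->dispatch_fn_registered = false;
	dev->free_fn_registered = false;
	return err;
}
```

If the caller's failure path already undoes both registrations, as the patch description suggests, a plain return is enough and no label is needed.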
* [PATCH] net: liquidio: Check soft command allocation in lio_main setup_nic_devices()
From: Haoxiang Li @ 2026-06-23 12:56 UTC (permalink / raw)
To: andrew+netdev, davem, kuba, pabeni, kory.maincent, zilin, petrm,
u.kleine-koenig, marco.crivellari, vadim.fedorenko,
Aleksey.Makarov, satananda.burla, felix.manlunas, derek.chickles,
rvatsavayi
Cc: netdev, linux-kernel, Haoxiang Li, stable
octeon_alloc_soft_command() returns NULL when the soft command buffer
pool is empty. setup_nic_devices() dereferences the returned pointer
immediately when preparing the interface configuration command, which
can lead to a NULL pointer dereference if the pool is exhausted.
Return -ENOMEM when the allocation fails and let the existing NIC init
failure path handle the error.
Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
drivers/net/ethernet/cavium/liquidio/lio_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 0db08ac3d098..5077129656e8 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3363,6 +3363,9 @@ static int setup_nic_devices(struct octeon_device *octeon_dev)
sc = (struct octeon_soft_command *)
octeon_alloc_soft_command(octeon_dev, data_size,
resp_size, 0);
+ if (!sc)
+ return -ENOMEM;
+
resp = (struct liquidio_if_cfg_resp *)sc->virtrptr;
vdata = (struct lio_version *)sc->virtdptr;
--
2.25.1
* [PATCH] af_unix: move proto info out of CONFIG_BPF_SYSCALL
From: Ben Dooks @ 2026-06-23 12:49 UTC (permalink / raw)
To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, netdev, linux-kernel
Cc: Ben Dooks
These two structs are defined even if CONFIG_BPF_SYSCALL is not set, but
the header does not declare them in that case, so declare them
unconditionally and move the check for CONFIG_BPF_SYSCALL lower in the file.
This removes the two sparse warnings:
net/unix/af_unix.c:1060:14: warning: symbol 'unix_dgram_proto' was not declared. Should it be static?
net/unix/af_unix.c:1071:14: warning: symbol 'unix_stream_proto' was not declared. Should it be static?
This change is less complicated than trying to make those two
structs static based on the CONFIG_BPF_SYSCALL configuration.
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
---
net/unix/af_unix.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/unix/af_unix.h b/net/unix/af_unix.h
index 8119dbeef3a3..2a6a26b3a2db 100644
--- a/net/unix/af_unix.h
+++ b/net/unix/af_unix.h
@@ -55,10 +55,10 @@ static inline void unix_sysctl_unregister(struct net *net)
int __unix_dgram_recvmsg(struct sock *sk, struct msghdr *msg, size_t size, int flags);
int __unix_stream_recvmsg(struct sock *sk, struct msghdr *msg, size_t size, int flags);
-#ifdef CONFIG_BPF_SYSCALL
extern struct proto unix_dgram_proto;
extern struct proto unix_stream_proto;
+#ifdef CONFIG_BPF_SYSCALL
int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
void __init unix_bpf_build_proto(void);
--
2.37.2.352.g3c44437643
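
As a rough illustration of the header change (not part of the patch), the sketch below models a header and its source file in a single translation unit. The names `demo_proto`, `demo_dgram_proto`, `demo_stream_proto`, `demo_bpf_update_proto` and the macro `DEMO_BPF_SYSCALL` are hypothetical stand-ins. The point is only that a symbol defined unconditionally needs an unconditional declaration, while the optional helpers can stay behind the guard.

```c
/* Single-file model of a header plus its .c file; all names are hypothetical. */

/* --- what the header would contain --- */
struct demo_proto {
	const char *name;
};

/* Declared unconditionally, because the definitions below always exist. */
extern struct demo_proto demo_dgram_proto;
extern struct demo_proto demo_stream_proto;

#ifdef DEMO_BPF_SYSCALL
/* Only the optional helpers stay behind the config guard. */
int demo_bpf_update_proto(struct demo_proto *proto);
#endif

/* --- what the .c file would contain --- */
struct demo_proto demo_dgram_proto = { .name = "DEMO-DGRAM" };
struct demo_proto demo_stream_proto = { .name = "DEMO-STREAM" };

#ifdef DEMO_BPF_SYSCALL
int demo_bpf_update_proto(struct demo_proto *proto)
{
	return proto ? 0 : -1;
}
#endif
```

With the declarations outside the guard, a checker that looks for non-static definitions without a prior declaration has nothing to report, whether or not the macro is defined.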
* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Bastien Nocera @ 2026-06-23 12:44 UTC (permalink / raw)
To: Eric Biggers, linux-crypto, Herbert Xu, Marcel Holtmann,
Luiz Augusto von Dentz
Cc: linux-doc, linux-api, linux-kernel, netdev, Linus Torvalds,
linux-bluetooth, ell
In-Reply-To: <20260430011544.31823-1-ebiggers@kernel.org>
Hey,
Replying to this older patch.
On Wed, 2026-04-29 at 18:15 -0700, Eric Biggers wrote:
<snip>
> This isn't intended to change anything overnight. After all, most Linux
> distros won't be able to disable the kconfig options quite yet, mainly
> because of iwd. But this should create a bit more impetus for these
> userspace programs to be fixed, and the documentation update should also
> help prevent more users from appearing.
There are 2 other users that I know of: bluez, and the ell library
(used by iwd and bluez).
From what I could tell, bluetoothd uses AF_ALG for cryptography:
https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/src/shared/crypto.c
https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/tools/mesh-gatt/crypto.c
It uses "ecb(aes)" and "cmac(aes)" as algorithms.
Finally, it also uses them both again:
https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/mesh/crypto.c
through ell:
https://git.kernel.org/pub/scm/libs/ell/ell.git/tree/ell/cipher.c
Because that's a question that also came up, bluetoothd also uses the
CAP_NET_ADMIN capability.
I'll let Luiz and Marcel take it over from here.
Cheers
* Re: [PATCH bpf-next v4 3/3] selftests/bpf: Add bpf_fib_lookup() VLAN flag tests
From: kernel test robot @ 2026-06-23 12:36 UTC (permalink / raw)
To: Avinash Duduskar, ast, daniel, andrii
Cc: llvm, oe-kbuild-all, eddyz87, memxor, martin.lau, song,
yonghong.song, jolsa, emil, john.fastabend, sdf, davem, edumazet,
kuba, pabeni, horms, shuah, hawk, yatsenko, leon.hwang, kpsingh,
a.s.protopopov, ameryhung, rongtao, eyal.birger, bpf, netdev,
linux-kernel, linux-kselftest
In-Reply-To: <20260623025147.1001664-4-avinash.duduskar@gmail.com>
Hi Avinash,
kernel test robot noticed the following build warnings:
[auto build test WARNING on a975094bf98ca97be9146f9d3b5681a6f9cf5ce3]
url: https://github.com/intel-lab-lkp/linux/commits/Avinash-Duduskar/bpf-Add-BPF_FIB_LOOKUP_VLAN-flag-to-bpf_fib_lookup-helper/20260623-105336
base: a975094bf98ca97be9146f9d3b5681a6f9cf5ce3
patch link: https://lore.kernel.org/r/20260623025147.1001664-4-avinash.duduskar%40gmail.com
patch subject: [PATCH bpf-next v4 3/3] selftests/bpf: Add bpf_fib_lookup() VLAN flag tests
config: i386-buildonly-randconfig-006-20260623 (https://download.01.org/0day-ci/archive/20260623/202606232057.G1fSw98N-lkp@intel.com/config)
compiler: clang version 22.1.3 (https://github.com/llvm/llvm-project e9846648fd6183ee6d8cbdb4502213fcf902a211)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260623/202606232057.G1fSw98N-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606232057.G1fSw98N-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> net/core/filter.c:6567:14: warning: unused variable 'net' [-Wunused-variable]
6567 | struct net *net = dev_net(skb->dev);
| ^~~
1 warning generated.
vim +/net +6567 net/core/filter.c
87f5fc7e48dd31 David Ahern 2018-05-09 6563
87f5fc7e48dd31 David Ahern 2018-05-09 6564 BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
87f5fc7e48dd31 David Ahern 2018-05-09 6565 struct bpf_fib_lookup *, params, int, plen, u32, flags)
87f5fc7e48dd31 David Ahern 2018-05-09 6566 {
4f74fede40df8d David Ahern 2018-05-21 @6567 struct net *net = dev_net(skb->dev);
f392520e15094e Avinash Duduskar 2026-06-23 6568 struct net_device *fwd_dev = NULL;
4c79579b44b187 David Ahern 2018-06-26 6569 int rc = -EAFNOSUPPORT;
2c0a10af688c02 Jesper Dangaard Brouer 2021-02-09 6570 bool check_mtu = false;
4f74fede40df8d David Ahern 2018-05-21 6571
87f5fc7e48dd31 David Ahern 2018-05-09 6572 if (plen < sizeof(*params))
87f5fc7e48dd31 David Ahern 2018-05-09 6573 return -EINVAL;
87f5fc7e48dd31 David Ahern 2018-05-09 6574
31de4105f00d64 Martin KaFai Lau 2023-02-17 6575 if (flags & ~BPF_FIB_LOOKUP_MASK)
9ce64f192d161a David Ahern 2018-05-29 6576 return -EINVAL;
9ce64f192d161a David Ahern 2018-05-29 6577
2c0a10af688c02 Jesper Dangaard Brouer 2021-02-09 6578 if (params->tot_len)
2c0a10af688c02 Jesper Dangaard Brouer 2021-02-09 6579 check_mtu = true;
2c0a10af688c02 Jesper Dangaard Brouer 2021-02-09 6580
87f5fc7e48dd31 David Ahern 2018-05-09 6581 switch (params->family) {
87f5fc7e48dd31 David Ahern 2018-05-09 6582 #if IS_ENABLED(CONFIG_INET)
87f5fc7e48dd31 David Ahern 2018-05-09 6583 case AF_INET:
f392520e15094e Avinash Duduskar 2026-06-23 6584 rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu,
f392520e15094e Avinash Duduskar 2026-06-23 6585 &fwd_dev);
4f74fede40df8d David Ahern 2018-05-21 6586 break;
87f5fc7e48dd31 David Ahern 2018-05-09 6587 #endif
87f5fc7e48dd31 David Ahern 2018-05-09 6588 #if IS_ENABLED(CONFIG_IPV6)
87f5fc7e48dd31 David Ahern 2018-05-09 6589 case AF_INET6:
f392520e15094e Avinash Duduskar 2026-06-23 6590 rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu,
f392520e15094e Avinash Duduskar 2026-06-23 6591 &fwd_dev);
4f74fede40df8d David Ahern 2018-05-21 6592 break;
87f5fc7e48dd31 David Ahern 2018-05-09 6593 #endif
87f5fc7e48dd31 David Ahern 2018-05-09 6594 }
4f74fede40df8d David Ahern 2018-05-21 6595
2c0a10af688c02 Jesper Dangaard Brouer 2021-02-09 6596 if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) {
f392520e15094e Avinash Duduskar 2026-06-23 6597 /* without tot_len, check the skb against the FIB-result
f392520e15094e Avinash Duduskar 2026-06-23 6598 * device's MTU
2c0a10af688c02 Jesper Dangaard Brouer 2021-02-09 6599 */
f392520e15094e Avinash Duduskar 2026-06-23 6600 if (!is_skb_forwardable(fwd_dev, skb))
4c79579b44b187 David Ahern 2018-06-26 6601 rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
e1850ea9bd9eca Jesper Dangaard Brouer 2021-02-09 6602
f392520e15094e Avinash Duduskar 2026-06-23 6603 params->mtu_result = fwd_dev->mtu; /* union with tot_len */
4f74fede40df8d David Ahern 2018-05-21 6604 }
4f74fede40df8d David Ahern 2018-05-21 6605
4c79579b44b187 David Ahern 2018-06-26 6606 return rc;
87f5fc7e48dd31 David Ahern 2018-05-09 6607 }
87f5fc7e48dd31 David Ahern 2018-05-09 6608
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
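
In the listing above, the only remaining uses of `net` sit inside the two `#if IS_ENABLED(...)` blocks, so a configuration that compiles both out would leave the variable unused. The userspace sketch below reproduces that pattern under this assumption. `DEMO_INET`, `DEMO_IPV6` and every `demo_*` name are hypothetical, and the unused attribute is shown only as one possible way to silence the warning, not as the fix the series should take.

```c
/*
 * Userspace model; DEMO_INET and DEMO_IPV6 stand in for the kernel config
 * options and are hypothetical names. Build with neither defined to mirror
 * a configuration where both protocol families are disabled.
 */
#define DEMO_AF_INET		2
#define DEMO_AF_INET6		10
#define DEMO_EAFNOSUPPORT	97

/* Placeholder for the per-family lookup; the arithmetic is meaningless. */
int demo_lookup(int net_id, int family)
{
	return net_id + family;
}

int demo_fib_lookup(int net_id, int family)
{
	/*
	 * Without the attribute, this local is only read inside the #if
	 * blocks below, so -Wunused-variable fires when both are compiled out.
	 */
	__attribute__((unused)) int net = net_id;
	int rc = -DEMO_EAFNOSUPPORT;

	switch (family) {
#ifdef DEMO_INET
	case DEMO_AF_INET:
		rc = demo_lookup(net, family);
		break;
#endif
#ifdef DEMO_IPV6
	case DEMO_AF_INET6:
		rc = demo_lookup(net, family);
		break;
#endif
	default:
		break;
	}
	return rc;
}
```

An alternative with the same effect would be to declare the variable inside each guarded block, which keeps its scope as narrow as its use.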
* Re: [PATCH v4 9/9] rust: macros: remove `THIS_MODULE` static from `module!`
From: Andreas Hindborg @ 2026-06-23 12:28 UTC (permalink / raw)
To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
Jens Axboe, Dave Ertman, Ira Weiny, Leon Romanovsky, Igor Korotin,
FUJITA Tomonori, Bjorn Helgaas, Krzysztof Wilczyński,
Arve Hjønnevåg, Todd Kjos, Christian Brauner,
Carlos Llamas
Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
linux-kselftest, kunit-dev, linux-block, linux-kernel, netdev,
linux-pci, Alvin Sun
In-Reply-To: <20260623-fix-fops-owner-v4-9-0daf5f077d5c@linux.dev>
Alvin Sun <alvin.sun@linux.dev> writes:
> All users have been migrated to `ModuleMetadata::THIS_MODULE` const or
> `this_module::<LocalModule>()` helper. The `static THIS_MODULE`
> generated by the `module!` macro is no longer referenced anywhere,
> so remove it to avoid having two sources of the same `ThisModule`
> pointer.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Best regards,
Andreas Hindborg
* Re: [PATCH v4 2/9] rust: module: add `THIS_MODULE` const to `ModuleMetadata` trait
From: Andreas Hindborg @ 2026-06-23 12:28 UTC (permalink / raw)
To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
Jens Axboe, Dave Ertman, Ira Weiny, Leon Romanovsky, Igor Korotin,
FUJITA Tomonori, Bjorn Helgaas, Krzysztof Wilczyński,
Arve Hjønnevåg, Todd Kjos, Christian Brauner,
Carlos Llamas
Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
linux-kselftest, kunit-dev, linux-block, linux-kernel, netdev,
linux-pci, Alvin Sun
In-Reply-To: <20260623-fix-fops-owner-v4-2-0daf5f077d5c@linux.dev>
Alvin Sun <alvin.sun@linux.dev> writes:
> Since `const_refs_to_static` has been stable as of the MSRV bump, a
> `ThisModule` pointer can now be used in const contexts.
>
> Add a `THIS_MODULE` const to the `ModuleMetadata` trait so that modules
> can provide their `ThisModule` pointer in const contexts such as static
> `file_operations`.
>
> Add a `this_module()` helper to retrieve the `THIS_MODULE` pointer of a
> given module type, and update `__init` to use it instead of the
> `THIS_MODULE` static generated by the `module!` macro.
>
> The `static THIS_MODULE` generated by the `module!` macro is retained
> for backwards compatibility with existing users and removed in a later
> patch once all references have been migrated.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Best regards,
Andreas Hindborg
* Re: [PATCH net] net: do not acquire dev->tx_global_lock in netdev_watchdog_up()
From: Simon Horman @ 2026-06-23 12:22 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
eric.dumazet, Marek Szyprowski
In-Reply-To: <20260622110108.69541-1-edumazet@google.com>
On Mon, Jun 22, 2026 at 11:01:08AM +0000, Eric Dumazet wrote:
> Marek Szyprowski reported a deadlock during system resume when virtio_net
> driver is used.
>
> The deadlock occurs because netif_device_attach() is called while holding
> dev->tx_global_lock (via netif_tx_lock_bh() in virtnet_restore_up()).
> netif_device_attach() calls __netdev_watchdog_up(), which now also tries
> to acquire dev->tx_global_lock to synchronize with dev_watchdog().
>
> This recursive lock acquisition results in a deadlock.
>
> Fix this by removing the tx_global_lock acquisition from netdev_watchdog_up().
>
> The critical state (watchdog_timer and watchdog_ref_held) is already
> protected by dev->watchdog_lock, which was introduced in the blamed commit.
>
> Fixes: 8eed5519e496 ("net: watchdog: fix refcount tracking races")
> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> Closes: https://lore.kernel.org/netdev/a443376e-5187-4268-93b3-58047ef113a8@samsung.com/
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Simon Horman <horms@kernel.org>
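
The recursive acquisition described in the commit message can be modelled in userspace with a toy non-recursive lock. This is only a sketch: the `demo_*` names are hypothetical, and the toy lock reports failure where a real spinlock would spin forever. It contrasts a helper that takes the outer lock itself with one that relies solely on its own dedicated lock.

```c
#include <stdbool.h>

/* Toy non-recursive lock; a real spinlock would spin instead of failing. */
struct demo_lock {
	bool held;
};

/* Returns false in the situation where a real lock would deadlock. */
static bool demo_lock_acquire(struct demo_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void demo_lock_release(struct demo_lock *l)
{
	l->held = false;
}

struct demo_dev {
	struct demo_lock tx_global_lock;
	struct demo_lock watchdog_lock;
	bool watchdog_armed;
};

/* Shape before the fix: the helper also takes the Tx lock. */
bool demo_watchdog_up_old(struct demo_dev *dev)
{
	if (!demo_lock_acquire(&dev->tx_global_lock))
		return false;	/* caller already holds it */
	if (!demo_lock_acquire(&dev->watchdog_lock)) {
		demo_lock_release(&dev->tx_global_lock);
		return false;
	}
	dev->watchdog_armed = true;
	demo_lock_release(&dev->watchdog_lock);
	demo_lock_release(&dev->tx_global_lock);
	return true;
}

/* Shape after the fix: only the dedicated watchdog lock is taken. */
bool demo_watchdog_up_new(struct demo_dev *dev)
{
	if (!demo_lock_acquire(&dev->watchdog_lock))
		return false;
	dev->watchdog_armed = true;
	demo_lock_release(&dev->watchdog_lock);
	return true;
}

/* Models a resume path that attaches the device while holding the Tx lock. */
bool demo_restore_up(struct demo_dev *dev,
		     bool (*watchdog_up)(struct demo_dev *))
{
	bool ok;

	if (!demo_lock_acquire(&dev->tx_global_lock))
		return false;
	ok = watchdog_up(dev);
	demo_lock_release(&dev->tx_global_lock);
	return ok;
}
```

Calling `demo_restore_up()` with the old helper fails at the inner acquisition, while the new helper succeeds because the state it touches is covered by its own lock.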
* Re: [PATCH v1 0/3] thunderbold: A few cleanups
From: Mika Westerberg @ 2026-06-23 12:17 UTC (permalink / raw)
To: Uwe Kleine-König (The Capable Hub)
Cc: Mika Westerberg, Yehezkel Bernat, Andreas Noever, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev, linux-kernel, linux-usb
In-Reply-To: <cover.1781776904.git.u.kleine-koenig@baylibre.com>
Hi,
On Thu, Jun 18, 2026 at 12:14:49PM +0200, Uwe Kleine-König (The Capable Hub) wrote:
> Hello,
>
> I'm currently working on a project that includes looking at all device
> ID structures from <linux/mod_devicetable.h>. While doing that for
> tb_service_id, I spotted these patch opportunities.
>
> These are all non-critical and also my quest doesn't depend on this, so
> there is no urge to apply these patches. My suggestion is to apply them
> via the thunderbold tree during the next merge window with an ack from
> the network guys.
>
> The first patch touches drivers/net and drivers/thunderbold. It could
> theoretically be split, but then this results in at least 3 commits which
> seems excessive to handle three drivers, so I kept it as a single patch.
>
> The third patch is a style change and so is subjective. Drop it, if you
> don't like it. Here splitting would be easy, but given that patch #1
> already touches the same files, letting these go in together without
> splitting seems to be sensible.
>
> Best regards
> Uwe
>
> Uwe Kleine-König (The Capable Hub) (3):
> thunderbold: Stop passing matched device ID to .probe()
> thunderbold: Assert that a service driver has a probe callback
> thunderbold: Drop comma after device id array terminator
Fixed the typo "thunderbold" -> "thunderbolt" and applied all to
thunderbolt.git/next. I also took the networking patch, let me know if
that's not okay (I'm the maintainer of that driver too and it looked fine).
Thanks!
* Re: [PATCH bpf-next v4 2/3] bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
From: Toke Høiland-Jørgensen @ 2026-06-23 12:00 UTC (permalink / raw)
To: Avinash Duduskar, ast, daniel, andrii
Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
dsahern
In-Reply-To: <20260623025147.1001664-3-avinash.duduskar@gmail.com>
Avinash Duduskar <avinash.duduskar@gmail.com> writes:
> BPF_FIB_LOOKUP_VLAN resolves a VLAN egress. The reverse is also
> useful: an XDP program receiving a VLAN-tagged frame on a physical
> device wants the lookup to behave as if the packet had arrived on the
> corresponding VLAN subinterface, so iif-based policy routing and VRF
> table selection use the right ingress.
>
> Add BPF_FIB_LOOKUP_VLAN_INPUT. When set, params->h_vlan_proto and
> params->h_vlan_TCI are read as an input VLAN tag and the matching VLAN
> device of params->ifindex is resolved with __vlan_find_dev_deep_rcu().
> The device must be up and in the same network namespace as
> params->ifindex (a VLAN device can be moved to another netns while
> registered on its parent; receive would deliver into that other
> namespace, which a lookup here cannot represent). If params->ifindex
> is itself a VLAN device, its inner (QinQ) subinterface is matched.
> For a bond or team, a tag on a port matches no device and returns
> NOT_FWDED; pass the master's ifindex.
> The lookup then runs with the resolved device as the ingress;
> params->ifindex itself is not modified on the input side. When the
> resolved device is enslaved to a VRF, both the full lookup (via the
> l3mdev rule) and BPF_FIB_LOOKUP_DIRECT (via l3mdev_fib_table_rcu())
> select the VRF's table from the resolved ingress. That follows from
> feeding the resolved device to the flow as the ingress
> (fl4.flowi4_iif = dev->ifindex), which is what makes l3mdev resolve
> the VRF master from the subinterface rather than from
> params->ifindex.
>
> The two failure classes get different treatment on purpose. A
> h_vlan_proto other than 802.1Q/802.1ad is API misuse and returns
> -EINVAL, since it would otherwise reach the WARN in vlan_proto_idx()
> with a program-controlled value. An unmatched VID, a device that is
> down, or one in another namespace is a data outcome and returns
> BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
> fib_get_table() finds no table and mirroring real ingress, where the
> receive path drops such frames. A VID of 0 (a priority tag) is looked
> up literally and normally fails the same way; receive instead
> processes such frames untagged, so callers should not set the flag for
> priority tags. Proceeding on the physical device for any of these
> would be fail-open for the policy-routing cases above.
>
> The h_vlan fields share a union with tbid, so the flag cannot be
> combined with BPF_FIB_LOOKUP_TBID. It describes ingress, so it also
> cannot be combined with BPF_FIB_LOOKUP_OUTPUT. Both combinations
> return -EINVAL; restricting now keeps a later relaxation backward
> compatible. Combining with BPF_FIB_LOOKUP_VLAN is allowed: the tag is
> consumed on the ingress side and the egress tag is written on
> success.
>
> Under !CONFIG_VLAN_8021Q the __vlan_find_dev_deep_rcu() stub returns
> NULL, so every lookup with the flag returns NOT_FWDED, which is
> correct since no VLAN device can exist.
>
> Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
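
The argument checks described in the quoted commit message can be sketched as a small validation routine. This is a userspace model under stated assumptions: the `DEMO_F_*` bit positions are illustrative and are not the uapi values, the function name is hypothetical, and the protocol is compared in host byte order for simplicity. Only the two TPID constants are standard values.

```c
#include <stdint.h>

#define DEMO_EINVAL		22
#define DEMO_ETH_P_8021Q	0x8100	/* IEEE 802.1Q TPID */
#define DEMO_ETH_P_8021AD	0x88A8	/* IEEE 802.1ad TPID */

/* Bit positions are illustrative only; they are not the uapi values. */
#define DEMO_F_OUTPUT		(1U << 0)
#define DEMO_F_TBID		(1U << 1)
#define DEMO_F_VLAN		(1U << 2)
#define DEMO_F_VLAN_INPUT	(1U << 3)

/*
 * Returns 0 when the combination is acceptable and -DEMO_EINVAL for API
 * misuse. vlan_proto is in host byte order here to keep the model simple.
 */
int demo_check_vlan_input(uint32_t flags, uint16_t vlan_proto)
{
	if (!(flags & DEMO_F_VLAN_INPUT))
		return 0;

	/* The tag shares storage with tbid, and the flag describes ingress. */
	if (flags & (DEMO_F_TBID | DEMO_F_OUTPUT))
		return -DEMO_EINVAL;

	/* Reject anything other than the two VLAN protocols up front. */
	if (vlan_proto != DEMO_ETH_P_8021Q && vlan_proto != DEMO_ETH_P_8021AD)
		return -DEMO_EINVAL;

	return 0;
}
```

Data outcomes such as an unmatched VID are deliberately absent from this routine, since the commit message treats them as lookup results rather than argument errors.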
* Re: [PATCH bpf-next v4 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Toke Høiland-Jørgensen @ 2026-06-23 11:58 UTC (permalink / raw)
To: Avinash Duduskar, ast, daniel, andrii
Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
dsahern
In-Reply-To: <20260623025147.1001664-2-avinash.duduskar@gmail.com>
Avinash Duduskar <avinash.duduskar@gmail.com> writes:
> bpf_fib_lookup() returns the FIB-resolved egress ifindex straight
> from the fib result. When the egress is a VLAN device, the returned
> ifindex is the VLAN netdev's, which has no XDP xmit handler; XDP
> programs that want to forward the frame (e.g. xdp-forward) must
> instead target the underlying physical device and push the VLAN tag
> themselves. Today the program has no way to learn either the
> underlying ifindex or the VLAN tag without maintaining its own
> VLAN-to-ifindex map in userspace and refreshing it on netlink
> events.
>
> Add BPF_FIB_LOOKUP_VLAN. When the caller sets this flag and the fib
> result is a VLAN device whose immediate parent is a real (non-VLAN)
> device in the same network namespace, populate the existing output
> fields params->h_vlan_proto and params->h_vlan_TCI from the VLAN
> device and replace params->ifindex with the parent's ifindex.
> params->h_vlan_TCI carries the VID only, with PCP and DEI bits zero; a
> consumer wanting to set egress priority writes PCP itself.
> params->smac is the VLAN device's own address, which can differ from
> the parent's.
>
> Only the immediate parent is resolved, via vlan_dev_priv(dev)->real_dev
> and not vlan_dev_real_dev(), which walks to the bottom of a stack. When
> the immediate parent is not a real device in the same namespace, the
> lookup returns BPF_FIB_LKUP_RET_VLAN_FAILURE and leaves params->ifindex
> at the input. This covers a stacked VLAN (QinQ), where the immediate
> parent is itself a VLAN device and one h_vlan_proto/h_vlan_TCI pair
> cannot describe two tags, and a parent in another network namespace (a
> VLAN device can be moved while its parent stays), whose ifindex would
> be meaningless in the caller's namespace. A program that wants the VLAN
> device's own ifindex re-issues the lookup without BPF_FIB_LOOKUP_VLAN,
> so the unreducible case stays distinct from a physical egress. That
> distinction matters for XDP: a program cannot xmit on a VLAN device, so
> a success carrying the VLAN ifindex would make it redirect to a device
> with no ndo_xdp_xmit and drop the frame at xdp_do_flush(). The swap and
> the vlan fields are written only on the reduce path; other output
> fields keep their existing behaviour, so a frag-needed result still
> reports the route mtu in params->mtu_result.
>
> On the skb path without tot_len the deferred mtu check is done against
> the resolved egress device. To keep that the VLAN device rather than
> the parent after the swap, bpf_ipv4_fib_lookup()/bpf_ipv6_fib_lookup()
> hand the FIB-result device back to the caller; the XDP path always
> runs the route-mtu check and passes NULL. When the flag is not set,
> behaviour is unchanged: h_vlan_proto and h_vlan_TCI are zeroed and
> ifindex is left at the FIB result.
>
> The new block is compiled only under CONFIG_VLAN_8021Q since
> vlan_dev_priv() is not defined otherwise; without that config
> is_vlan_dev() is constant false and the flag is accepted but never
> acts. That is safe because no VLAN device can exist there, so every
> egress is already physical.
>
> This lets an XDP redirect target the physical device and learn the
> tag to push in a single lookup, which xdp-forward's optional VLAN
> mode (xdp-project/xdp-tools#504) wants from the kernel side.
>
> The helper's input semantics are unchanged; the reverse direction
> (supplying a tag as lookup input) is added in the following patch.
>
> Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
> ---
> include/uapi/linux/bpf.h | 28 +++++++++++++-
> net/core/filter.c | 69 ++++++++++++++++++++++++----------
> tools/include/uapi/linux/bpf.h | 28 +++++++++++++-
> 3 files changed, 104 insertions(+), 21 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 89b36de5fdbb..8d0058d88eb2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3532,6 +3532,26 @@ union bpf_attr {
> * Use the mark present in *params*->mark for the fib lookup.
> * This option should not be used with BPF_FIB_LOOKUP_DIRECT,
> * as it only has meaning for full lookups.
> + * **BPF_FIB_LOOKUP_VLAN**
> + * If the fib lookup resolves to a VLAN device whose
> + * parent is a real (non-VLAN) device, set
> + * *params*->h_vlan_proto and *params*->h_vlan_TCI from
> + * the VLAN device and replace *params*->ifindex with the
> + * parent's ifindex. *params*->h_vlan_TCI carries the VID
> + * only, with PCP and DEI bits zero; a consumer wanting to
> + * set egress priority writes PCP itself. *params*->smac is
> + * the VLAN device's own address, which can differ from the
> + * parent's. Only the immediate parent is resolved; if it
> + * is itself a VLAN device (QinQ) or in another namespace,
> + * the egress cannot be reduced to a physical device plus
> + * one tag and the lookup returns
> + * **BPF_FIB_LKUP_RET_VLAN_FAILURE** with *params*->ifindex
> + * left at the input. Re-issue without
> + * **BPF_FIB_LOOKUP_VLAN** to obtain the VLAN device's own
> + * ifindex. The swap and the vlan fields
> + * are written only on success; other output fields keep
> + * the helper's existing behaviour, so a frag-needed result
> + * still reports the route mtu in *params*->mtu_result.
> *
> * *ctx* is either **struct xdp_md** for XDP programs or
> * **struct sk_buff** tc cls_act programs.
> @@ -7327,6 +7347,7 @@ enum {
> BPF_FIB_LOOKUP_TBID = (1U << 3),
> BPF_FIB_LOOKUP_SRC = (1U << 4),
> BPF_FIB_LOOKUP_MARK = (1U << 5),
> + BPF_FIB_LOOKUP_VLAN = (1U << 6),
> };
>
> enum {
> @@ -7340,6 +7361,7 @@ enum {
> BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */
> BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */
> BPF_FIB_LKUP_RET_NO_SRC_ADDR, /* failed to derive IP src addr */
> + BPF_FIB_LKUP_RET_VLAN_FAILURE, /* VLAN egress, parent unresolvable */
> };
>
> struct bpf_fib_lookup {
> @@ -7393,7 +7415,11 @@ struct bpf_fib_lookup {
>
> union {
> struct {
> - /* output */
> + /*
> + * output with BPF_FIB_LOOKUP_VLAN: set from the
> + * resolved egress VLAN device (see the flag); zeroed
> + * on other successful lookups.
> + */
> __be16 h_vlan_proto;
> __be16 h_vlan_TCI;
> };
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 2e96b4b847ce..8345295d84de 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6201,10 +6201,28 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
> #endif
>
> #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
> -static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
> +static int bpf_fib_set_fwd_params(struct net_device *dev,
> + struct bpf_fib_lookup *params,
> + u32 flags, u32 mtu)
> {
> params->h_vlan_TCI = 0;
> params->h_vlan_proto = 0;
> +
> +#if IS_ENABLED(CONFIG_VLAN_8021Q)
> + if ((flags & BPF_FIB_LOOKUP_VLAN) && is_vlan_dev(dev)) {
If you move the ifdef into the if statement, the if statement can have
an else-branch that assigns params->ifindex, so you don't need the
restore dance (see below).
> + struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
> +
> + if (!is_vlan_dev(real_dev) &&
> + net_eq(dev_net(real_dev), dev_net(dev))) {
> + params->h_vlan_proto = vlan_dev_vlan_proto(dev);
> + params->h_vlan_TCI = htons(vlan_dev_vlan_id(dev));
> + params->ifindex = real_dev->ifindex;
> + } else {
> + return BPF_FIB_LKUP_RET_VLAN_FAILURE;
> + }
> + }
> +#endif
> +
> if (mtu)
> params->mtu_result = mtu; /* union with tot_len */
>
> @@ -6214,8 +6232,10 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
>
> #if IS_ENABLED(CONFIG_INET)
> static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> - u32 flags, bool check_mtu)
> + u32 flags, bool check_mtu,
> + struct net_device **fwd_dev)
> {
> + u32 in_ifindex = params->ifindex;
> struct neighbour *neigh = NULL;
> struct fib_nh_common *nhc;
> struct in_device *in_dev;
> @@ -6347,16 +6367,23 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> memcpy(params->smac, dev->dev_addr, ETH_ALEN);
>
> set_fwd_params:
> - return bpf_fib_set_fwd_params(params, mtu);
> + if (fwd_dev)
> + *fwd_dev = dev;
> + err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
> + if (err == BPF_FIB_LKUP_RET_VLAN_FAILURE)
> + params->ifindex = in_ifindex;
> + return err;
I think it's better to just move the assignment of params->ifindex
entirely into bpf_fib_set_fwd_params(), instead of this restore dance.
That way this can be simplified to:
err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
if (!err && fwd_dev)
*fwd_dev = dev;
return err;
> }
> #endif
>
> #if IS_ENABLED(CONFIG_IPV6)
> static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> - u32 flags, bool check_mtu)
> + u32 flags, bool check_mtu,
> + struct net_device **fwd_dev)
> {
> struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
> struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
> + u32 in_ifindex = params->ifindex;
> struct fib6_result res = {};
> struct neighbour *neigh;
> struct net_device *dev;
> @@ -6486,13 +6513,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> memcpy(params->smac, dev->dev_addr, ETH_ALEN);
>
> set_fwd_params:
> - return bpf_fib_set_fwd_params(params, mtu);
> + if (fwd_dev)
> + *fwd_dev = dev;
> + err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
> + if (err == BPF_FIB_LKUP_RET_VLAN_FAILURE)
> + params->ifindex = in_ifindex;
> + return err;
Same as above.
-Toke
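
One possible reading of the suggestion above is sketched below as a userspace model. It is not the kernel code: every `demo_*` name and constant is hypothetical, byte-order conversion is omitted, and the config guard is left out. The idea it tries to show is that when the helper is the only place that writes the output ifindex, a failure return leaves the caller's value untouched and no restore step is needed.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_RET_SUCCESS	0
#define DEMO_RET_VLAN_FAILURE	1		/* illustrative value */
#define DEMO_F_VLAN		(1U << 0)	/* illustrative bit */

struct demo_dev {
	int ifindex;
	int netns;			/* stand-in for the network namespace */
	struct demo_dev *real_dev;	/* non-NULL when this is a VLAN device */
	uint16_t vlan_proto;
	uint16_t vlan_id;
};

struct demo_params {
	int ifindex;
	uint16_t h_vlan_proto;
	uint16_t h_vlan_TCI;
};

static int demo_is_vlan(const struct demo_dev *dev)
{
	return dev->real_dev != NULL;
}

/* The helper is the only place that writes params->ifindex. */
int demo_set_fwd_params(const struct demo_dev *dev,
			struct demo_params *params, uint32_t flags)
{
	params->h_vlan_proto = 0;
	params->h_vlan_TCI = 0;

	if ((flags & DEMO_F_VLAN) && demo_is_vlan(dev)) {
		const struct demo_dev *real = dev->real_dev;

		/* Stacked VLAN or foreign namespace: leave ifindex alone. */
		if (demo_is_vlan(real) || real->netns != dev->netns)
			return DEMO_RET_VLAN_FAILURE;

		params->h_vlan_proto = dev->vlan_proto;
		params->h_vlan_TCI = dev->vlan_id;
		params->ifindex = real->ifindex;
	} else {
		params->ifindex = dev->ifindex;
	}
	return DEMO_RET_SUCCESS;
}

/* Caller shape from the review: hand the device back only on success. */
int demo_lookup_tail(struct demo_dev *dev, struct demo_params *params,
		     uint32_t flags, struct demo_dev **fwd_dev)
{
	int err = demo_set_fwd_params(dev, params, flags);

	if (!err && fwd_dev)
		*fwd_dev = dev;
	return err;
}
```

In this model a QinQ device with the flag set returns the failure code and `params->ifindex` still holds whatever the caller passed in.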
* [PATCH] net: sparx5: unregister blocking notifier on init failure
From: Haoxiang Li @ 2026-06-23 11:57 UTC (permalink / raw)
To: andrew+netdev, davem, edumazet, kuba, pabeni, Steen.Hegelund,
daniel.machon, UNGLinuxDriver, kees, horms, bjarni.jonasson,
lars.povlsen
Cc: netdev, linux-arm-kernel, linux-kernel, Haoxiang Li, stable
sparx5_register_notifier_blocks() registers the switchdev blocking
notifier before allocating the ordered workqueue. If the workqueue
allocation fails, the error path unregisters the switchdev and netdevice
notifiers, but leaves the blocking notifier registered.
Add a separate error label for the workqueue allocation failure path and
unregister the switchdev blocking notifier there.
Fixes: d6fce5141929 ("net: sparx5: add switching support")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c b/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
index 644458108dd2..dac4dd833127 100644
--- a/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
+++ b/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
@@ -765,11 +765,13 @@ int sparx5_register_notifier_blocks(struct sparx5 *s5)
sparx5_owq = alloc_ordered_workqueue("sparx5_order", 0);
if (!sparx5_owq) {
err = -ENOMEM;
- goto err_switchdev_blocking_nb;
+ goto err_alloc_workqueue;
}
return 0;
+err_alloc_workqueue:
+ unregister_switchdev_blocking_notifier(&s5->switchdev_blocking_nb);
err_switchdev_blocking_nb:
unregister_switchdev_notifier(&s5->switchdev_nb);
err_switchdev_nb:
--
2.25.1
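
The label ordering in this fix follows the usual stacked-unwind pattern, which the userspace model below tries to illustrate. The `demo_*` names and the `fail_step` fault-injection parameter are hypothetical. Each label undoes the step that succeeded just before the failing one and then falls through to the earlier labels.

```c
#include <stdbool.h>

#define DEMO_ENOMEM 12

/* Tracks which registrations are currently active. */
struct demo_state {
	bool netdev_nb;
	bool switchdev_nb;
	bool blocking_nb;
	bool workqueue;
};

/* fail_step selects which step fails (1..4); 0 means every step succeeds. */
int demo_register_all(struct demo_state *s, int fail_step)
{
	int err = -DEMO_ENOMEM;

	if (fail_step == 1)
		return err;
	s->netdev_nb = true;

	if (fail_step == 2)
		goto err_switchdev_nb;
	s->switchdev_nb = true;

	if (fail_step == 3)
		goto err_switchdev_blocking_nb;
	s->blocking_nb = true;

	if (fail_step == 4)
		goto err_alloc_workqueue;
	s->workqueue = true;

	return 0;

err_alloc_workqueue:
	/* The label added for the last step undoes the step before it. */
	s->blocking_nb = false;
err_switchdev_blocking_nb:
	s->switchdev_nb = false;
err_switchdev_nb:
	s->netdev_nb = false;
	return err;
}
```

Jumping to `err_switchdev_blocking_nb` from the fourth step, as the pre-fix code did, would skip the first unwind line and leave `blocking_nb` set in this model.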
* [PATCH] octeontx2-af: Free BPID bitmap on setup failure
From: Haoxiang Li @ 2026-06-23 11:43 UTC (permalink / raw)
To: sgoutham, lcherian, gakula, hkelam, sbhatta, andrew+netdev, davem,
edumazet, kuba, pabeni, horms
Cc: netdev, linux-kernel, Haoxiang Li, stable
nix_setup_bpids() allocates bp->bpids with rvu_alloc_bitmap(), which uses
a plain kcalloc(). If any of the following devm_kcalloc() allocations for
the BPID mapping arrays fails, the function returns without freeing the
bitmap. Free the BPID bitmap before returning from those error paths.
Fixes: d6212d2e41a0 ("octeontx2-af: Create BPIDs free pool")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
index d8989395e875..0297c7ab0614 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -528,19 +528,24 @@ static int nix_setup_bpids(struct rvu *rvu, struct nix_hw *hw, int blkaddr)
bp->fn_map = devm_kcalloc(rvu->dev, bp->bpids.max,
sizeof(u16), GFP_KERNEL);
if (!bp->fn_map)
- return -ENOMEM;
+ goto free_bpids;
bp->intf_map = devm_kcalloc(rvu->dev, bp->bpids.max,
sizeof(u8), GFP_KERNEL);
if (!bp->intf_map)
- return -ENOMEM;
+ goto free_bpids;
bp->ref_cnt = devm_kcalloc(rvu->dev, bp->bpids.max,
sizeof(u8), GFP_KERNEL);
if (!bp->ref_cnt)
- return -ENOMEM;
+ goto free_bpids;
return 0;
+
+free_bpids:
+ rvu_free_bitmap(&bp->bpids);
+ bp->bpids.bmap = NULL;
+ return -ENOMEM;
}
void rvu_nix_flr_free_bpids(struct rvu *rvu, u16 pcifunc)
--
2.25.1
^ permalink raw reply related
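
The octeontx2-af fix above hinges on mixing a plain allocation with device-managed ones: the managed arrays are released with the device, while the plain one must be freed by hand on every error path. A minimal user-space sketch of that pattern follows. All names here (`managed_calloc()`, `device_release()`, `setup_bpids()`, `struct bp_state`) are hypothetical, the fixed-size pool is only a toy model of devres, and `fail_at` is a test hook for injecting a failure.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model of device-managed memory: everything recorded here is
 * freed in one go by device_release().
 */
#define MAX_MANAGED 8
static void *managed[MAX_MANAGED];
static int nr_managed;

static void *managed_calloc(size_t n, size_t size, int fail)
{
	void *p;

	if (fail || nr_managed == MAX_MANAGED)
		return NULL;
	p = calloc(n, size);
	if (p)
		managed[nr_managed++] = p;
	return p;
}

void device_release(void)
{
	while (nr_managed)
		free(managed[--nr_managed]);
}

struct bp_state {
	unsigned long *bmap;		/* plain allocation, freed manually */
	unsigned short *fn_map;		/* managed */
	unsigned char *intf_map;	/* managed */
};

/* fail_at: index of the managed allocation that should fail, or -1. */
int setup_bpids(struct bp_state *bp, size_t max, int fail_at)
{
	bp->bmap = calloc(max, sizeof(*bp->bmap));
	if (!bp->bmap)
		return -ENOMEM;

	bp->fn_map = managed_calloc(max, sizeof(*bp->fn_map), fail_at == 0);
	if (!bp->fn_map)
		goto free_bmap;

	bp->intf_map = managed_calloc(max, sizeof(*bp->intf_map), fail_at == 1);
	if (!bp->intf_map)
		goto free_bmap;

	return 0;

free_bmap:
	/* The managed arrays go away with the device; the bitmap does not. */
	free(bp->bmap);
	bp->bmap = NULL;
	return -ENOMEM;
}
```

Clearing the pointer after freeing mirrors the `bp->bpids.bmap = NULL` line in the patch, so a later cleanup pass that frees the bitmap again does not double free.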
* Re: [PATCH net] net: ethernet: qualcomm: ppe: Demote from supported and fix maintainer addresses
From: Andrew Lunn @ 2026-06-23 11:33 UTC (permalink / raw)
To: Krzysztof Kozlowski
Cc: Jie Luo, Bjorn Andersson, Michael Turquette, Stephen Boyd,
Brian Masney, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Lei Wei, Suruchi Agarwal, Pavithra R, linux-kernel,
linux-arm-msm, linux-clk, devicetree, netdev
In-Reply-To: <f8441903-c768-46a1-8f95-b1b25d420a2c@oss.qualcomm.com>
> If address did not work for half a year, I really doubt that you commit
> to above.
I tend to agree. Maybe we should set it to Orphaned, and then decide
in 6 months time if it can be set back to Maintained?
Andrew
^ permalink raw reply
* Re: [PATCH net] net: ethernet: qualcomm: ppe: Demote from supported and fix maintainer addresses
From: Andrew Lunn @ 2026-06-23 11:31 UTC (permalink / raw)
To: Jie Luo
Cc: Krzysztof Kozlowski, Bjorn Andersson, Michael Turquette,
Stephen Boyd, Brian Masney, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Lei Wei, Suruchi Agarwal, Pavithra R,
linux-kernel, linux-arm-msm, linux-clk, devicetree, netdev
In-Reply-To: <8b0560ae-af5c-4d54-be02-d186be1d799c@oss.qualcomm.com>
On Tue, Jun 23, 2026 at 05:42:34PM +0800, Jie Luo wrote:
>
>
> On 6/23/2026 4:10 PM, Andrew Lunn wrote:
> >> Driver is not supported - in terms of how netdev understands supported
> >> commitment - if maintainer does not care to receive the patches for its
> >> code, so demote it to "maintained" to reflect true status.
> >
> > Maybe "Orphan" would be better, if the listed Maintainer is not doing
> > any Maintainer work?
> >
> > Andrew
>
> Hello Andrew, Krzysztof,
> I will continue to maintain the listed drivers, so their status can
> remain Supported.
Please understand that being a Maintainer requires that you respond to
patches and questions about this driver, give Reviewed-by:, ask for
patches to be changed etc. If you don't respond, ideally within 2 to 3
days, the driver will be set to Orphaned.
If you want to maintain the Supported status, we can help you set up
the needed CI system, and get it registered so it reports the results.
Andrew
^ permalink raw reply
* Re: [PATCH net v3 1/2] net: ethernet: sunplus: spl2sw: fix phy_node refcount leak in remove
From: Andrew Lunn @ 2026-06-23 11:24 UTC (permalink / raw)
To: Shitalkumar Gandhi
Cc: Wells Lu, Jakub Kicinski, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, netdev, linux-kernel,
Shitalkumar Gandhi
In-Reply-To: <f3bdd4c91f3e2269b4e256075f9dc70808b1b8e9.1782195965.git.shitalkumar.gandhi@cambiumnetworks.com>
On Tue, Jun 23, 2026 at 12:11:42PM +0530, Shitalkumar Gandhi wrote:
> mac->phy_node is acquired via of_parse_phandle() in spl2sw_probe() and
> stored in the mac private data, transferring ownership of the
> device_node reference to mac. On driver removal, spl2sw_phy_remove()
> disconnects the PHY but never drops that reference, so each
> probe-then-remove cycle leaks one of_node refcount per port permanently.
>
> Drop the reference after phy_disconnect(). While at it, remove the
> redundant inner "if (ndev)" check; comm->ndev[i] was just verified
> non-NULL on the line above.
>
> Compile-tested only; no SP7021 hardware available.
>
> Fixes: fd3040b9394c ("net: ethernet: Add driver for Sunplus SP7021")
> Signed-off-by: Shitalkumar Gandhi <shitalkumar.gandhi@cambiumnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Andrew
^ permalink raw reply
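
The leak described in the quoted commit message above is a plain get/put imbalance: probe takes a reference that remove never drops. A small user-space sketch under that reading is shown below. `struct node`, `node_get()`, `node_put()`, `mac_probe()` and `mac_remove()` are hypothetical stand-ins, not the kernel's OF API or the spl2sw driver code.

```c
#include <stddef.h>

/* Hypothetical reference-counted node, standing in for a device_node. */
struct node {
	int refcount;
};

struct mac {
	struct node *phy_node;
};

static struct node *node_get(struct node *n)
{
	n->refcount++;
	return n;
}

static void node_put(struct node *n)
{
	n->refcount--;
}

/* Probe takes a reference and stores it, so the mac now owns it. */
void mac_probe(struct mac *mac, struct node *phy)
{
	mac->phy_node = node_get(phy);
}

/* Remove must drop the reference taken in probe. Without the
 * node_put() call, every probe/remove cycle leaks one count.
 */
void mac_remove(struct mac *mac)
{
	/* the PHY would be disconnected here, before the put */
	node_put(mac->phy_node);
	mac->phy_node = NULL;
}
```

Running several probe/remove cycles against the same node and checking that its count returns to the starting value is a quick way to spot this class of imbalance.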
* [PATCH net v2] net: ti: icssg-prueth: fix XDP_TX from the AF_XDP zero-copy RX path
From: David Carlier @ 2026-06-23 11:22 UTC (permalink / raw)
To: danishanwar, rogerq, andrew+netdev, netdev
Cc: davem, edumazet, kuba, pabeni, horms, m-malladi, hawk,
john.fastabend, sdf, ast, daniel, bpf, linux-arm-kernel,
linux-kernel, stable, David Carlier
On XDP_TX from the zero-copy RX path, emac_run_xdp() converts the xsk
buffer via xdp_convert_zc_to_xdp_frame(), which clones the data into a
fresh MEM_TYPE_PAGE_ORDER0 page that is not DMA mapped. Transmitting it
as PRUETH_TX_BUFF_TYPE_XDP_TX derives the DMA address with
page_pool_get_dma_addr(), reading an uninitialized page->dma_addr, so
the device DMAs from a bogus address (corrupt TX, or an IOMMU fault).
Pick the TX buffer type from the frame's memory type: keep
PRUETH_TX_BUFF_TYPE_XDP_TX for page_pool frames and use
PRUETH_TX_BUFF_TYPE_XDP_NDO for the cloned zero-copy frame, which is then
DMA mapped through the NDO path and unmapped on completion.
While at it, fix the page_pool XDP_TX completion path. A
PRUETH_TX_BUFF_TYPE_XDP_TX frame carries a page_pool-owned DMA mapping
(established against rx_chn->dma_dev), yet prueth_xmit_free()
unconditionally calls dma_unmap_single() on it with tx_chn->dma_dev,
tearing down a mapping the driver does not own; xdp_return_frame()
already recycles the page back to the pool. Tag such frames with a
dedicated PRUETH_SWDATA_XDPF_TX type so the completion path skips the
unmap, the same way PRUETH_SWDATA_XSK buffers are handled.
Fixes: 7a64bb388df3 ("net: ti: icssg-prueth: Add AF_XDP zero copy for RX")
Fixes: 62aa3246f462 ("net: ti: icssg-prueth: Add XDP support")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
---
v2:
- fold in the page_pool XDP_TX completion-path unmap fix raised by
Meghana Malladi: tag page_pool TX frames with PRUETH_SWDATA_XDPF_TX
so prueth_xmit_free() skips dma_unmap_single() on a pool-owned
mapping; xdp_return_frame() already recycles the page.
- add Fixes: 62aa3246f462 for that path.
- no change to the original zero-copy fix.
v1: https://lore.kernel.org/netdev/20260620213756.87499-1-devnexen@gmail.com
drivers/net/ethernet/ti/icssg/icssg_common.c | 20 +++++++++++++++++---
drivers/net/ethernet/ti/icssg/icssg_prueth.h | 1 +
2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/ti/icssg/icssg_common.c b/drivers/net/ethernet/ti/icssg/icssg_common.c
index 82ddef9c17d5..96c8bf5ef671 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_common.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_common.c
@@ -185,7 +185,7 @@ void prueth_xmit_free(struct prueth_tx_chn *tx_chn,
first_desc = desc;
next_desc = first_desc;
swdata = cppi5_hdesc_get_swdata(first_desc);
- if (swdata->type == PRUETH_SWDATA_XSK)
+ if (swdata->type == PRUETH_SWDATA_XSK || swdata->type == PRUETH_SWDATA_XDPF_TX)
goto free_pool;
cppi5_hdesc_get_obuf(first_desc, &buf_dma, &buf_dma_len);
@@ -259,6 +259,7 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
napi_consume_skb(skb, budget);
break;
case PRUETH_SWDATA_XDPF:
+ case PRUETH_SWDATA_XDPF_TX:
xdpf = swdata->data.xdpf;
dev_sw_netstats_tx_add(ndev, 1, xdpf->len);
total_bytes += xdpf->len;
@@ -769,7 +770,8 @@ u32 emac_xmit_xdp_frame(struct prueth_emac *emac,
k3_udma_glue_tx_dma_to_cppi5_addr(tx_chn->tx_chn, &buf_dma);
cppi5_hdesc_attach_buf(first_desc, buf_dma, xdpf->len, buf_dma, xdpf->len);
swdata = cppi5_hdesc_get_swdata(first_desc);
- swdata->type = PRUETH_SWDATA_XDPF;
+ swdata->type = buff_type == PRUETH_TX_BUFF_TYPE_XDP_TX ?
+ PRUETH_SWDATA_XDPF_TX : PRUETH_SWDATA_XDPF;
swdata->data.xdpf = xdpf;
/* Report BQL before sending the packet */
@@ -804,6 +806,7 @@ EXPORT_SYMBOL_GPL(emac_xmit_xdp_frame);
*/
static u32 emac_run_xdp(struct prueth_emac *emac, struct xdp_buff *xdp, u32 *len)
{
+ enum prueth_tx_buff_type tx_buff_type;
struct net_device *ndev = emac->ndev;
struct netdev_queue *netif_txq;
int cpu = smp_processor_id();
@@ -826,11 +829,21 @@ static u32 emac_run_xdp(struct prueth_emac *emac, struct xdp_buff *xdp, u32 *len
goto drop;
}
+ /* In AF_XDP zero-copy mode xdp_convert_buff_to_frame()
+ * clones the xsk buffer into a fresh MEM_TYPE_PAGE_ORDER0
+ * page that is not DMA mapped. Such a frame must be mapped
+ * via the NDO path; only a page pool-backed frame already
+ * carries a usable page_pool DMA address.
+ */
+ tx_buff_type = xdpf->mem_type == MEM_TYPE_PAGE_POOL ?
+ PRUETH_TX_BUFF_TYPE_XDP_TX :
+ PRUETH_TX_BUFF_TYPE_XDP_NDO;
+
q_idx = cpu % emac->tx_ch_num;
netif_txq = netdev_get_tx_queue(ndev, q_idx);
__netif_tx_lock(netif_txq, cpu);
result = emac_xmit_xdp_frame(emac, xdpf, q_idx,
- PRUETH_TX_BUFF_TYPE_XDP_TX);
+ tx_buff_type);
__netif_tx_unlock(netif_txq);
if (result == ICSSG_XDP_CONSUMED) {
ndev->stats.tx_dropped++;
@@ -1395,6 +1408,7 @@ void prueth_tx_cleanup(void *data, dma_addr_t desc_dma)
dev_kfree_skb_any(skb);
break;
case PRUETH_SWDATA_XDPF:
+ case PRUETH_SWDATA_XDPF_TX:
xdpf = swdata->data.xdpf;
xdp_return_frame(xdpf);
break;
diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.h b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
index df93d15c5b78..00bb760d68a9 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.h
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.h
@@ -153,6 +153,7 @@ enum prueth_swdata_type {
PRUETH_SWDATA_CMD,
PRUETH_SWDATA_XDPF,
PRUETH_SWDATA_XSK,
+ PRUETH_SWDATA_XDPF_TX,
};
enum prueth_tx_buff_type {
--
2.53.0
^ permalink raw reply related
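
The decision chain in the icssg-prueth patch above (frame memory type, then TX buffer type, then completion-time handling) can be summarised in a small standalone sketch. The enums and helper names below are simplified, hypothetical mirrors of the driver's identifiers; only the XDP frame cases are modelled, and the XSK and skb cases are left out.

```c
#include <stdbool.h>

/* Simplified, hypothetical mirrors of the kernel enums in the patch. */
enum mem_type { MEM_PAGE_ORDER0, MEM_PAGE_POOL };
enum tx_buff_type { TX_BUFF_XDP_TX, TX_BUFF_XDP_NDO };
enum swdata_type { SWDATA_XDPF, SWDATA_XDPF_TX };

/* Only a page_pool-backed frame already carries a usable DMA address;
 * anything else (such as the cloned zero-copy frame) has to be mapped
 * on the NDO path.
 */
enum tx_buff_type pick_tx_buff_type(enum mem_type mem)
{
	return mem == MEM_PAGE_POOL ? TX_BUFF_XDP_TX : TX_BUFF_XDP_NDO;
}

/* Tag the descriptor so completion can tell who owns the mapping. */
enum swdata_type pick_swdata_type(enum tx_buff_type buff)
{
	return buff == TX_BUFF_XDP_TX ? SWDATA_XDPF_TX : SWDATA_XDPF;
}

/* The driver must unmap only mappings it created itself. */
bool completion_must_unmap(enum swdata_type type)
{
	return type != SWDATA_XDPF_TX;
}
```

In this model a page_pool frame ends up tagged so that completion skips the unmap, while a freshly cloned order-0 page is mapped at transmit time and unmapped on completion, which matches the two cases the commit message describes.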
* Re: [PATCH v2 2/2] net: fman: use devm_kzalloc() for fman and rely on devres
From: Andrew Lunn @ 2026-06-23 11:22 UTC (permalink / raw)
To: 赵金明
Cc: horms, andrew+netdev, davem, edumazet, kuba, linux-kernel,
madalin.bucur, netdev, pabeni, sean.anderson
In-Reply-To: <823580887DE24145+2026062314162397367012@uniontech.com>
On Tue, Jun 23, 2026 at 02:16:25PM +0800, 赵金明 wrote:
> Hi Andrew,
>
> Thank you for pointing me to the netdev maintainer documentation. I have
> read section 1.7.4 and I understand the concern about standalone
> cleanup conversions.
>
> I would like to clarify the actual motivation behind the
> devm_kzalloc() change. While it may appear to be a simple devm_
> conversion on the surface, it is in fact fixing a use-after-free race
> condition in the IRQF_SHARED error paths. Let me explain the problem
> in detail.
Please make the commit message explain what the fix is, rather than
just saying it converts to devm_.
But I also hope you see why we don't like devm_ conversions:
developers get them wrong like this. And all too often, they
do the conversion without actual hardware to test it with. So it
results in more bugs, not fewer.
Andrew
^ permalink raw reply