* [PATCH net] net: gianfar: dispose irq mappings on probe failure and device removal
From: Rosen Penev @ 2026-06-26 22:52 UTC
To: netdev
Cc: Claudiu Manoil, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Andy Fleming, open list
irq_of_parse_and_map() creates irqdomain mappings that should be
balanced with irq_dispose_mapping(). The driver never called
irq_dispose_mapping(), leaking mappings on probe failure and
device removal.
Fix by adding irq_dispose_mapping() in free_gfar_dev() and
expanding its loop from priv->num_grps to MAXGROUPS so the
error path also catches partially-initialized groups. All
irqinfo pointers are pre-initialized to NULL in gfar_of_init(),
making the NULL-guarded walk in free_gfar_dev() safe for every
scenario.
gfar_parse_group() itself is left as a simple parse function
with no resource management; cleanup is centralized in the
caller's error path.
Assisted-by: opencode:big-pickle
Fixes: b31a1d8b4151 ("gianfar: Convert gianfar to an of_platform_driver")
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
drivers/net/ethernet/freescale/gianfar.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 3271de5844f8..89215e1ddc2d 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -469,10 +469,13 @@ static void free_gfar_dev(struct gfar_private *priv)
{
int i, j;
- for (i = 0; i < priv->num_grps; i++)
+ for (i = 0; i < MAXGROUPS; i++)
for (j = 0; j < GFAR_NUM_IRQS; j++) {
- kfree(priv->gfargrp[i].irqinfo[j]);
- priv->gfargrp[i].irqinfo[j] = NULL;
+ if (priv->gfargrp[i].irqinfo[j]) {
+ irq_dispose_mapping(priv->gfargrp[i].irqinfo[j]->irq);
+ kfree(priv->gfargrp[i].irqinfo[j]);
+ priv->gfargrp[i].irqinfo[j] = NULL;
+ }
}
free_netdev(priv->ndev);
@@ -616,7 +619,7 @@ static phy_interface_t gfar_get_interface(struct net_device *dev)
static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
{
const char *model;
- int err = 0, i;
+ int err = 0, i, j;
phy_interface_t interface;
struct net_device *dev = NULL;
struct gfar_private *priv = NULL;
@@ -702,8 +705,11 @@ static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
priv->rx_list.count = 0;
mutex_init(&priv->rx_queue_access);
- for (i = 0; i < MAXGROUPS; i++)
+ for (i = 0; i < MAXGROUPS; i++) {
priv->gfargrp[i].regs = NULL;
+ for (j = 0; j < GFAR_NUM_IRQS; j++)
+ priv->gfargrp[i].irqinfo[j] = NULL;
+ }
/* Parse and initialize group specific information */
if (priv->mode == MQ_MG_MODE) {
--
2.54.0
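The cleanup pattern described in the commit message can be sketched in userspace C. This is only an illustration, not the driver code: MAX_GROUPS, NUM_IRQS, dispose_mapping(), priv_init(), group_parse() and priv_free() are hypothetical stand-ins, and dispose_mapping() merely counts calls instead of touching any irqdomain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_GROUPS 2 /* hypothetical stand-in for MAXGROUPS */
#define NUM_IRQS   3 /* hypothetical stand-in for GFAR_NUM_IRQS */

struct irq_info { unsigned int irq; };
struct group { struct irq_info *irqinfo[NUM_IRQS]; };
struct dev_priv {
	struct group grp[MAX_GROUPS];
	unsigned int disposed; /* counts simulated dispose calls */
};

/* Stand-in for irq_dispose_mapping(); it only counts calls. */
static void dispose_mapping(struct dev_priv *p, unsigned int irq)
{
	(void)irq;
	p->disposed++;
}

/* Pre-initialize every slot to NULL so cleanup can walk the whole table. */
static void priv_init(struct dev_priv *p)
{
	p->disposed = 0;
	for (int i = 0; i < MAX_GROUPS; i++)
		for (int j = 0; j < NUM_IRQS; j++)
			p->grp[i].irqinfo[j] = NULL;
}

/* Map the first n interrupts of group g; stops early if allocation fails. */
static int group_parse(struct dev_priv *p, int g, int n)
{
	assert(g >= 0 && g < MAX_GROUPS && n >= 0 && n <= NUM_IRQS);
	for (int j = 0; j < n; j++) {
		struct irq_info *info = malloc(sizeof(*info));

		if (!info)
			return -1; /* no cleanup here: the caller's error path does it */
		info->irq = (unsigned int)(g * NUM_IRQS + j + 1);
		p->grp[g].irqinfo[j] = info;
	}
	return 0;
}

/* NULL-guarded walk over the full table: dispose, free, clear. */
static void priv_free(struct dev_priv *p)
{
	for (int i = 0; i < MAX_GROUPS; i++)
		for (int j = 0; j < NUM_IRQS; j++) {
			struct irq_info *info = p->grp[i].irqinfo[j];

			if (!info)
				continue;
			dispose_mapping(p, info->irq);
			free(info);
			p->grp[i].irqinfo[j] = NULL;
		}
}
```

Because every slot starts as NULL and is reset to NULL after release, priv_free() is safe on a partially filled table and is idempotent when called twice.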
* Re: [PATCH net-next v1] tcp/dccp: avoid parity split for socket-local bind range
From: Kuniyuki Iwashima @ 2026-06-26 23:40 UTC
To: xuanqiang.luo
Cc: Eric Dumazet, Neal Cardwell, netdev, David S . Miller,
Jakub Kicinski, Paolo Abeni, Simon Horman, luoxuanqiang
In-Reply-To: <20260626093856.61864-1-xuanqiang.luo@linux.dev>
On Fri, Jun 26, 2026 at 2:40 AM <xuanqiang.luo@linux.dev> wrote:
>
> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
>
> IP_LOCAL_PORT_RANGE lets applications override the netns ephemeral port
> range on a per-socket basis. __inet_hash_connect() already treats such a
> range as an explicit application partition and scans it with step 1 [1].
>
> Do the same in inet_csk_find_open_port():
What's the use case of IP_LOCAL_PORT_RANGE + bind(, 0)
without IP_BIND_ADDRESS_NO_PORT ?
> when a socket-local range is set,
> walk the whole selected range instead of first splitting it by parity.
> Keep the existing step-2 parity behavior for sockets using the netns range,
> so the default bind/connect separation remains unchanged.
>
> [1] https://lore.kernel.org/r/20231214192939.1962891-3-edumazet@google.com
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn>
> ---
> net/ipv4/inet_connection_sock.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 56902bba54838..ad8af70c92ca3 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -323,13 +323,16 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
> struct inet_bind2_bucket *tb2;
> struct inet_bind_bucket *tb;
> u32 remaining, offset;
> + bool local_ports;
> bool relax = false;
> + int step;
>
> l3mdev = inet_sk_bound_l3mdev(sk);
> ports_exhausted:
> attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
> other_half_scan:
> - inet_sk_get_local_port_range(sk, &low, &high);
> + local_ports = inet_sk_get_local_port_range(sk, &low, &high);
> + step = local_ports ? 1 : 2;
> high++; /* [32768, 60999] -> [32768, 61000[ */
> if (high - low < 4)
> attempt_half = 0;
> @@ -342,18 +345,19 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
> low = half;
> }
> remaining = high - low;
> - if (likely(remaining > 1))
> + if (!local_ports && remaining > 1)
> remaining &= ~1U;
>
> offset = get_random_u32_below(remaining);
> /* __inet_hash_connect() favors ports having @low parity
> * We do the opposite to not pollute connect() users.
> */
> - offset |= 1U;
> + if (!local_ports)
> + offset |= 1U;
>
> other_parity_scan:
> port = low + offset;
> - for (i = 0; i < remaining; i += 2, port += 2) {
> + for (i = 0; i < remaining; i += step, port += step) {
> if (unlikely(port >= high))
> port -= remaining;
> if (inet_is_local_reserved_port(net, port))
> @@ -384,9 +388,11 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
> cond_resched();
> }
>
> - offset--;
> - if (!(offset & 1))
> - goto other_parity_scan;
> + if (!local_ports) {
> + offset--;
> + if (!(offset & 1))
> + goto other_parity_scan;
> + }
>
> if (attempt_half == 1) {
> /* OK we now try the upper half of the range */
> --
> 2.43.0
>
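The scan order changed by the quoted patch can be modelled in userspace C. This is a rough sketch under stated assumptions, not the kernel function: scan_order() is a hypothetical name, the attempt_half logic and all bucket checks are left out, and the caller-supplied rnd value replaces get_random_u32_below().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Record the order in which ports in [low, high] (inclusive) are visited.
 * local_ports selects the step-1 walk; otherwise the step-2 parity walk runs.
 * Returns the number of ports written to out (at most cap).
 */
static size_t scan_order(unsigned int low, unsigned int high, bool local_ports,
			 unsigned int rnd, unsigned int *out, size_t cap)
{
	unsigned int step = local_ports ? 1 : 2;
	unsigned int remaining, offset, port, i;
	size_t n = 0;

	assert(high >= low);
	high++; /* make the upper bound exclusive */
	remaining = high - low;
	if (!local_ports && remaining > 1)
		remaining &= ~1U;

	offset = rnd % remaining; /* stand-in for the random offset */
	if (!local_ports)
		offset |= 1U; /* start with the odd-offset half */

other_parity_scan:
	port = low + offset;
	for (i = 0; i < remaining; i += step, port += step) {
		if (port >= high)
			port -= remaining;
		if (n < cap)
			out[n++] = port;
	}

	if (!local_ports) {
		offset--;
		if (!(offset & 1))
			goto other_parity_scan; /* second pass: other parity */
	}
	return n;
}
```

With low = 10, high = 13 and rnd = 2, the step-1 walk visits 12, 13, 10, 11, while the parity walk visits 13, 11 and then 12, 10.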
* Re: [PATCH bpf-next v2 00/15] bpf: A common way to attach struct_ops to a cgroup
From: Roman Gushchin @ 2026-06-26 23:59 UTC
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, eddyz87, memxor,
martin.lau, shakeel.butt, kuniyu, kerneljasonxing, kernel-team
In-Reply-To: <20260623175006.3136053-1-ameryhung@gmail.com>
Amery Hung <ameryhung@gmail.com> writes:
> Hi,
>
> I am continuing Martin's work to support attaching struct_ops to
> cgroup.
Awesome, thank you for working on this!
I'm going to rebase bpf oom work on top of this patchset and will give
it some additional testing.
Thanks!
* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Jakub Kicinski @ 2026-06-27 0:33 UTC
To: Andrew Lunn
Cc: Maxime Chevallier, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <5b7dbdbc-93fd-4664-abad-0f47855fab55@lunn.ch>
On Fri, 26 Jun 2026 14:39:57 +0200 Andrew Lunn wrote:
> On Fri, Jun 26, 2026 at 10:33:50AM +0200, Maxime Chevallier wrote:
> >
> > > Sphinx follows Python's object-oriented structure. So you could have a
> > > class test_ethtool_pause_advertising, with class documentation. And
> > > then methods within the class which are individual tests. The
> > > commented out section would then be method documentation.
> >
> > Good point, so maybe something along these lines :
> >
> > - A class for the test
> > - methods for individual tests
> > - For readability, I've written what the internal test helper would look
> > like (_adv_test), and what a test would look like without the helper in
> > adv_rx_on_tx_on().
> >
> > I'm already diving into coding, but it helps me a bit in the definition of the
> > "description" format :)
> >
> > this is what the class would look like :
>
> I like this :-)
This is very far from what existing python tests do in netdev.
I would prefer to stick to the "bash on steroids" use of Python.
Are you both familiar with the existing tests?
* Re: [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
From: Jason Xing @ 2026-06-27 0:36 UTC
To: Fijalkowski, Maciej, Roman Gushchin
Cc: Zaremba, Larysa, netdev@vger.kernel.org, bpf@vger.kernel.org,
Karlsson, Magnus, stfomichev@gmail.com, kuba@kernel.org,
pabeni@redhat.com, horms@kernel.org, bjorn@kernel.org, Jason Xing
In-Reply-To: <IA1PR11MB60978AF3799FD895C53A860882EB2@IA1PR11MB6097.namprd11.prod.outlook.com>
On Fri, Jun 26, 2026 at 9:43 PM Fijalkowski, Maciej
<maciej.fijalkowski@intel.com> wrote:
>
> >
> > On Fri, Jun 26, 2026 at 7:12 PM Larysa Zaremba <larysa.zaremba@intel.com>
> > wrote:
> > >
> > > On Tue, Jun 23, 2026 at 03:32:34PM +0200, Maciej Fijalkowski wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > This patch is inspired by the check[1] from sashiko. It says when
> > > > overflow happens, the address of cq to be published is invalid.
> > > > Actually, the more severe problem is that the whole process of publishing the
> > > > address of cq in this particular case is not right: it should truly
> > > > publish the address and advance the cached_prod in cq as long as it
> > > > reads descriptors from txq.
> > > >
> > > > The following is the full analysis.
> > > > xsk_drop_skb() is called in three places, which all discard a partially
> > > > built multi-buffer skb:
> > > > 1) xsk_build_skb() -EOVERFLOW error path: packet exceeds MAX_SKB_FRAGS
> > > > 2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
> > > > the TX ring prevents the partial packet from completing
> > > > 3) xsk_release(): socket close while xs->skb holds an incomplete packet
> > > >
> > > > In all three cases, the TX descriptors for the already-processed frags
> > > > have been consumed from the TX ring (xskq_cons_release), and CQ slots
> > > > have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
> > > > which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
> > > > the buffer addresses never appear in the completion queue, userspace
> > > > permanently loses track of these buffers.
> > > >
> > > > Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
> > > > destructor, which already submits buffer addresses to the CQ via
> > > > xsk_cq_submit_addr_locked().
> > > >
> > > > Note that cancelling the descriptors back to the TX ring (via
> > > > xskq_cons_cancel_n) is not an appropriate option because an oversized
> > > > packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
> > > > which is an obvious deadlock bug in the TX path.
> > > >
> > > > Also move the desc->addr assignment in xsk_build_skb() above the
> > > > overflow check so that the current descriptor's address is recorded
> > > > before a potential -EOVERFLOW jump to free_err, consistent with the
> > > > zerocopy path in xsk_build_skb_zerocopy().
> > > >
> > > > [1]: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > >
> > > This change looks good, but the overflow case with only 1 descriptor worries me.
> >
> > I presume you referred to xsk_build_skb_zerocopy()?
> >
> > > In such cases, once we get to the following code, kfree_skb() has already happened:
> > >
> > > if (err == -EOVERFLOW) {
> > > if (xs->skb) {
> > > /* Drop the packet */
> > > xsk_inc_num_desc(xs->skb);
> > > xsk_drop_skb(xs->skb);
> > > } else {
> > > xsk_cq_cancel_locked(xs->pool, 1);
> > > xs->tx->invalid_descs++;
> > > }
> > > xskq_cons_release(xs->tx);
> > > }
> > >
> > > kfree_skb() should have resulted in submission of the single fat descriptor to
> > > xsk_cq_submit_addr_locked() via xsk_destruct_skb(), so far consistent with the
> >
> > At least, in the NO_LINEAR case, xsk_skb_init_misc() is not called
> > since the OVERFLOW skips this function, which means kfree_skb()
> > doesn't invoke xsk_destruct_skb() to publish it in the CQ. So it's
> > safe to cancel the cq reservation (in xsk_cq_cancel_locked(xs->pool,
> > 1)).
>
> (responding from outlook so apologies for any broken formatting)
>
> Yes, I have the same understanding here. However, how technically
> possible would it be to produce > MAX_SKB_FRAGS from a single
> AF_XDP descriptor?
Very unlikely. But my viewpoint might change after a wide deployment
internally in the second half of the year.
>
> I know Sashiko has pointed this out and you came up with previous
> fix, but for valid descriptor it is simply not possible. And invalid
> descs wouldn't reach this function.
Yep.
>
> I wouldn't like to stir up the pot too much so let us keep this
> code, but is there any way to give Sashiko additional context?
> I mean, for cases where we would say *this can't happen*, will
> it be able to carry this information onwards?
I don't know how sashiko works, sorry. Maybe @Roman Gushchin has
unique insights on this?
Thanks,
Jason
>
> >
> > Thanks,
> > Jason
> >
> > > multi-descriptor behavior you are proposing here.
> > >
> > > But what happens when we cancel a submitted CQ slot via
> > > xsk_cq_cancel_locked(xs->pool, 1) in the above code?
> > >
> > > >
> > > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > > net/xdp/xsk.c | 13 ++++++++-----
> > > > 1 file changed, 8 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index b970f30ea9b9..a7a83dc4546a 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -794,8 +794,11 @@ static void xsk_consume_skb(struct sk_buff *skb)
> > > >
> > > > static void xsk_drop_skb(struct sk_buff *skb)
> > > > {
> > > > - xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
> > > > - xsk_consume_skb(skb);
> > > > + struct xdp_sock *xs = xdp_sk(skb->sk);
> > > > +
> > > > + xs->tx->invalid_descs += xsk_get_num_desc(skb);
> > > > + consume_skb(skb);
> > > > + xs->skb = NULL;
> > > > }
> > > >
> > > > static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
> > > > @@ -877,7 +880,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > return ERR_PTR(-ENOMEM);
> > > >
> > > > /* in case of -EOVERFLOW that could happen below,
> > > > - * xsk_consume_skb() will release this node as whole skb
> > > > + * xsk_drop_skb() will release this node as whole skb
> > > > * would be dropped, which implies freeing all list elements
> > > > */
> > > > xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> > > > @@ -969,6 +972,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > goto free_err;
> > > > }
> > > >
> > > > + xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> > > > +
> > > > if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
> > > > err = -EOVERFLOW;
> > > > goto free_err;
> > > > @@ -986,8 +991,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >
> > > > skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
> > > > refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
> > > > -
> > > > - xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> > > > }
> > > > }
> > > >
> > > > --
> > > > 2.43.0
> > > >
> > > >
* Re: [PATCH v4] virtio_net: disable cb when NAPI is busy-polled
From: Jakub Kicinski @ 2026-06-27 0:44 UTC
To: Simon Horman, lange_tang
Cc: mst, xuanzhuo, jasowang, edumazet, virtualization, netdev,
tanglongjun
In-Reply-To: <20260626151508.1319440-1-horms@kernel.org>
On Fri, 26 Jun 2026 16:15:08 +0100 Simon Horman wrote:
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 26afa6341d161..c1e252400c0fc 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3011,6 +3011,8 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
> > unsigned int xdp_xmit = 0;
> > bool napi_complete;
> >
> > + virtqueue_disable_cb(rq->vq);
> > +
>
> [Severity: High]
> Can this unconditionally disable the RX callback and cause a permanent network
> stall when polled by netpoll?
Good catch. Longjun, just add if (budget).
* Re: [PATCH net-next v3 0/3] net: pse-pd: decouple controller lookup from MDIO probe
From: Jakub Kicinski @ 2026-06-27 0:46 UTC
To: Carlo Szelinsky
Cc: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
Russell King, David S . Miller, Eric Dumazet, Paolo Abeni,
Corey Leavitt, Jonas Jelonek, Simon Horman, netdev, linux-kernel
In-Reply-To: <20260626165929.2908782-1-github@szelinsky.de>
On Fri, 26 Jun 2026 18:59:26 +0200 Carlo Szelinsky wrote:
> Subject: [PATCH net-next v3 0/3] net: pse-pd: decouple controller lookup from MDIO probe
## Form letter - net-next-closed
We have already submitted our pull request with net-next material for v7.2,
and therefore net-next is closed for new drivers, features, code refactoring
and optimizations. We are currently accepting bug fixes only.
Please repost when net-next reopens after June 29th.
RFC patches sent for review only are obviously welcome at any time.
See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
--
pw-bot: defer
pv-bot: closed
* Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
From: Jason Xing @ 2026-06-27 0:47 UTC
To: Stanislav Fomichev
Cc: Maciej Fijalkowski, netdev, bpf, magnus.karlsson, stfomichev,
kuba, pabeni, horms, bjorn
In-Reply-To: <aj6mr-cIcF2Tg73r@devvm7509.cco0.facebook.com>
On Sat, Jun 27, 2026 at 12:30 AM Stanislav Fomichev
<sdf.kernel@gmail.com> wrote:
>
> On 06/26, Jason Xing wrote:
> > On Fri, Jun 26, 2026 at 12:05 AM Stanislav Fomichev
> > <sdf.kernel@gmail.com> wrote:
> > >
> > > On 06/25, Jason Xing wrote:
> > > > On Thu, Jun 25, 2026 at 12:37 AM Maciej Fijalkowski
> > > > <maciej.fijalkowski@intel.com> wrote:
> > > > >
> > > > > On Wed, Jun 24, 2026 at 08:38:20AM -0700, Stanislav Fomichev wrote:
> > > > > > On 06/23, Maciej Fijalkowski wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > This series fixes several AF_XDP multi-buffer Tx paths where descriptors
> > > > > > > consumed from the Tx ring are not consistently returned to userspace
> > > > > > > through the completion ring when the packet is later dropped as invalid.
> > > > > > >
> > > > > > > The affected cases are invalid or oversized multi-buffer Tx packets in
> > > > > > > both the generic and zero-copy paths. In these cases, the kernel can
> > > > > > > consume one or more Tx descriptors while building or validating a
> > > > > > > multi-buffer packet, then drop the packet before it reaches the device.
> > > > > > > Userspace still owns the UMEM buffers only after the corresponding
> > > > > > > addresses are returned through the CQ. Missing completions therefore
> > > > > > > make userspace lose track of those buffers.
> > > > > > >
> > > > > > > The generic path fixes cover three related cases:
> > > > > > > * partially built multi-buffer skbs dropped by xsk_drop_skb();
> > > > > > > > * continuation descriptors left in the Tx ring after xsk_build_skb()
> > > > > > > reports overflow;
> > > > > > > * invalid descriptors encountered in the middle of a multi-buffer
> > > > > > > packet, including the offending invalid descriptor itself.
> > > > > > >
> > > > > > > The zero-copy path is handled separately. The batched Tx parser now
> > > > > > > distinguishes descriptors that can be passed to the driver from
> > > > > > > descriptors that are consumed only because they belong to an invalid
> > > > > > > multi-buffer packet. Reclaim-only descriptors are written to the CQ
> > > > > > > address area and published in completion order, after any earlier
> > > > > > > driver-visible Tx descriptors.
> > > > > > >
> > > > > > > The ZC batching path can also retain drain state when userspace has not
> > > > > > > yet provided the end of an invalid multi-buffer packet. To keep this
> > > > > > > state local to the singular batched path, the series prevents a second
> > > > > > > Tx socket from joining the same pool while such drain state exists.
> > > > > > > During the singular-to-shared transition, Tx batching is gated,
> > > > > > > pre-existing readers are waited out, and bind fails with -EAGAIN if the
> > > > > > > existing socket still has pending drain state. This avoids adding
> > > > > > > multi-buffer drain handling to the shared-UMEM fallback path.
> > > > > > >
> > > > > > > The last two patches update xskxceiver so the tests account invalid
> > > > > > > multi-buffer Tx packets as descriptors that must be reclaimed, while
> > > > > > > still not expecting those invalid packets on the Rx side.
> > > > > > >
> > > > > > > This is a follow-up to Jason's changes [0] which were addressing generic
> > > > > > > xmit only and this set allows me to pass full xskxceiver test suite run
> > > > > > > against ice driver.
> > > > > >
> > > > > > There is a fair amount of feedback from sashiko already :-( So the meta
> > > > > > question from me is: is it time to scrap our current approach where
> > > > > > we parse descriptor by descriptor? (and maintain half-baked skb and
> > > > > > half-consumed descriptor queues)
> > > > > >
> > > > > > Should we:
> > > > > >
> > > > > > 1. do desc[MAX_SKB_FRAGS] and xskq_cons_peek_desc until we exhaust
> > > > > > PKT_CONT (if the last packet has PKT_CONT, return EOVERFLOW to userspace
> > > > > > and do a full stop here)
> > > > > > 2. now that we really know the number of valid descriptors -> reserve
> > > > > > the cq space (if not -> EAGAIN)
> > > > > > 3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
> > > > > > locally, don't ever create semi-initialized skb)
> > > > > > 4. construct the skb
> > > > > > 5. xmit
> > > > >
> > > > > Yeah generic xmit became utterly horrible, haven't gone through sashiko
> > > > > reviews yet, but bear in mind this set also aligns zc side to what was
> > > > > previously being addressed by Jason.
> > > > >
> > > > > I believe planned logistics were to get these fixes onto net and then
> > > > > Jason had an implementation of batching on generic xmit, directed towards
> > > > > -next and that's where we could address current flow.
> > > >
> > > > Agreed. That's what I'm hoping for. There would be much more
> > > > discussion on how to do batch xmit in an elegant way, I believe.
> > >
> > > This doesn't have to depend on the batch rewrite, we should be able to rewrite
> > > this non-zc in net, this is still technically fixes, not feature work..
> > >
> > > There were already a couple of revisions with this drain_cont approach
> > > and every time I look at it, it feels like the cure is worse than the
> > > disease :-( Obviously not gonna stop you from going with the current approach,
> > > but these fixes feel a bit of a wasted effort to me (since the bugs keep
> > > coming and we are piling more complexity).
> >
> > I see your point, but rewriting is something that cannot be easily
> > applied to the stable branches? Until now, we fix issues one by one
> > which have an explicit target branch (because of the fixes tag). Cross
> > fingers :(
> >
> > Sashiko has the magic to find out the hidden bugs more than ever and
> > AF_XDP is not the only place where a pile of reports are coming in.
>
> net vs net-next is fixes vs feature work. If we can't fix the current
> code, I think we can justify a rewrite using a better approach and
> route it via net. This series is 7 patches anyway, it's not like
> it is a quick short fix :-) But I'm ok with pushing it as is, I'm just
> trying to see if someone on your side is fed up with that part as well
> and wants to fix it "properly" :-p
>
> > My take is that batch xmit has been pending too long and at least so
> > far fewer and fewer bugs are found by sashiko. I believe if the mode is
> > changed to batch xmit, there are likely to be new and challenging
> > problems to discuss. I prefer to solve questions of the batch xmit
> > series.
>
> We can redo this part separately, without batching. Move from "read
> one chunk at a time" to "pre-read all chunks". Batching vs current issue
> are separate.
If the implementation of 'pre-read' is clear and simple, yes, it's a
better way. (I really want to engage myself in this right now, but
sorry, I can't since I'm writing many slides for Netdev.)
Probably Maciej will give it one last try; we'll see then.
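For reference, the 'pre-read' idea quoted above (peek the whole packet first, then reserve CQ space, then consume) could look roughly like this in userspace C. It is only a sketch under assumptions: struct desc, struct ring, MAX_FRAGS and gather_packet() are hypothetical names, the cont field models the PKT_CONT flag, and returning -EAGAIN when the end of the packet has not been posted yet is my guess rather than something the thread settled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FRAGS 4 /* hypothetical stand-in for MAX_SKB_FRAGS */

struct desc {
	unsigned long addr;
	bool cont; /* models the continuation (PKT_CONT) flag */
};

struct ring {
	const struct desc *d;
	size_t len;  /* descriptors posted by userspace */
	size_t cons; /* consumer index */
};

/*
 * Peek a whole packet into pkt[] before consuming anything.
 * Returns the descriptor count, or a negative errno with the
 * ring left untouched.
 */
static int gather_packet(struct ring *tx, size_t cq_free, struct desc *pkt)
{
	size_t n = 0;

	assert(tx && pkt);
	for (;;) {
		if (n == MAX_FRAGS)
			return -EOVERFLOW; /* last peeked descriptor still had cont set */
		if (tx->cons + n >= tx->len)
			return -EAGAIN; /* end of packet not posted yet (assumption) */
		pkt[n] = tx->d[tx->cons + n];
		if (!pkt[n++].cont)
			break;
	}

	if (cq_free < n)
		return -EAGAIN; /* not enough completion slots for the packet */

	tx->cons += n; /* consume only once nothing else can fail */
	return (int)n;
}
```

The point of the ordering is that every failure happens before the consumer index moves, so there is never a half-consumed packet to unwind.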
>
> > BTW, would you both come to Netdev 0x1a next month? I believe we could
> > sit around the table and discuss some future plans there (in xdp
> > workshop?).
> > https://netdevconf.info/0x1A/sessions/workshop/xdp-workshop.html
>
> Yes, I plan to be there in person.
Great.
Thanks,
Jason
* Re: [PATCH net v2] octeontx2-pf: check DMAC extraction support before filtering
From: Harshitha Ramamurthy @ 2026-06-27 0:50 UTC
To: nshettyj
Cc: netdev, linux-kernel, sgoutham, gakula, sbhatta, hkelam,
bbhushan2, andrew+netdev, davem, edumazet, kuba, pabeni, naveenm,
tduszynski, sumang
In-Reply-To: <20260626062329.871990-1-nshettyj@marvell.com>
On Thu, Jun 25, 2026 at 11:24 PM <nshettyj@marvell.com> wrote:
>
> From: Suman Ghosh <sumang@marvell.com>
>
> Currently, configuring a VF MAC address via the PF (e.g., 'ip link
> set <pf> vf 0 mac <mac>') blindly attempts to install a DMAC-based
> hardware filter. However, the hardware parser profile might not
> support DMAC extraction.
>
> Check if the hardware parsing profile supports DMAC extraction
> before adding the filter. Additionally, emit a warning message
> to inform the operator if the MAC filter installation fails due
> to missing DMAC extraction support.
>
> Fixes: f0c2982aaf98 ("octeontx2-pf: Add support for SR-IOV management functions")
> Signed-off-by: Suman Ghosh <sumang@marvell.com>
> Signed-off-by: Nitin Shetty J <nshettyj@marvell.com>
>
> ---
> v2:
> - Move the DMAC extraction check from otx2_set_vf_mac() into
> otx2_do_set_vf_mac() which already holds pf->mbox.lock, so all
> mbox operations are under a single lock/unlock pair. All error
> paths now use the existing goto-out pattern, eliminating the
> scattered mutex_unlock() + return calls from v1.
> - Return -EOPNOTSUPP instead of 0 when DMAC extraction is not
> supported, so the caller gets an explicit error rather than a
> silent success.
Please leave a gap of at least 24 hours before posting a new revision, and
also don't post patches in reply to a previous posting, as documented
in:
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html
> ---
> .../ethernet/marvell/octeontx2/nic/otx2_pf.c | 33 +++++++++++++++++++
> 1 file changed, 33 insertions(+)
>
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> index b63df5737ff2..dc7e4a225dd0 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> @@ -2517,10 +2517,43 @@ EXPORT_SYMBOL(otx2_config_hwtstamp_set);
>
> static int otx2_do_set_vf_mac(struct otx2_nic *pf, int vf, const u8 *mac)
> {
> + struct npc_get_field_status_req *freq;
> + struct npc_get_field_status_rsp *frsp;
> struct npc_install_flow_req *req;
> int err;
>
> mutex_lock(&pf->mbox.lock);
> +
> + /* Skip installing the DMAC filter if the hardware parser profile
> + * does not support DMAC extraction.
> + */
> + freq = otx2_mbox_alloc_msg_npc_get_field_status(&pf->mbox);
> + if (!freq) {
> + err = -ENOMEM;
> + goto out;
> + }
I noticed that otx2_set_vf_mac() copies the MAC address into the vf
config structure before the programming is successful. Is that
intended?
> +
> + freq->field = NPC_DMAC;
> + if (otx2_sync_mbox_msg(&pf->mbox)) {
> + err = -EINVAL;
> + goto out;
> + }
> +
> + frsp = (struct npc_get_field_status_rsp *)otx2_mbox_get_rsp
> + (&pf->mbox.mbox, 0, &freq->hdr);
> + if (IS_ERR(frsp)) {
> + err = PTR_ERR(frsp);
> + goto out;
> + }
> +
> + if (!frsp->enable) {
> + netdev_warn(pf->netdev,
> + "VF %d MAC filter not installed: DMAC extraction not supported by parser profile\n",
> + vf);
Would a netdev_warn_ratelimited() be better here to avoid spamming the log?
> + err = -EOPNOTSUPP;
> + goto out;
> + }
> +
> req = otx2_mbox_alloc_msg_npc_install_flow(&pf->mbox);
> if (!req) {
> err = -ENOMEM;
> --
> 2.48.1
>
>
* Re: [PATCH] MAINTAINERS: Update Jason Wang's email address
From: Jakub Kicinski @ 2026-06-27 1:04 UTC
To: Jason Wang; +Cc: mst, virtualization, netdev, eperezma, kvm, linux-kernel
In-Reply-To: <20260626022039.96139-1-jasowang@redhat.com>
On Fri, 26 Jun 2026 10:20:38 +0800 Jason Wang wrote:
> I will use jasowangio@gmail.com for future review and discussion
Do you want to add a mailmap entry, too?
Otherwise I think you'll get CCed twice (once for MAINTAINERS and once
because you have given tags to previous changes)
* Re: [PATCH net-next] caif: annotate phyinfo lookup under config lock
From: Jakub Kicinski @ 2026-06-27 1:07 UTC
To: Runyu Xiao
Cc: davem, edumazet, pabeni, horms, netdev, linux-kernel, jianhao.xu
In-Reply-To: <20260626042440.2013499-1-runyu.xiao@seu.edu.cn>
On Fri, 26 Jun 2026 12:24:40 +0800 Runyu Xiao wrote:
> cfcnfg_get_phyinfo_rcu() is used by both RCU read-side paths and config
> update paths that hold cnfg->lock before adding or deleting entries from
> cnfg->phys. The helper walks the list with list_for_each_entry_rcu(),
> but does not tell lockdep about the config-lock-protected callers.
>
> Pass lockdep_is_held(&cnfg->lock) to the iterator. RCU-reader callers
> remain valid, and CONFIG_PROVE_RCU_LIST can now see the non-RCU
> protection used by the add/delete paths.
>
> This was found by our static analysis tool and then manually reviewed
> against the current tree. The dynamic triage evidence is a
> target-matched CONFIG_PROVE_RCU_LIST warning; the change is limited
> to documenting the existing protection contract.
This code was removed a couple of releases ago.
* Re: [PATCH] selftests: Open /dev/udmabuf O_RDONLY
From: Jakub Kicinski @ 2026-06-27 1:09 UTC
To: T.J. Mercier
Cc: kraxel, vivek.kasireddy, Shuah Khan, Andrew Lunn, David S. Miller,
Eric Dumazet, Paolo Abeni, linux-kselftest, linux-kernel, netdev,
bpf
In-Reply-To: <20260625181557.1086105-1-tjmercier@google.com>
On Thu, 25 Jun 2026 11:15:55 -0700 T.J. Mercier wrote:
> Write permissions on the /dev/udmabuf device file are not required to
> issue ioctls and allocate udmabufs. Applications should be opening this
> file as O_RDONLY. The BPF dmabuf_iter selftest already does this. [1]
>
> Remove the write access mode from the drivers/dma-buf/udmabuf.c and
> drivers/net/hw/ncdevmem.c selftests.
You need to explain "why", too. Why change it if it clearly
worked for everyone running this test until now.
--
pw-bot: cr
* Re: [PATCH v2] netdevsim: fix use-after-free in nsim_create and __nsim_dev_port_del
From: Jakub Kicinski @ 2026-06-27 1:48 UTC
To: Hrushiraj Gandhi
Cc: Simon Horman, Andrew Lunn, David S . Miller, Eric Dumazet,
Paolo Abeni, Jiri Pirko, netdev, linux-kernel, bpf,
syzbot+6c25f4750230faf70be9
In-Reply-To: <20260623144447.255326-1-hrushirajg23@gmail.com>
On Tue, 23 Jun 2026 20:14:47 +0530 Hrushiraj Gandhi wrote:
> Fix both paths by calling debugfs_remove_recursive() on the port's
> ddir before every free_netdev() call. The subsequent
> nsim_dev_port_debugfs_exit() calls become harmless no-ops since ddir is
> set to NULL.
Looks like the wrong fix. All features clean up after themselves with
the exception of ethtool. Save the ethtool ddir and remove just that
one. This will align with how the other features behave.
--
pw-bot: cr
* [PATCH net] net/smc: fix UAF in smc_cdc_rx_handler() by pinning the socket
From: Xiang Mei @ 2026-06-27 1:49 UTC
To: D . Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
Mahanta Jambigi, Tony Lu, Wen Gu, netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Hans Wippel, linux-rdma, linux-s390, Weiming Shi,
Xiang Mei
smc_cdc_rx_handler() looks up the connection by token under the link
group's conns_lock, drops the lock, and then dereferences conn and the
smc_sock derived from it, ending in sock_hold(&smc->sk) inside
smc_cdc_msg_recv(). No reference is held across the lock release.
The only reference pinning the socket while the connection is
discoverable in the link group is taken in smc_lgr_register_conn()
(sock_hold) and dropped in __smc_lgr_unregister_conn() (sock_put), both
under conns_lock. Once the handler drops conns_lock, a concurrent
close() -> smc_release() -> smc_conn_free() -> smc_lgr_unregister_conn()
can drop that reference and free the smc_sock, so the handler's later
sock_hold() runs on freed memory:
WARNING: lib/refcount.c:25 at refcount_warn_saturate
Workqueue: rxe_wq do_work
refcount_warn_saturate (lib/refcount.c:25)
smc_cdc_msg_recv (net/smc/smc_cdc.c:430)
smc_cdc_rx_handler (net/smc/smc_cdc.c:502)
smc_wr_rx_tasklet_fn (net/smc/smc_wr.c:445)
tasklet_action_common (kernel/softirq.c:938)
handle_softirqs (kernel/softirq.c:622)
Kernel panic - not syncing: panic_on_warn set
Only SMC-R is affected. The SMC-D receive tasklet is stopped by
tasklet_kill(&conn->rx_tsklet) in smc_conn_free() before the connection
is unregistered, so it cannot run concurrently with the free.
Take the socket reference while still holding conns_lock, so the
registration reference can no longer be the last one, and drop it once
the handler is done.
Fixes: d7b0e37c1ac1 ("net/smc: restructure CDC message reception")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
net/smc/smc_cdc.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 619b3bab3824..b809139d7e87 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -483,21 +483,27 @@ static void smc_cdc_rx_handler(struct ib_wc *wc, void *buf)
lgr = smc_get_lgr(link);
read_lock_bh(&lgr->conns_lock);
conn = smc_lgr_find_conn(ntohl(cdc->token), lgr);
+ if (conn && !conn->out_of_sync)
+ sock_hold(&container_of(conn, struct smc_sock, conn)->sk);
+ else
+ conn = NULL;
read_unlock_bh(&lgr->conns_lock);
- if (!conn || conn->out_of_sync)
+ if (!conn)
return;
smc = container_of(conn, struct smc_sock, conn);
if (cdc->prod_flags.failover_validation) {
smc_cdc_msg_validate(smc, cdc, link);
- return;
+ goto out;
}
if (smc_cdc_before(ntohs(cdc->seqno),
conn->local_rx_ctrl.seqno))
/* received seqno is old */
- return;
+ goto out;
smc_cdc_msg_recv(smc, cdc);
+out:
+ sock_put(&smc->sk);
}
static struct smc_wr_rx_handler smc_cdc_rx_handlers[] = {
--
2.43.0
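
The rule the patch above applies (take the reference while the lookup lock is still held) can be sketched in userspace. This is only a single-threaded model, not the SMC code: conn_model, lgr_model, lookup_and_hold() and unregister_conn() are hypothetical names, and a C11 atomic_flag spinlock stands in for conns_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct conn_model {
	atomic_int refcnt;        /* plays the role of the socket refcount */
	unsigned int token;
	int out_of_sync;
};

struct lgr_model {
	atomic_flag lock;         /* plays the role of conns_lock */
	struct conn_model *conn;  /* the one registered connection, or NULL */
};

static void model_lock(struct lgr_model *l)
{
	while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
		; /* spin */
}

static void model_unlock(struct lgr_model *l)
{
	atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Drop one reference; returns 1 if it was the last one. */
static int conn_put(struct conn_model *c)
{
	int old = atomic_fetch_sub(&c->refcnt, 1);

	assert(old > 0);          /* a put on a zero count would be the bug */
	return old == 1;
}

/* Look up by token and pin the object before the lock is released. */
static struct conn_model *lookup_and_hold(struct lgr_model *l, unsigned int token)
{
	struct conn_model *c;

	model_lock(l);
	c = l->conn;
	if (c && c->token == token && !c->out_of_sync)
		atomic_fetch_add(&c->refcnt, 1);
	else
		c = NULL;
	model_unlock(l);
	return c;
}

/* Unregister: drop the registration reference under the same lock. */
static int unregister_conn(struct lgr_model *l)
{
	struct conn_model *c;
	int last = 0;

	model_lock(l);
	c = l->conn;
	l->conn = NULL;
	if (c)
		last = conn_put(c);
	model_unlock(l);
	return last;
}
```

In this model, an unregister that runs after lookup_and_hold() can no longer drop the last reference; the final put belongs to the handler, which mirrors the sock_put() at the out: label in the patch.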
* Re: [PATCH net v2] net: ipa: fix SMEM state handle leaks in SMP2P init
From: patchwork-bot+netdevbpf @ 2026-06-27 1:50 UTC (permalink / raw)
To: haoxiang_li2024
Cc: elder, andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
linux-kernel, stable
In-Reply-To: <20260624065955.2822765-1-haoxiang_li2024@163.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 24 Jun 2026 14:59:55 +0800 you wrote:
> ipa_smp2p_init() acquires two Qualcomm SMEM state handles with
> qcom_smem_state_get(). However, neither the init error paths
> nor ipa_smp2p_exit() release them.
>
> Release both handles with qcom_smem_state_put() in the init
> error paths and in ipa_smp2p_exit().
>
> [...]
Here is the summary with links:
- [net,v2] net: ipa: fix SMEM state handle leaks in SMP2P init
https://git.kernel.org/netdev/net/c/96ca1e658ae4
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net v2] net: liquidio: fix BAR resource leak on PF number failure
From: patchwork-bot+netdevbpf @ 2026-06-27 1:50 UTC (permalink / raw)
To: haoxiang_li2024
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, ricardo.farrington,
felix.manlunas, horms, netdev, linux-kernel, stable
In-Reply-To: <20260624064013.2809570-1-haoxiang_li2024@163.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 24 Jun 2026 14:40:13 +0800 you wrote:
> If cn23xx_get_pf_num() fails, the function returns without
> unmapping either BAR. Unmap both BARs before returning from
> the error path.
>
> Found by manual code review.
>
> Fixes: 0c45d7fe12c7 ("liquidio: fix use of pf in pass-through mode in a virtual machine")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
>
> [...]
Here is the summary with links:
- [net,v2] net: liquidio: fix BAR resource leak on PF number failure
https://git.kernel.org/netdev/net/c/c63ee62a3c4a
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net v2] net: pse-pd: scope pse_control regulator handle to kref lifetime
From: patchwork-bot+netdevbpf @ 2026-06-27 1:50 UTC (permalink / raw)
To: Carlo Szelinsky
Cc: o.rempel, kory.maincent, andrew+netdev, davem, edumazet, kuba,
pabeni, horms, corey, hkallweit1, linux, netdev, linux-kernel
In-Reply-To: <20260624204017.2752934-1-github@szelinsky.de>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 24 Jun 2026 22:40:16 +0200 you wrote:
> From: Corey Leavitt <corey@leavitt.info>
>
> __pse_control_release() drops psec->ps via devm_regulator_put(), which
> only succeeds if the devres entry added by the matching
> devm_regulator_get_exclusive() is still present on pcdev->dev at the
> time the pse_control's kref hits zero.
>
> [...]
Here is the summary with links:
- [net,v2] net: pse-pd: scope pse_control regulator handle to kref lifetime
https://git.kernel.org/netdev/net/c/16759757c4d2
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net-next v1] tcp/dccp: avoid parity split for socket-local bind range
From: luoxuanqiang @ 2026-06-27 1:59 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Eric Dumazet, Neal Cardwell, netdev, David S . Miller,
Jakub Kicinski, Paolo Abeni, Simon Horman, luoxuanqiang
In-Reply-To: <CAAVpQUCayy3o59i2vh9hHRPi-3pw1BJgEYMwZYRpZnYEUoqsGw@mail.gmail.com>
> On 27 Jun 2026, at 07:40, Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Fri, Jun 26, 2026 at 2:40 AM <xuanqiang.luo@linux.dev> wrote:
>>
>> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
>>
>> IP_LOCAL_PORT_RANGE lets applications override the netns ephemeral port
>> range on a per-socket basis. __inet_hash_connect() already treats such a
>> range as an explicit application partition and scans it with step 1 [1].
>>
>> Do the same in inet_csk_find_open_port():
>
> What's the use case of IP_LOCAL_PORT_RANGE + bind(, 0)
> without IP_BIND_ADDRESS_NO_PORT ?
Hi Kuniyuki,
Thanks for the question!
The use case is when an application wants to restrict ephemeral port
allocation to a socket-local IP_LOCAL_PORT_RANGE, but still needs
bind(..., 0) to allocate and reserve a local port immediately.
IP_BIND_ADDRESS_NO_PORT is useful when the application can defer port
allocation until connect(), but it changes this behavior: bind(..., 0)
does not reserve a port in that case. So it is not a replacement for
applications that need the local port before connect(), for example to
publish it to another component or set up local policy.
This patch is also intended to keep the bind(..., 0) path consistent with
Eric's earlier change in __inet_hash_connect().
Thanks,
Xuanqiang
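
For illustration, the userspace side of the use case described above might look roughly like the sketch below. It assumes the encoding documented in ip(7) for IP_LOCAL_PORT_RANGE (upper bound in the high 16 bits, lower bound in the low 16 bits, both inclusive); pack_port_range() and bind_in_range() are hypothetical helper names, and the fallback define is only there for older libc headers.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_LOCAL_PORT_RANGE
#define IP_LOCAL_PORT_RANGE 51	/* value from the Linux uapi header */
#endif

/* High 16 bits: upper bound, low 16 bits: lower bound (both inclusive). */
static uint32_t pack_port_range(uint16_t lo, uint16_t hi)
{
	return ((uint32_t)hi << 16) | lo;
}

/*
 * Restrict the socket to [lo, hi], then let bind(..., 0) pick and reserve
 * a port right away. Returns the bound fd (port stored in *port, host
 * byte order) or -1 on any failure, e.g. a kernel without the option.
 */
static int bind_in_range(uint16_t lo, uint16_t hi, uint16_t *port)
{
	struct sockaddr_in addr = { .sin_family = AF_INET };
	socklen_t alen = sizeof(addr);
	uint32_t range = pack_port_range(lo, hi);
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* port 0: ask the kernel to choose */

	if (setsockopt(fd, IPPROTO_IP, IP_LOCAL_PORT_RANGE,
		       &range, sizeof(range)) < 0 ||
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(fd, (struct sockaddr *)&addr, &alen) < 0) {
		close(fd);
		return -1;
	}
	*port = ntohs(addr.sin_port);
	return fd;
}
```

The port is known as soon as bind_in_range() returns, before any connect(), which is the property that IP_BIND_ADDRESS_NO_PORT gives up.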
* Re: [PATCH] qede: fix out-of-bounds check for cqe->len_list[]
From: patchwork-bot+netdevbpf @ 2026-06-27 2:00 UTC (permalink / raw)
To: Matvey Kovalev
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, Pavel.Zhigulin,
netdev, linux-kernel, lvc-project
In-Reply-To: <20260623144602.3521-1-matvey.kovalev@ispras.ru>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 23 Jun 2026 17:45:54 +0300 you wrote:
> Move index check before element access.
>
> Fixes: 896f1a2493b5 ("net: qlogic/qede: fix potential out-of-bounds read in qede_tpa_cont() and qede_tpa_end()")
> Found by Linux Verification Center (linuxtesting.org) with SVACE.
>
> Signed-off-by: Matvey Kovalev <matvey.kovalev@ispras.ru>
>
> [...]
Here is the summary with links:
- qede: fix out-of-bounds check for cqe->len_list[]
https://git.kernel.org/netdev/net/c/f9ba47fce593
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
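
"Move index check before element access" names a generic pattern, which can be sketched as follows. This is not the driver code: count_leading_nonzero() is a hypothetical helper, and the only point is the order of the two operands of &&.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count the leading non-zero entries of a fixed-size list.
 * The bound check comes first, so list[i] is only evaluated when i < n.
 * With the operands swapped, a list without a zero terminator would be
 * read one element past its end before the loop stops.
 */
static size_t count_leading_nonzero(const uint16_t *list, size_t n)
{
	size_t i;

	for (i = 0; i < n && list[i]; i++)
		;
	return i;
}
```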
* Re: [PATCH 0/2] net/sched: finish the qdisc_dequeue_peeked conversion (taprio, multiq)
From: patchwork-bot+netdevbpf @ 2026-06-27 2:00 UTC (permalink / raw)
To: Bryam Vargas
Cc: vinicius.gomes, pabeni, jhs, jiri, kuba, davem, edumazet, horms,
netdev, jarkao2, vladimir.oltean, linux-kernel
In-Reply-To: <20260625-b4-disp-31bcb279-v1-0-85c40b83c529@proton.me>
Hello:
This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 25 Jun 2026 04:51:18 -0500 you wrote:
> Commit 77be155cba4e added peek emulation: a non-work-conserving qdisc's
> ->peek dequeues one skb and stashes it in the child's gso_skb. A parent
> that peeks such a child must then take the packet with
> qdisc_dequeue_peeked(), not a direct ->dequeue(), or the stashed skb is
> bypassed and the child's qlen/backlog desync. sch_red and sch_sfb were
> just fixed for this; taprio and multiq still take the direct path.
>
> [...]
Here is the summary with links:
- [1/2] net/sched: sch_taprio: Replace direct dequeue call with peek and qdisc_dequeue_peeked
https://git.kernel.org/netdev/net/c/e056e1dfcddc
- [2/2] net/sched: sch_multiq: Replace direct dequeue call with peek and qdisc_dequeue_peeked
https://git.kernel.org/netdev/net/c/54f6b0c843e2
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
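
The peek emulation described in the quoted cover letter can be modeled in a few lines. This is a toy sketch with integers as packets: child_model and the child_*() helpers are hypothetical names, with stash standing in for the child's gso_skb.

```c
#include <stddef.h>

#define QLEN 8

/* Toy child qdisc: a FIFO plus a one-slot stash for the peeked packet. */
struct child_model {
	int queue[QLEN];
	size_t head, tail;	/* valid entries are queue[head..tail) */
	int stash;
	int has_stash;
};

static int child_enqueue(struct child_model *c, int pkt)
{
	if (c->tail == QLEN)
		return 0;
	c->queue[c->tail++] = pkt;
	return 1;
}

/* Direct dequeue: knows nothing about the stash. */
static int child_dequeue(struct child_model *c, int *pkt)
{
	if (c->head == c->tail)
		return 0;
	*pkt = c->queue[c->head++];
	return 1;
}

/* Emulated peek: dequeue one packet and keep it in the stash. */
static int child_peek(struct child_model *c, int *pkt)
{
	if (!c->has_stash) {
		if (!child_dequeue(c, &c->stash))
			return 0;
		c->has_stash = 1;
	}
	*pkt = c->stash;
	return 1;
}

/* What a parent should call after a peek: drain the stash first. */
static int child_dequeue_peeked(struct child_model *c, int *pkt)
{
	if (c->has_stash) {
		*pkt = c->stash;
		c->has_stash = 0;
		return 1;
	}
	return child_dequeue(c, pkt);
}
```

With packets 1 and 2 queued, a peek followed by child_dequeue() hands the parent packet 2 while 1 stays stashed; child_dequeue_peeked() returns 1 first, which is the ordering the series restores.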
* Re: [PATCH net v2] seg6: validate SRH length before reading fixed fields
From: patchwork-bot+netdevbpf @ 2026-06-27 2:00 UTC (permalink / raw)
To: Nuoqi Gui
Cc: davem, edumazet, kuba, pabeni, horms, andrea.mayer, netdev, bpf,
linux-kernel, m.xhonneux, daniel, dlebrun
In-Reply-To: <20260623-f01-17-seg6-srh-len-v2-1-2edc40e9e3e1@mails.tsinghua.edu.cn>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 23 Jun 2026 18:32:31 +0800 you wrote:
> seg6_validate_srh() reads fixed SRH fields such as srh->type and
> srh->hdrlen before checking that the supplied length covers the fixed
> struct ipv6_sr_hdr fields.
>
> The BPF SEG6 encap path reaches this with a BPF program-supplied pointer
> and length: bpf_lwt_push_encap() and the SEG6 local BPF END_B6 and
> END_B6_ENCAP actions call bpf_push_seg6_encap(), which forwards the
> length to seg6_validate_srh() with no minimum-size guard. A 2-byte SEG6
> encap header can therefore make the validator read srh->type at offset 2
> beyond the caller-supplied buffer.
>
> [...]
Here is the summary with links:
- [net,v2] seg6: validate SRH length before reading fixed fields
https://git.kernel.org/netdev/net/c/a75d99f46bf2
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
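
The ordering the quoted commit message asks for (check that the length covers the fixed fields, then read them) can be sketched as below. fixed_hdr is a hypothetical 8-byte layout loosely modeled on a routing header, not struct ipv6_sr_hdr itself, and the hdrlen rule used here (8-octet units, not counting the first 8 octets) is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed header; 8 bytes in total. */
struct fixed_hdr {
	uint8_t nexthdr;
	uint8_t hdrlen;		/* 8-octet units, excluding the first 8 octets */
	uint8_t type;
	uint8_t segments_left;
	uint8_t first_segment;
	uint8_t flags;
	uint16_t tag;
};

/* Returns 1 if buf holds a header whose declared length matches len. */
static int validate_hdr(const void *buf, size_t len)
{
	struct fixed_hdr h;

	/* Minimum-size guard first: nothing is read from a short buffer. */
	if (len < sizeof(h))
		return 0;

	/* Only now is it safe to look at the fixed fields. */
	memcpy(&h, buf, sizeof(h));
	if (((size_t)h.hdrlen + 1) * 8 != len)
		return 0;
	return 1;
}
```

A 2-byte buffer is rejected by the first check, so the type field at offset 2 is never touched.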
* [PATCH ipsec] xfrm: fix sk_dst_cache double-free in xfrm_user_policy()
From: Xiang Mei (Microsoft) @ 2026-06-27 2:40 UTC (permalink / raw)
To: steffen.klassert, herbert, davem, netdev
Cc: edumazet, kuba, pabeni, linux-kernel, AutonomousCodeSecurity,
tgopinath, kys, Xiang Mei (Microsoft)
From: "Xiang Mei (Microsoft)" <xmei5@asu.edu>
xfrm_user_policy() clears the socket dst cache with __sk_dst_reset(),
i.e. the non-atomic __sk_dst_set(sk, NULL): it reads sk_dst_cache with
rcu_dereference_protected(), stores NULL and dst_release()s the old dst.
That is only safe if no other thread modifies sk_dst_cache concurrently.
For a connected UDP socket that does not hold: the transmit fast path
(udp_sendmsg -> sk_dst_check -> sk_dst_reset) resets the cache locklessly
with an atomic xchg(). A per-socket policy change racing a send can make
both sides observe the same old dst and each dst_release() it, dropping
the socket's single reference twice and freeing the xfrm_dst bundle while
it is still referenced:
BUG: KASAN: slab-use-after-free in dst_release
Write of size 4 at addr ffff88801897b6c0 by task exploit/155
Call Trace:
...
dst_release (... ./include/linux/rcuref.h:109)
xfrm_user_policy (./include/net/sock.h:2239 ./include/net/sock.h:2256 net/xfrm/xfrm_state.c:3053)
do_ip_setsockopt (net/ipv4/ip_sockglue.c:1347)
ip_setsockopt (net/ipv4/ip_sockglue.c:1417)
do_sock_setsockopt (net/socket.c:2368)
__sys_setsockopt (net/socket.c:2393)
__x64_sys_setsockopt (net/socket.c:2396)
do_syscall_64 (arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
Reachable by an unprivileged user via a user+network namespace.
Use the atomic sk_dst_reset() so the cache is cleared and released with a
single xchg(): whichever side wins releases the dst once, the other sees
NULL and does nothing. Behaviour is otherwise unchanged.
Fixes: 2b06cdf3e688 ("xfrm: Clear sk_dst_cache when applying per-socket policy.")
Fixes: be8f8284cd89 ("net: xfrm: allow clearing socket xfrm policies.")
Reported-by: AutonomousCodeSecurity@microsoft.com
Signed-off-by: Xiang Mei (Microsoft) <xmei5@asu.edu>
---
net/xfrm/xfrm_state.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index c58cd024e3c6..08ba6805ddb3 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -3010,7 +3010,7 @@ int xfrm_user_policy(struct sock *sk, int optname, sockptr_t optval, int optlen)
if (sockptr_is_null(optval) && !optlen) {
xfrm_sk_policy_insert(sk, XFRM_POLICY_IN, NULL);
xfrm_sk_policy_insert(sk, XFRM_POLICY_OUT, NULL);
- __sk_dst_reset(sk);
+ sk_dst_reset(sk);
return 0;
}
@@ -3050,7 +3050,7 @@ int xfrm_user_policy(struct sock *sk, int optname, sockptr_t optval, int optlen)
if (err >= 0) {
xfrm_sk_policy_insert(sk, err, pol);
xfrm_pol_put(pol);
- __sk_dst_reset(sk);
+ sk_dst_reset(sk);
err = 0;
}
--
2.43.0
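
The "whichever side wins releases the dst once" argument from the commit message above can be modeled with C11 atomics. This is a userspace sketch, not the kernel helpers: dst_model, sock_model and reset_atomic() are hypothetical names, and the two resets run back to back in one thread rather than racing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for a dst entry and a socket's dst cache. */
struct dst_model {
	atomic_int refcnt;
};

struct sock_model {
	_Atomic(struct dst_model *) dst_cache;
};

static void dst_release_model(struct dst_model *d)
{
	if (d) {
		int old = atomic_fetch_sub(&d->refcnt, 1);

		assert(old > 0);	/* a second release would trip this */
	}
}

/*
 * Atomic reset: swap NULL into the cache and release whatever was there.
 * Two callers can never both get the same non-NULL pointer back from the
 * exchange, so the cache's single reference is dropped exactly once.
 */
static struct dst_model *reset_atomic(struct sock_model *sk)
{
	struct dst_model *old =
		atomic_exchange(&sk->dst_cache, (struct dst_model *)NULL);

	dst_release_model(old);
	return old;
}
```

A read-then-store reset has no such guarantee: two callers can both read the old pointer before either stores NULL, and each then releases it.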
* [PATCH] xfrm: cache the offload ifindex for netlink dumps
From: Cen Zhang @ 2026-06-27 3:00 UTC (permalink / raw)
To: Steffen Klassert, Herbert Xu, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: netdev, linux-kernel, baijiaju1990, zzzccc427
Validation reproduced this kernel report:
Oops: general protection fault
Call Trace:
<TASK>
copy_to_user_state_extra+0xb8d/0x1370 [xfrm_user]
? __pfx_copy_to_user_state_extra+0x10/0x10 [xfrm_user]
? __asan_memset+0x23/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? __alloc_skb+0x342/0x960
? srso_alias_return_thunk+0x5/0xfbef5
? __asan_memset+0x23/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? __nlmsg_put+0x147/0x1b0
dump_one_state+0x1c7/0x3e0 [xfrm_user]
xfrm_state_netlink+0xcb/0x130 [xfrm_user]
? __pfx_xfrm_state_netlink+0x10/0x10 [xfrm_user]
? srso_alias_return_thunk+0x5/0xfbef5
? xfrm_user_state_lookup.constprop.0+0x230/0x310 [xfrm_user]
xfrm_get_sa+0x102/0x250 [xfrm_user]
? __pfx_xfrm_get_sa+0x10/0x10 [xfrm_user]
xfrm_user_rcv_msg+0x504/0xaa0 [xfrm_user]
? __pfx_xfrm_user_rcv_msg+0x10/0x10 [xfrm_user]
? srso_alias_return_thunk+0x5/0xfbef5
? stack_trace_save+0x8e/0xc0
? __pfx_stack_trace_save+0x10/0x10
netlink_rcv_skb+0x11f/0x350
? __pfx_xfrm_user_rcv_msg+0x10/0x10 [xfrm_user]
? __pfx_netlink_rcv_skb+0x10/0x10
? __pfx_mutex_lock+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
xfrm_netlink_rcv+0x65/0x80 [xfrm_user]
netlink_unicast+0x600/0x870
? __pfx_netlink_unicast+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? __pfx_stack_trace_save+0x10/0x10
netlink_sendmsg+0x75d/0xc10
? __pfx_netlink_sendmsg+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
____sys_sendmsg+0x77a/0x900
? srso_alias_return_thunk+0x5/0xfbef5
? __pfx_____sys_sendmsg+0x10/0x10
? __pfx_copy_msghdr_from_user+0x10/0x10
? release_sock+0x1a/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? netlink_insert+0x143/0xec0
___sys_sendmsg+0xff/0x180
? __pfx____sys_sendmsg+0x10/0x10
? _raw_spin_lock_irqsave+0x85/0xe0
? do_getsockname+0xf9/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? fdget+0x53/0x3b0
__sys_sendmsg+0x111/0x1a0
? __pfx___sys_sendmsg+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? __sys_getsockname+0x8c/0x100
do_syscall_64+0x102/0x5a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
copy_to_user_state_extra() only holds a reference to the outer xfrm_state.
That does not pin x->xso.dev. NETDEV_DOWN and NETDEV_UNREGISTER can race
through xfrm_dev_state_flush(), xfrm_state_delete(), and
xfrm_dev_state_free(), which clears xso->dev and drops the netdev
reference before the GETSA dump reaches xso_to_xuo() and reads
xso->dev->ifindex.
The buggy scenario involves two paths, with each column showing the order
within that path:
XFRM_MSG_GETSA dump path: NETDEV teardown path:
1. xfrm_get_sa() gets xfrm_state 1. xfrm_dev_state_flush() finds x
2. copy_to_user_state_extra() sees 2. xfrm_state_delete() removes x
x->xso.dev from the SAD
3. copy_user_offload() calls 3. xfrm_dev_state_free() clears
xso_to_xuo() xso->dev
4. xso->dev->ifindex dereferences 4. netdev_put() drops the device
a detached net_device reference
Avoid following the live net_device from the dump paths. Cache the
attached ifindex in xfrm_dev_offload when state or policy offload is bound
to a device, and serialize that snapshot instead. This preserves the
user-visible XFRMA_OFFLOAD_DEV value without depending on the embedded
net_device lifetime.
Fixes: 07b87f9eea0c ("xfrm: Fix unregister netdevice hang on hardware offload.")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
---
include/net/xfrm.h | 2 ++
net/xfrm/xfrm_device.c | 1 +
net/xfrm/xfrm_state.c | 1 +
net/xfrm/xfrm_user.c | 38 +++++++++++++++++++++++++++++---------
4 files changed, 33 insertions(+), 9 deletions(-)
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 519a0156a05c..a6d69aaa6cd2 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -162,6 +162,8 @@ struct xfrm_dev_offload {
*/
struct net_device *real_dev;
unsigned long offload_handle;
+ /* Snapshot the attached device index for dump paths. */
+ int ifindex;
u8 dir : 2;
u8 type : 2;
u8 flags : 2;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 630f3dd31cc5..44bfaa04e621 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -313,6 +313,7 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
}
xso->dev = dev;
+ xso->ifindex = dev->ifindex;
netdev_tracker_alloc(dev, &xso->dev_tracker, GFP_ATOMIC);
if (xuo->flags & XFRM_OFFLOAD_INBOUND)
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index c58cd024e3c6..707e29c82020 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1547,6 +1547,7 @@ xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
xso->type = XFRM_DEV_OFFLOAD_PACKET;
xso->dir = xdo->dir;
xso->dev = dev;
+ xso->ifindex = dev->ifindex;
xso->flags = XFRM_DEV_OFFLOAD_FLAG_ACQ;
netdev_hold(dev, &xso->dev_tracker, GFP_ATOMIC);
error = dev->xfrmdev_ops->xdo_dev_state_add(dev, x,
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 6384795ee6b2..0eb87fc998d1 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1201,17 +1201,26 @@ static int copy_sec_ctx(struct xfrm_sec_ctx *s, struct sk_buff *skb)
return 0;
}
-static void xso_to_xuo(const struct xfrm_dev_offload *xso,
- struct xfrm_user_offload *xuo)
+static void xso_to_xuo_ifindex(const struct xfrm_dev_offload *xso, int ifindex,
+ struct xfrm_user_offload *xuo)
{
- xuo->ifindex = xso->dev->ifindex;
+ xuo->ifindex = ifindex;
if (xso->dir == XFRM_DEV_OFFLOAD_IN)
xuo->flags = XFRM_OFFLOAD_INBOUND;
if (xso->type == XFRM_DEV_OFFLOAD_PACKET)
xuo->flags |= XFRM_OFFLOAD_PACKET;
}
-static int copy_user_offload(struct xfrm_dev_offload *xso, struct sk_buff *skb)
+#ifdef CONFIG_XFRM_MIGRATE
+static void xso_to_xuo(const struct xfrm_dev_offload *xso,
+ struct xfrm_user_offload *xuo)
+{
+ xso_to_xuo_ifindex(xso, xso->dev->ifindex, xuo);
+}
+#endif
+
+static int copy_user_offload_ifindex(const struct xfrm_dev_offload *xso,
+ int ifindex, struct sk_buff *skb)
{
struct xfrm_user_offload *xuo;
struct nlattr *attr;
@@ -1222,11 +1231,22 @@ static int copy_user_offload(struct xfrm_dev_offload *xso, struct sk_buff *skb)
xuo = nla_data(attr);
memset(xuo, 0, sizeof(*xuo));
- xso_to_xuo(xso, xuo);
+ xso_to_xuo_ifindex(xso, ifindex, xuo);
return 0;
}
+static int copy_user_offload(struct xfrm_dev_offload *xso, struct sk_buff *skb)
+{
+ return copy_user_offload_ifindex(xso, xso->dev->ifindex, skb);
+}
+
+static int copy_user_state_offload(const struct xfrm_dev_offload *xso,
+ struct sk_buff *skb)
+{
+ return copy_user_offload_ifindex(xso, READ_ONCE(xso->ifindex), skb);
+}
+
static bool xfrm_redact(void)
{
return IS_ENABLED(CONFIG_SECURITY) &&
@@ -1433,8 +1453,8 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
&x->replay);
if (ret)
goto out;
- if(x->xso.dev)
- ret = copy_user_offload(&x->xso, skb);
+ if (READ_ONCE(x->xso.dev))
+ ret = copy_user_state_offload(&x->xso, skb);
if (ret)
goto out;
if (x->if_id) {
@@ -4046,8 +4066,8 @@ static inline unsigned int xfrm_sa_len(struct xfrm_state *x)
l += nla_total_size(sizeof(*x->coaddr));
if (x->props.extra_flags)
l += nla_total_size(sizeof(x->props.extra_flags));
- if (x->xso.dev)
- l += nla_total_size(sizeof(struct xfrm_user_offload));
+ if (READ_ONCE(x->xso.dev))
+ l += nla_total_size(sizeof(struct xfrm_user_offload));
if (x->props.smark.v | x->props.smark.m) {
l += nla_total_size(sizeof(x->props.smark.v));
l += nla_total_size(sizeof(x->props.smark.m));
--
2.43.0
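
The idea of serializing a cached ifindex instead of following the device pointer can be sketched as below. This is a simplified model, not the xfrm structures: netdev_model, offload_model and the offload_*() helpers are hypothetical names, and the READ_ONCE() annotations from the patch have no counterpart in this single-threaded sketch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for net_device and xfrm_dev_offload. */
struct netdev_model {
	int ifindex;
};

struct offload_model {
	struct netdev_model *dev;	/* may be cleared by teardown */
	int ifindex;			/* snapshot taken at bind time */
};

/* Bind: remember the device and snapshot its index. */
static void offload_bind(struct offload_model *o, struct netdev_model *dev)
{
	o->dev = dev;
	o->ifindex = dev->ifindex;
}

/* Teardown: only the pointer is cleared; the snapshot stays readable. */
static void offload_unbind(struct offload_model *o)
{
	o->dev = NULL;
}

/* Dump path, step 1: is there an offload to report? */
static int offload_attached(const struct offload_model *o)
{
	return o->dev != NULL;
}

/* Dump path, step 2: report the snapshot, never o->dev->ifindex. */
static int offload_dump_ifindex(const struct offload_model *o)
{
	return o->ifindex;
}
```

If teardown runs between the two dump steps, step 2 still returns a plain integer and does not dereference a detached device.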
* [PATCH] xfrm: clear mode callbacks after failed mode setup
From: Cen Zhang @ 2026-06-27 3:01 UTC (permalink / raw)
To: Steffen Klassert, Herbert Xu, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Christian Hopps
Cc: netdev, linux-kernel, baijiaju1990, zzzccc427
xfrm_state_gc_task can run long after a failed IPTFS state setup. In the
reproduced case, __xfrm_init_state() cached x->mode_cbs, IPTFS setup
returned -ENOMEM before publishing mode_data, and the temporary module
reference from xfrm_get_mode_cbs() was dropped immediately. The dead state
then kept x->mode_cbs until deferred GC ran after xfrm_iptfs had been
unloaded.
Clear x->mode_cbs when mode init or clone fails before publishing
mode_data. Those states never installed mode-specific state or the
long-term IPTFS module pin, so deferred GC has nothing mode-specific to
destroy and must not retain a callback table pointer past the temporary
lookup reference.
The buggy scenario involves two paths, with each column showing the order
within that path:
failed setup path:
1. cache x->mode_cbs
2. mode setup fails before mode_data
3. drop the temporary module ref
4. dead state keeps x->mode_cbs cached
GC/unload path:
1. xfrm_state_put() queues GC work
2. xfrm_iptfs unloads later
3. xfrm_state_gc_task runs
4. GC dereferences stale x->mode_cbs
This also covers the failed clone path where clone_state() returns before
publishing mode_data.
Validation reproduced this kernel report:
Kernel panic - not syncing: Fatal exception
CONFIG_FAULT_INJECTION_STACKTRACE_FILTER=y
failslab_stacktrace_filter matched xfrm_iptfs frames
ack_error=-12
FAULT_INJECTION: forcing a failure
BUG: unable to handle page fault
Workqueue: events xfrm_state_gc_task
RIP: xfrm_state_gc_task+0x142/0x650
Modules linked in: esp4_offload xfrm_user [last unloaded: xfrm_iptfs]
Kernel panic - not syncing: Fatal exception
Fixes: 4b3faf610cc6 ("xfrm: iptfs: add new iptfs xfrm mode impl")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
---
net/xfrm/xfrm_state.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index c58cd024e3c6..4d95b2720894 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -2071,8 +2071,11 @@ static struct xfrm_state *xfrm_state_clone_and_setup(struct xfrm_state *orig,
x->mode_cbs = orig->mode_cbs;
if (x->mode_cbs && x->mode_cbs->clone_state) {
- if (x->mode_cbs->clone_state(x, orig))
+ if (x->mode_cbs->clone_state(x, orig)) {
+ if (!x->mode_data)
+ x->mode_cbs = NULL;
goto error;
+ }
}
x->props.reqid = m->new_reqid;
@@ -3291,6 +3294,8 @@ int __xfrm_init_state(struct xfrm_state *x, struct netlink_ext_ack *extack)
if (x->mode_cbs->init_state)
err = x->mode_cbs->init_state(x);
module_put(x->mode_cbs->owner);
+ if (err && !x->mode_data)
+ x->mode_cbs = NULL;
}
error:
return err;
--
2.43.0
* [PATCH] fix: net/batman-adv: batadv_interface_kill_vid: extra batadv_meshif_vlan_put after destroy
From: WenTao Liang @ 2026-06-27 3:46 UTC (permalink / raw)
To: marek.lindner, sw, antonio, sven, davem, edumazet, kuba, pabeni
Cc: horms, b.a.t.m.a.n, netdev, linux-kernel, WenTao Liang, stable
In batadv_interface_kill_vid(), batadv_meshif_vlan_get() acquires a
reference on the vlan object. batadv_meshif_destroy_vlan() internally
calls batadv_meshif_vlan_put() which balances that reference. However, an
additional batadv_meshif_vlan_put(vlan) is called after
batadv_meshif_destroy_vlan(), causing a refcount underflow and potential
use-after-free of the vlan object.
Remove the extra batadv_meshif_vlan_put(vlan) call.
Cc: stable@vger.kernel.org
Fixes: 5d2c05b21337 ("batman-adv: add per VLAN interface attribute framework")
Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
---
net/batman-adv/mesh-interface.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/net/batman-adv/mesh-interface.c b/net/batman-adv/mesh-interface.c
index e7aa45bc6b7a..cc974f243200 100644
--- a/net/batman-adv/mesh-interface.c
+++ b/net/batman-adv/mesh-interface.c
@@ -691,9 +691,6 @@ static int batadv_interface_kill_vid(struct net_device *dev, __be16 proto,
batadv_meshif_destroy_vlan(bat_priv, vlan);
- /* finally free the vlan object */
- batadv_meshif_vlan_put(vlan);
-
return 0;
}
--
2.39.5 (Apple Git-154)