Netdev List


* [PATCH v3 3/4] vhost/vsock: re-scan TX virtqueue on device start
From: Andrey Drobyshev @ 2026-06-25 15:54 UTC
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260625155416.480669-1-andrey.drobyshev@virtuozzo.com>

During QEMU CPR live-update (and VHOST_RESET_OWNER in general) the guest
keeps running while the host drops and later re-attaches vhost backends.
If the guest adds a buffer to the TX virtqueue (guest->host) and kicks
while the backend is temporarily NULL (between vhost_vsock_drop_backends()
and the next vhost_vsock_start()), then the kick is delivered to the
vhost worker, handle_tx_kick() sees a NULL backend and returns, and the
kick signal is consumed.  The buffer is then left in the ring.

Then upon device start vhost_vsock_start() only re-kicks the RX send
worker, never the TX VQ, so the buffer is processed only if the guest
happens to kick again.  But if the guest itself is now waiting for data
from the host, it will never kick TX VQ again, and we end up in a
deadlock.

The issue itself is pre-existing, but it only manifests during a brief
pause caused by VHOST_RESET_OWNER.  Namely, the deadlock is reproduced
during active host->guest socat data transfer under multiple consecutive
CPR live-updates.

To fix this, in vhost_vsock_start(), after kicking the RX send worker, also
queue the TX vq poll so any buffers the guest enqueued while we were paused
get scanned.
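
The lost-kick scenario described above can be illustrated with a small
user-space model.  This is only a sketch under stated assumptions:
struct model_vq, model_guest_add(), model_kick() and model_start() are
hypothetical names, not kernel APIs, and the ring is reduced to a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of one TX virtqueue; not kernel code. */
struct model_vq {
	bool backend;      /* false while the backend is dropped */
	size_t pending;    /* buffers the guest has placed in the ring */
	size_t processed;  /* buffers the host has consumed */
};

/* Guest adds one buffer to the ring. */
void model_guest_add(struct model_vq *vq)
{
	vq->pending++;
}

/* Kick handler: with no backend it returns early and the kick is consumed. */
void model_kick(struct model_vq *vq)
{
	if (!vq->backend)
		return;
	vq->processed += vq->pending;
	vq->pending = 0;
}

/* Device start: reattach the backend, optionally re-scanning the TX ring. */
void model_start(struct model_vq *vq, bool rescan_tx)
{
	vq->backend = true;
	if (rescan_tx)
		model_kick(vq);
}
```

In this model, a buffer added and kicked while the backend is absent stays
pending after model_start(vq, false), and is consumed only when the start
path re-scans the ring (rescan_tx set).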

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 drivers/vhost/vsock.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index bec6bcfd885f..81d4f7209719 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -646,6 +646,13 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 	 */
 	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);

+	/*
+	 * Some packets might've also been queued in TX VQ.  That is the case
+	 * during the brief device pause caused by VHOST_RESET_OWNER.  Re-scan
+	 * the TX VQ here, mirroring the RX send-worker kick above.
+	 */
+	vhost_poll_queue(&vsock->vqs[VSOCK_VQ_TX].poll);
+
 	mutex_unlock(&vsock->dev.mutex);
 	return 0;

-- 
2.47.1


* [PATCH v3 2/4] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
From: Andrey Drobyshev @ 2026-06-25 15:54 UTC
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260625155416.480669-1-andrey.drobyshev@virtuozzo.com>

Earlier commit bb26ed5f3a8b ("vhost/vsock: Refuse the connection
immediately when guest isn't ready") added a fast-fail in
vhost_transport_send_pkt().  It rejects every host send with -EHOSTUNREACH
until the destination calls SET_RUNNING(1).  The fast-fail condition checks
whether the device's backends are dropped, and if they are, the guest is
considered not ready.

However, there might be other reasons for backends to be nulled.  In
particular, when QEMU is performing CPR (checkpoint-restore) migration,
device ownership is RESET and SET again, which causes the backends to be
dropped and reattached.  If we end up connecting during this window, an
AF_VSOCK client gets -EHOSTUNREACH, which is wrong.

Add a 'started' flag which is set once in vhost_vsock_start() and is
never cleared.  The behaviour changes to:

  * When device was never started -> flag is unset -> no listener can
    exist yet -> fast-fail;
  * Once the device starts -> flag is set -> we don't fast-fail ->
    we queue and preserve during any later stop / CPR pause.

Important caveat: after the first start, a connect during any stopped
window is queued instead of fast-failed.  That was the behaviour before
the patch bb26ed5f3a8b, and we're restoring it now.  However we still
keep the behaviour originally intended by that commit (i.e. fast-fail if
there's no real listener yet) while fixing the CPR path.
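
The latch semantics described above can be sketched in user space as
follows.  This is an illustration only: struct model_dev and the model_*
functions are hypothetical, and C11 atomics stand in for the kernel's
READ_ONCE()/WRITE_ONCE().

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model of the device state; not kernel code. */
struct model_dev {
	atomic_bool started;  /* latched on first start, never cleared */
	bool backend;         /* dropped and reattached across CPR pauses */
	int queued;           /* packets waiting for the send worker */
};

void model_dev_start(struct model_dev *dev)
{
	dev->backend = true;
	atomic_store(&dev->started, true);
}

void model_dev_stop(struct model_dev *dev)
{
	dev->backend = false;  /* 'started' deliberately stays set */
}

/* Returns 0 if the packet was queued, -EHOSTUNREACH on fast-fail. */
int model_send_pkt(struct model_dev *dev)
{
	if (!atomic_load(&dev->started))
		return -EHOSTUNREACH;
	dev->queued++;
	return 0;
}
```

A send before the first model_dev_start() fails fast, while a send after
model_dev_stop() is still queued because the latch is not tied to the
backend pointer.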

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 drivers/vhost/vsock.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index b12221ce6faf..bec6bcfd885f 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -61,6 +61,7 @@ struct vhost_vsock {
 
 	u32 guest_cid;
 	bool seqpacket_allow;
+	bool started;		/* set on first SET_RUNNING(1); never cleared */
 };
 
 static u32 vhost_transport_get_local_cid(void)
@@ -302,17 +303,12 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
 		return -ENODEV;
 	}
 
-	/* Fast-fail if the guest hasn't enabled the RX vq yet. Queuing the packet
-	 * and making the caller wait is pointless: even if the guest manages to init
-	 * within the timeout, it'll immediately reply with RST, because there's no
-	 * listener on the port yet.
-	 *
-	 * vhost_vq_get_backend() without vq->mutex is acceptable here: locking
-	 * the mutex would be too expensive in this hot path, and we already have
-	 * all the outcomes covered: if the backend becomes NULL right after the check,
-	 * vhost_transport_do_send_pkt() will check it under the mutex anyway.
+	/* Fast-fail until the guest first enables the device (SET_RUNNING(1)).
+	 * Before that there is no listener, so queuing is pointless. 'started'
+	 * is never cleared, so once we're up we keep queuing across later
+	 * stop / CPR-pause windows.
 	 */
-	if (unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
+	if (unlikely(!READ_ONCE(vsock->started))) {
 		rcu_read_unlock();
 		kfree_skb(skb);
 		return -EHOSTUNREACH;
@@ -640,6 +636,11 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 		mutex_unlock(&vq->mutex);
 	}
 
+	/* Set 'started' flag on the first start; never cleared, so send_pkt
+	 * keeps queuing (instead of fast-failing) on later stop / CPR pauses.
+	 */
+	WRITE_ONCE(vsock->started, true);
+
 	/* Some packets may have been queued before the device was started,
 	 * let's kick the send worker to send them.
 	 */
@@ -728,6 +729,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 
 	vsock->guest_cid = 0; /* no CID assigned yet */
 	vsock->seqpacket_allow = false;
+	vsock->started = false;
 
 	atomic_set(&vsock->queued_replies, 0);
 
-- 
2.47.1



* Re: [PATCH iproute2-next] "ip help" wrong output, exit code.
From: Dmitri Seletski @ 2026-06-25 15:54 UTC
  Cc: netdev
In-Reply-To: <3d6f256c-3bd0-441a-bece-8692985c5ddc@gmail.com>

I am confused.

What's the next step here?

Regards

Dmitri

On 6/22/26 18:47, Dmitri Seletski wrote:
> Hello David,
>
>
> Based on change introduced:
>
> Two samples of "ip help" with demonstration of exit code and standard 
> output are below.
>
> This is in line with what I expect.
>
>
> dimkosPC~/compiled/iproute2-next #if ./ip/ip help a >>/dev/null  ; 
> then echo help triggered  ; else echo error code triggered  ;fi  #this 
> redirects standard output  to /dev/null, so text missing is not error,
> but standard text
> help triggered
>
> dimkosPC~/compiled/iproute2-next #if ./ip/ip help   ; then echo help 
> triggered  ; else echo error code triggered  ;fi
> Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }
>       ip [ -force ] -batch filename
> where  OBJECT := { address | addrlabel | fou | help | ila | ioam | 
> l2tp | link |
>                   macsec | maddress | monitor | mptcp | mroute | mrule |
>                   neighbor | neighbour | netconf | netns | nexthop | 
> ntable |
>                   ntbl | route | rule | sr | stats | tap | tcpmetrics |
>                   token | tunnel | tuntap | vrf | xfrm }
>       OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |
>                    -h[uman-readable] | -iec | -j[son] | -p[retty] |
>                    -f[amily] { inet | inet6 | mpls | bridge | link } |
>                    -4 | -6 | -M | -B | -0 |
>                    -l[oops] { maximum-addr-flush-attempts } | -echo | 
> -br[ief] |
>                    -o[neline] | -t[imestamp] | -ts[hort] | -b[atch] 
> [filename] |
>                    -rc[vbuf] [size] | -n[etns] name | -N[umeric] | 
> -a[ll] |
>                    -c[olor]}
> help triggered
>
> Two samples of command that is broken on purpose.
>
> dimkosPC~/compiled/iproute2-next #if ./ip/ip idontexist   ; then echo 
> help triggered  ; else echo error code triggered  ;fi
> Object "idontexist" is unknown, try "ip help".
> error code triggered
>
> dimkosPC~/compiled/iproute2-next #if ./ip/ip idontexist  >>/dev/null 
>  ; then echo help triggered  ; else echo error code triggered  ;fi 
>  #this redirects standard output  to /dev/null, so text missing is not 
> error, but standard text
> Object "idontexist" is unknown, try "ip help".
> error code triggered
>
> This works as expected as per my understanding.
>
>
> Not everything is fixed, but a chunk of things fixed is better than none
> of it.
>
> for example:
>
> if ip  add help    ; then echo help triggered  ; else echo error code 
> triggered  ;fi  #this redirects standard output  to /dev/null, so text 
> missing is not error, but standard text
> Usage: ip address {add|change|replace} IFADDR dev IFNAME [ LIFETIME ]
>                                                      [ CONFFLAG-LIST ]
>       ip address del IFADDR dev IFNAME [mngtmpaddr]
>       ip address {save|flush} [ dev IFNAME ] [ scope SCOPE-ID ] [ to 
> PREFIX ]
>                            [ FLAG-LIST ] [ label LABEL ] [ { up | down 
> } ]
>       ip address [ show [ dev IFNAME ] [ scope SCOPE-ID ] [ master 
> DEVICE ]
>                         [ nomaster ]
>                         [ type TYPE ] [ to PREFIX ] [ FLAG-LIST ]
>                         [ label LABEL ] [ { up | down } ] [ vrf NAME ]
>                         [ proto ADDRPROTO ] ]
>       ip address {showdump|restore}
> IFADDR := PREFIX | ADDR peer PREFIX
>          [ broadcast ADDR ] [ anycast ADDR ]
>          [ label IFNAME ] [ scope SCOPE-ID ] [ metric METRIC ]
>          [ proto ADDRPROTO ]
> SCOPE-ID := [ host | link | global | NUMBER ]
> FLAG-LIST := [ FLAG-LIST ] FLAG
> FLAG  := [ permanent | dynamic | secondary | primary |
>           [-]tentative | [-]deprecated | [-]dadfailed | temporary |
>           CONFFLAG-LIST ]
> CONFFLAG-LIST := [ CONFFLAG-LIST ] CONFFLAG
> CONFFLAG  := [ home | nodad | mngtmpaddr | noprefixroute | autojoin ]
> LIFETIME := [ valid_lft LFT ] [ preferred_lft LFT ]
> LFT := forever | SECONDS
> ADDRPROTO := [ NAME | NUMBER ]
> TYPE := { amt | bareudp | bond | bond_slave | bridge | bridge_slave |
>          dsa | dummy | erspan | geneve | gre | gretap | gtp | hsr |
>          ifb | ip6erspan | ip6gre | ip6gretap | ip6tnl |
>          ipip | ipoib | ipvlan | ipvtap |
>          macsec | macvlan | macvtap | netdevsim |
>          netkit | nlmon | pfcp | rmnet | sit | team | team_slave |
>          vcan | veth | vlan | vrf | vti | vxcan | vxlan | wwan |
>          xfrm | virt_wifi }
> error code triggered
>
> This is still problematic.
>
>
> But so far the code leaves the "ip help" command/argument in better shape
> than it found it in.
>
>
> I may try to improve things further, but let's submit what we already have
> "better", please.
>
> Kind Regards
>
> Dmitri Seletski
>
>
> On 6/22/26 17:44, David Laight wrote:
>> On Mon, 22 Jun 2026 07:57:00 -0700
>> Stephen Hemminger <stephen@networkplumber.org> wrote:
>>
>>> On Sun, 21 Jun 2026 22:48:59 +0100
>>> Dmitri Seletski <drjoms@gmail.com> wrote:
>>>
>>>>  From 0805e07105cd15c5b94271a4706e50e3c65dbde5 Mon Sep 17 00:00:00 
>>>> 2001
>>>> From: Dmitri Seletski <drjoms@gmail.com>
>>>> Date: Sun, 21 Jun 2026 22:12:43 +0100
>>>> Subject: [PATCH iproute2-next]  "ip help" wrong output, exit code.
>>>>
>>>> Changed output of "ip help" from standard error to standard output. 
>>>> And
>>>> Exit is now 0 instead of -1. "ip help|grep bridge" - now gives bridge
>>>> syntax instead of flooding user with everything from "ip help".
>>>> ---
>>>> ip/ip.c | 4 ++--
>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/ip/ip.c b/ip/ip.c
>>>> index e4b71bde..4627b61c 100644
>>>> --- a/ip/ip.c
>>>> +++ b/ip/ip.c
>>>> @@ -56,7 +56,7 @@ static void usage(void) __attribute__((noreturn));
>>>>
>>>> static void usage(void)
>>>> {
>>>> -fprintf(stderr,
>>>> +fprintf(stdout,
>>>> "Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }\n"
>>>> "       ip [ -force ] -batch filename\n"
>>>> "where  OBJECT := { address | addrlabel | fou | help | ila | ioam | 
>>>> l2tp
>>>> | link |\n"
>>>> @@ -72,7 +72,7 @@ static void usage(void)
>>>> "                    -o[neline] | -t[imestamp] | -ts[hort] | -b[atch]
>>>> [filename] |\n"
>>>> "                    -rc[vbuf] [size] | -n[etns] name | -N[umeric] |
>>>> -a[ll] |\n"
>>>> "                    -c[olor]}\n");
>>>> -exit(-1);
>>>> +exit(0);
>>>> }
>>> Your mailer damages white space.
>>>
>> The output also needs to depend on whether there is a 'usage' error or
>> if 'help' is requested.
>> The code is correct for the former - except it should do exit(1).
>>
>>     David
>>
>>
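
The distinction David describes (help requested versus usage error) can be
sketched as below.  This is a hypothetical helper, not the iproute2 code:
print_usage() and its help_requested parameter are assumptions made for
illustration, and the usage text is shortened to its first line.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not the iproute2 code: print the usage text to
 * stdout when help was requested and to stderr on a usage error, and
 * return the exit status the caller should pass to exit().
 */
int print_usage(int help_requested)
{
	FILE *out = help_requested ? stdout : stderr;

	fprintf(out, "Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }\n");
	return help_requested ? 0 : 1;
}
```

A caller would then use exit(print_usage(1)) for "ip help" and
exit(print_usage(0)) for an unknown object or option.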


* [PATCH v3 1/4] vhost/vsock: split out vhost_vsock_drop_backends helper
From: Andrey Drobyshev @ 2026-06-25 15:54 UTC
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260625155416.480669-1-andrey.drobyshev@virtuozzo.com>

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

Split the actual backend dropping part from vhost_vsock_stop.  We're
going to need it for the VHOST_RESET_OWNER implementation in the
following patch, when vsock->dev.mutex is already taken and owner is
checked.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
 drivers/vhost/vsock.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 9aaab6bb8061..b12221ce6faf 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -664,9 +664,24 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 	return ret;
 }
 
-static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
+static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
 {
+	struct vhost_virtqueue *vq;
 	size_t i;
+
+	lockdep_assert_held(&vsock->dev.mutex);
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		vq = &vsock->vqs[i];
+
+		mutex_lock(&vq->mutex);
+		vhost_vq_set_backend(vq, NULL);
+		mutex_unlock(&vq->mutex);
+	}
+}
+
+static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
+{
 	int ret = 0;
 
 	mutex_lock(&vsock->dev.mutex);
@@ -677,14 +692,7 @@ static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
 			goto err;
 	}
 
-	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
-		struct vhost_virtqueue *vq = &vsock->vqs[i];
-
-		mutex_lock(&vq->mutex);
-		vhost_vq_set_backend(vq, NULL);
-		mutex_unlock(&vq->mutex);
-	}
-
+	vhost_vsock_drop_backends(vsock);
 err:
 	mutex_unlock(&vsock->dev.mutex);
 	return ret;
-- 
2.47.1



* [PATCH v3 0/4] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration
From: Andrey Drobyshev @ 2026-06-25 15:54 UTC
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev

v2 -> v3:

  * Patch 4: skip the kick of TX VQ worker once backend is gone - add
    to cancel_pkt() the same guard as in send_pkt().
    (Reported by Sashiko AI)

v2: https://lore.kernel.org/virtualization/20260622175808.508084-1-andrey.drobyshev@virtuozzo.com

Andrey Drobyshev (2):
  vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
  vhost/vsock: re-scan TX virtqueue on device start

Pavel Tikhomirov (2):
  vhost/vsock: split out vhost_vsock_drop_backends helper
  vhost/vsock: add VHOST_RESET_OWNER ioctl

 drivers/vhost/vsock.c | 106 +++++++++++++++++++++++++++++++++---------
 1 file changed, 85 insertions(+), 21 deletions(-)

-- 
2.47.1



* Re: [Regression] Broken MPLS routes with multiple nexthops
From: Kuniyuki Iwashima @ 2026-06-25 15:51 UTC
  To: anthony.doeraene; +Cc: davem, kuniyu, netdev
In-Reply-To: <036a0c95-f5d4-46ab-88e7-1eab567d7a84@uclouvain.be>

From: Anthony Doeraene <anthony.doeraene@uclouvain.be>
Date: Thu, 25 Jun 2026 17:07:41 +0200
> Hello all,
> 
> According to my experiments, it seems that ECMP with MPLS (i.e. an MPLS 
> route with multiple
> nexthops) is broken on the master branch of the kernel.
> 
> Indeed, whenever adding an MPLS route with multiple nexthops, ip route
> shows the route as a dead route/link down, even if the nexthops are
> reachable.
> 
> Example to reproduce the error (tested with virtme-ng and the master 
> branch of the kernel):
> ```
> modprobe mpls_iptunnel mpls_router
> sysctl net.mpls.platform_labels=100000
> ip link set dummy0 up
> ip addr add fc00:1::1/112 dev dummy0
> ip addr add fc00:2::1/112 dev dummy0
> ip -M route add 16000 \
>      nexthop via inet6 fc00:1::2 as 16001 \
>      nexthop via inet6 fc00:2::2 as 16002
> 
> # Check the route
> ip -M route
> # Output:
> #     16000 dead linkdown
> #
> # Route is not present, even if accepted
> ```
> 
>  From a git blame, it seems that commit 
> f0914b8436c589b7ab32c614d8d7868eb4ebd5bf
> broke the core logic for building nexthops.

Thanks for the report !

It was to balance refcount with netdev_put() in mpls_rt_alloc().
I'll post the patch below.

(Updating rt->rt_nhn is not strictly needed for netdev_put()
 because it has NULL check and rt is allocated with kzalloc(),
 but it's a bit error prone, so I'll keep it)

---8<---
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index ca504d9626cf..4a81514e919a 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -922,8 +922,7 @@ static int mpls_nh_build_multi(struct mpls_route_config *cfg,
 	struct nlattr *nla_via, *nla_newdst;
 	int remaining = cfg->rc_mp_len;
 	int err = 0;
-
-	rt->rt_nhn = 0;
+	u8 nhs = 0;
 
 	change_nexthops(rt) {
 		int attrlen;
@@ -959,12 +958,15 @@ static int mpls_nh_build_multi(struct mpls_route_config *cfg,
 			rt->rt_nhn_alive--;
 
 		rtnh = rtnh_next(rtnh, &remaining);
-		rt->rt_nhn++;
+		nhs++;
 	} endfor_nexthops(rt);
 
+	rt->rt_nhn = nhs;
+
 	return 0;
 
 errout:
+	rt->rt_nhn = nhs;
 	return err;
 }
 
---8<---


> 
> This commit modified function `mpls_nh_build_multi` by setting
> `rt->rt_nhn` to 0 at the start of the function. However, the loop
> `change_nexthops` just below depends on `rt->rt_nhn` to know the actual
> number of nexthops that it should build. As `rt->rt_nhn` is set to 0 just
> before, **no nexthop is ever built**, leading to a dead route. Even if we
> remove this modification, this commit incorrectly increments `rt->rt_nhn`
> at the end of the loop (`rt->rt_nhn++`), such that the loop always ends
> with an error as it tries to construct more nexthops than were actually
> provided.
> 
> Commenting out these two lines fixes the issue and once again allows
> creating MPLS routes with multiple nexthops:
> 
> ```
> modprobe mpls_iptunnel mpls_router
> sysctl net.mpls.platform_labels=100000
> ip link set dummy0 up
> ip addr add fc00:1::1/112 dev dummy0
> ip addr add fc00:2::1/112 dev dummy0
> ip -M route add 16000 \
>      nexthop via inet6 fc00:1::2 as 16001 \
>      nexthop via inet6 fc00:2::2 as 16002
> 
> # Check the route
> ip -M route
> # Output:
> #     16000
> #            nexthop as to 16001 via inet6 fc00:1::2 dev dummy0
> #            nexthop as to 16002 via inet6 fc00:2::2 dev dummy0
> #
> # Route is accepted and present !
> ```
> 
> Overall, I think it would be interesting to discuss what this patch was
> trying to achieve, and how we can reconcile both use-cases.
> 
> Best regards and looking forward to hearing from you,
> Doeraene Anthony
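
The loop-bound problem discussed in this thread can be reduced to a small
model.  This is a sketch only: struct model_route, build_buggy() and
build_fixed() are hypothetical names, and the per-nexthop work (as well as
the in-loop increment mentioned in the report) is left out so that only the
handling of the counter remains.

```c
/* Hypothetical model of a route; only the nexthop count matters here. */
struct model_route {
	unsigned int nhn;  /* number of nexthops, doubles as the loop bound */
};

/* Buggy variant: the bound is reset before the loop, so it never runs. */
unsigned int build_buggy(struct model_route *rt)
{
	unsigned int built = 0;

	rt->nhn = 0;
	for (unsigned int i = 0; i < rt->nhn; i++)
		built++;
	return built;
}

/* Fixed variant: count in a local and publish it after the loop. */
unsigned int build_fixed(struct model_route *rt)
{
	unsigned int nhs = 0;

	for (unsigned int i = 0; i < rt->nhn; i++)
		nhs++;
	rt->nhn = nhs;
	return nhs;
}
```

Keeping the count in a local, as the posted patch does with 'nhs', leaves
the loop bound stable while still recording how many nexthops were built.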


* Re: [PATCH] octeontx2-af: Free BPID bitmap on setup failure
From: Jakub Kicinski @ 2026-06-25 15:50 UTC
  To: Haoxiang Li
  Cc: sgoutham, lcherian, gakula, hkelam, sbhatta, andrew+netdev, davem,
	edumazet, pabeni, horms, netdev, linux-kernel, nshettyj, rkannoth
In-Reply-To: <20260623114316.2182271-1-haoxiang_li2024@163.com>

On Tue, 23 Jun 2026 19:43:16 +0800 Haoxiang Li wrote:
> nix_setup_bpids() allocates bp->bpids with rvu_alloc_bitmap(), which uses
> a plain kcalloc(). If any of the following devm_kcalloc() allocations for
> the BPID mapping arrays fails, the function returns without freeing the
> bitmap. Free the BPID bitmap before returning from those error paths.

Marvell, you are actively working on this driver but review none of the
patches posted by others. This is your last warning, please, you have to
start reviewing the fixes.


* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Andrew Lunn @ 2026-06-25 15:46 UTC
  To: Maxime Chevallier
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <7a88fee8-bbb3-480f-9c93-677b7270a940@bootlin.com>

On Thu, Jun 25, 2026 at 12:46:44PM +0200, Maxime Chevallier wrote:
> Hi Andrew,
> 
> On 5/29/26 14:59, Andrew Lunn wrote:
> 
> (This discussion was a while ago, but this bit of context should be enough)
> 
> > But we also need to consider that for some APIs, we have decided that
> > a configuration can be set now, which does not actually apply in our
> > current conditions, but it will be stored away for when conditions
> > change and it is applicable. The half duplex case could fit that. When
> > the link is currently half duplex, you can configure pause, but you
> > don't expect it to actually change the current behaviour. It only
> > kicks in when the link renegotiates to full duplex sometime in the
> > future. We have to also consider this the other way around. The link
> > is full duplex and pause is configured by the user. Something happens
> > with the LP and the link renegotiates to half duplex. The local end
> > should not throw away the configuration, it simply cannot apply it
> > given the current situation.
> 
> I'm writing the test description for HD with a better formatting, so the
> HD test wouldn't be about "are we using pause stuff while in HD" as it
> doesn't make sense, but rather "do we correctly store the pause settings
> aside for later".

O.K.

> I'm realising that we don't really have an API to report the *true* in-use pause
> settings. Taking HD as an example :
> 
> # ethtool -s eth2 duplex half
> 
> [588209.379363] mvpp2 f4000000.ethernet eth2: Link is Up - 100Mbps/Half - flow control off
> 
> # ethtool eth2
> 	[...]
> 	Supported pause frame use: Symmetric Receive-only
> 	Advertised pause frame use: Symmetric Receive-only
> 	Link partner advertised pause frame use: Symmetric Receive-only

Does it even make sense to advertise this when in HD? But i don't
think we need to consider this now. I consider HD low priority, i
doubt it is actually used very often. We should concentrate on FD
testing.

> # ethtool -a eth2
> Autonegotiate:	on
> RX:		off
> TX:		off
> RX negotiated: on
> TX negotiated: on
> 
> 
> Sure, pause and HD don't make sense, however what I find confusing to some
> extent is that the only place we have information about the *actual* pause
> settings is the "link is Up" log in dmesg.

Maybe we should extend ksetting get to return the resolved pause
parameters? But i'm not sure how much that actually gives us. Anything
using phylink will just ask phylink to fill in the ksettings
information, and it seems unlikely phylink gets it wrong. What we are
really trying to test is drivers which don't use phylink, those are
the ones which are generally broken, and they are not going to
implement anything new in ksettings. So i think the test has to look
at:

> 	Advertised pause frame use: Symmetric Receive-only
> 	Link partner advertised pause frame use: Symmetric Receive-only

and check these match what we expect.

    Andrew
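
The "store the configuration, apply it only when the link allows" behaviour
discussed above can be sketched as follows.  This is an illustration, not
the ethtool or phylib API: struct pause_adv, struct pause_state and
resolve_pause() are hypothetical, and the resolution rules are the usual
Pause/Asym_Pause ones as I understand them.

```c
#include <stdbool.h>

/* Hypothetical types for illustration; not the ethtool or phylib API. */
struct pause_adv {
	bool pause;  /* symmetric Pause bit */
	bool asym;   /* Asym_Pause bit */
};

struct pause_state {
	bool rx;
	bool tx;
};

/*
 * Resolve the pause modes from both advertisements, then gate the result
 * on duplex: in half duplex nothing is applied, but the advertisements
 * themselves are left untouched for a later renegotiation.
 */
struct pause_state resolve_pause(struct pause_adv local, struct pause_adv lp,
				 bool full_duplex)
{
	struct pause_state st = { false, false };

	if (!full_duplex)
		return st;

	if (local.pause && lp.pause) {
		st.rx = st.tx = true;
	} else if (local.asym && lp.asym) {
		if (local.pause)
			st.rx = true;
		else if (lp.pause)
			st.tx = true;
	}
	return st;
}
```

With both ends advertising "Symmetric Receive-only" (both bits set), this
yields RX and TX pause in full duplex and nothing in half duplex, while the
stored advertisements stay the same in either case.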


* Re: [PATCH net-next] net: neigh: avoid calling neigh_forced_gc on every alloc when table is full
From: Jakub Kicinski @ 2026-06-25 15:42 UTC
  To: Vimal Agrawal; +Cc: netdev, kuniyu, edumazet, vimal.agrawal
In-Reply-To: <20260625102020.92814-1-vimal.agrawal@sophos.com>

On Thu, 25 Jun 2026 10:20:20 +0000 Vimal Agrawal wrote:
> Once the neighbour table exceeds gc_thresh3, neigh_forced_gc() is called
> on every allocation attempt with no rate limiting. In workloads with mostly
> active/reachable entries, the GC walk traverses a large portion of the
> neighbour table without reclaiming entries, holding tbl->lock for an
> extended period. This causes severe lock contention and allocation
> latencies exceeding 16ms under sustained neighbour creation.
> 
> Add a pre-lock check in neigh_forced_gc() to skip the GC run if one was
> performed within the last second, avoiding repeated full table scans and
> lock acquisitions on the hot allocation path.
> 
> Profiling of neigh_create() shows ~3 orders of magnitude latency
> improvement with this change.

I'm not an expert on neigh but 1 second seems a little aggressive.
Can you see if 10msec doesn't give us a similar win?
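
For comparing intervals, the check in the patch can be modelled in user
space as below.  This is a sketch under assumptions: tick_after() is a
re-implementation in the spirit of the kernel's time_after(), and
gc_allowed() with its 'interval' parameter is hypothetical.

```c
#include <stdbool.h>

/*
 * Wrap-safe "a is after b" on an unsigned tick counter, in the spirit of
 * the kernel's time_after(); this is a user-space re-implementation.
 */
bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Hypothetical rate limiter: allow a forced GC run only if more than
 * 'interval' ticks have passed since 'last_flush'.
 */
bool gc_allowed(unsigned long now, unsigned long last_flush,
		unsigned long interval)
{
	return tick_after(now, last_flush + interval);
}
```

Assuming 1000 ticks per second, trying 10 msec instead of 1 second only
changes the 'interval' argument from 1000 to 10.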

>  net/core/neighbour.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 1349c0eedb64..078842db3c5f 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -260,6 +260,9 @@ static int neigh_forced_gc(struct neigh_table *tbl)
>  	int shrunk = 0;
>  	int loop = 0;
>  
> +	if (!time_after(jiffies, READ_ONCE(tbl->last_flush) + HZ))
> +		return 0;
> +
>  	NEIGH_CACHE_STAT_INC(tbl, forced_gc_runs);
>  
>  	spin_lock_bh(&tbl->lock);



* Re: [PATCH v2 0/4] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration
From: Pavel Tikhomirov @ 2026-06-25 15:32 UTC
  To: Andrey Drobyshev, linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, den
In-Reply-To: <20260622175808.508084-1-andrey.drobyshev@virtuozzo.com>

Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

On 6/22/26 19:58, Andrey Drobyshev wrote:
> v1 -> v2:
> 
>   * Patch 2 (suppress EHOSTUNREACH): replace 'cpr_paused' + backend check
>     with a single 'started' latch;
>   * Patch 3 (re-scan TX virtqueue): reword commit message;
>   * Patch 4 (VHOST_RESET_OWNER):
>       - fix a vhost_worker use-after-free / stuck VHOST_WORK_QUEUED stall
>         against the lockless send path;
>       - drop the no-op vsock_for_each_connected_socket() iteration;
>   * Shuffle the patches, keep RESET_OWNER implementation last to preserve
>     bisectability;
>   * Reword the cover letter.
> 
> v1: https://lore.kernel.org/virtualization/20260612165718.433546-1-andrey.drobyshev@virtuozzo.com
> 
> Host<-->guest connections via AF_VSOCK sockets aren't supposed to
> outlive VM migration, since VM is moving to another host.  However
> there's a special case, which is QEMU live-update, or CPR
> (checkpoint-restore) migration.  In this case, VM remains on the same
> host, and we'd like such connections to persist.
> 
> For this to work, we need to be able to transfer device ownership from
> source QEMU to dest QEMU.  Namely, source needs to reset ownership by
> issuing VHOST_RESET_OWNER ioctl, and then target has to claim it by
> calling VHOST_SET_OWNER.
> 
> Since VHOST_RESET_OWNER isn't yet implemented for vhost-vsock, let's add
> such implementation.  Patch 1 is a preliminary helper.  Patches 2 and 3
> fix the pre-existing issues which do manifest during CPR / RESET_OWNER.
> Patch 4 is the ioctl's implementation itself - we keep it last to
> preserve bisectability.
> 
> There's a complementary series for QEMU [0] adding support of vhost-vsock
> devices during CPR migration.
> 
> I've tested this (patched QEMU + patched kernel) approximately as follows:
> 
>   * Run listener in the guest:
>   socat -u VSOCK-LISTEN:9999 - >/tmp/recv.bin
> 
>   * Run data transfer from host to guest:
>   socat -u FILE:/root/bigfile.bin VSOCK-CONNECT:CID:9999
> 
>   * Perform CPR migration during transfer (either cpr-exec or cpr-transfer)
>   * Check that file hash sum matches
> 
> [0] https://lore.kernel.org/qemu-devel/20260619105514.128812-1-andrey.drobyshev@virtuozzo.com
> 
> Andrey Drobyshev (2):
>   vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
>   vhost/vsock: re-scan TX virtqueue on device start
> 
> Pavel Tikhomirov (2):
>   vhost/vsock: split out vhost_vsock_drop_backends helper
>   vhost/vsock: add VHOST_RESET_OWNER ioctl
> 
>  drivers/vhost/vsock.c | 96 ++++++++++++++++++++++++++++++++++---------
>  1 file changed, 76 insertions(+), 20 deletions(-)
> 

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.



* [PATCH] net: usb: cx82310_eth: stop parsing reboot marker as packet
From: Tianchu Chen @ 2026-06-25 15:32 UTC
  To: andrew+netdev, davem, edumazet, kuba, pabeni; +Cc: linux-usb, netdev

From: Tianchu Chen <flynnnchen@tencent.com>

Discovered by Atuin - Automated Vulnerability Discovery Engine.

cx82310_rx_fixup() treats an RX length of 0xffff as a device reboot
marker and schedules work to re-enable ethernet mode, but then continues
processing the marker as a normal packet length. This results in an
out-of-bounds heap write controlled by the USB device.

Return immediately after scheduling the recovery work so the marker skb
is dropped instead of being assembled as packet data.
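
The intended control flow can be sketched as a small classifier.  This is
not the driver code: enum rx_action, classify_rx_len() and the MODEL_*
constants are hypothetical, and the MTU value is an assumption made only
for illustration.

```c
#include <stdint.h>

#define MODEL_REBOOT_MARKER 0xffffu  /* length value the commit message describes */
#define MODEL_MTU 1514u              /* assumed limit, for illustration only */

enum rx_action {
	RX_DROP_AND_RECOVER,  /* reboot marker: schedule recovery, drop the skb */
	RX_DROP_TOO_LONG,     /* length exceeds the MTU */
	RX_ACCEPT,            /* plausible packet length */
};

/* Hypothetical classifier for a received length field; not driver code. */
enum rx_action classify_rx_len(uint16_t len)
{
	if (len == MODEL_REBOOT_MARKER)
		return RX_DROP_AND_RECOVER;  /* must not fall through */
	if (len > MODEL_MTU)
		return RX_DROP_TOO_LONG;
	return RX_ACCEPT;
}
```

The point of the sketch is that the marker value gets its own terminal
outcome and never reaches the path that treats the length as packet size.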

Fixes: ca139d76b0d9 ("cx82310_eth: re-enable ethernet mode after router reboot")
Cc: stable@vger.kernel.org
Signed-off-by: Tianchu Chen <flynnnchen@tencent.com>
---
 drivers/net/usb/cx82310_eth.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/cx82310_eth.c b/drivers/net/usb/cx82310_eth.c
index 068acb052..5df657acf 100644
--- a/drivers/net/usb/cx82310_eth.c
+++ b/drivers/net/usb/cx82310_eth.c
@@ -282,6 +282,7 @@ static int cx82310_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 		if (len == 0xffff) {
 			netdev_info(dev->net, "router was rebooted, re-enabling ethernet mode");
 			schedule_work(&priv->reenable_work);
+			return 0;
 		} else if (len > CX82310_MTU) {
 			netdev_err(dev->net, "RX packet too long: %d B\n", len);
 			return 0;
-- 
2.51.0


* RE: [PATCH net v5 1/4] net: ethernet: oa_tc6: Interrupt is active low, level triggered.
From: Selvamani Rajagopal @ 2026-06-25 15:31 UTC
  To: Parthiban.Veerasooran@microchip.com, andrew+netdev@lunn.ch,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, robh@kernel.org, krzk+dt@kernel.org,
	conor+dt@kernel.org, Piergiorgio Beruto
  Cc: andrew@lunn.ch, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Conor.Dooley@microchip.com,
	devicetree@vger.kernel.org
In-Reply-To: <CYYPR02MB982836BC273D09FD3BDE623583EC2@CYYPR02MB9828.namprd02.prod.outlook.com>

Parthiban,

Let me know if you prefer updating the patchset. I certainly prefer adding a NULL check
in the oa_tc6_update_rx_skb function.


> 
> Root cause seems to be same. When oa_tc6_update_rx_skb function is called,
> tc6->rx_skb seems to be NULL, which may mean, controller seems to be not getting start
> 
> I have a theory. Look at line #933. We have the following comment. I am sure this could
> be true for the call to oa_tc6_prcs_rx_frame_end at line #926 or
> oa_tc6_prcs_ongoing_rx_frame at line #950.
>                /* After rx buffer overflow error received, there might be a
>                  * possibility of getting an end valid of a previously
>                  * incomplete rx frame along with the new rx frame start valid.
>                  */
> 


^ permalink raw reply

* Re: [PATCH net-next] selftests/xsk: preserve UMEM view in bidi test
From: Jakub Kicinski @ 2026-06-25 15:29 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: netdev, bpf, magnus.karlsson, stfomichev, pabeni, horms,
	tushar.vyavahare, kerneljasonxing
In-Reply-To: <aj1He9vNkRh+Ettf@boxer>

On Thu, 25 Jun 2026 17:21:31 +0200 Maciej Fijalkowski wrote:
> On Thu, Jun 25, 2026 at 07:36:36AM -0700, Jakub Kicinski wrote:
> > On Thu, 25 Jun 2026 12:35:12 +0200 Maciej Fijalkowski wrote:  
> > > On Wed, Jun 24, 2026 at 07:33:26PM -0700, Jakub Kicinski wrote:  
> > > I have not checked if this has been -net propagated already, but the rule
> > > of thumb on bpf side was that all selftests related effort goes to -next.
> > > Is it different on netdev side?  
> > 
> > We prefer -next too, but during the merge window net-next is closed.
> > 
> > What we definitely don't want is a -next patch with a Fixes tag.
> > So either net or drop the tag, please.  
> 
> I have verified that offending commit is present in net tree. Could you
> apply the v2 that I unfortunately sent already targeted at net-next, to
> net tree?

Will do. Hopefully it applies, cause net-next wasn't forwarded yet
so it doesn't include Tushar's patches.

^ permalink raw reply

* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-25 15:29 UTC (permalink / raw)
  To: Andrew Lunn, Jakub Kicinski
  Cc: davem, Eric Dumazet, Paolo Abeni, Simon Horman, Russell King,
	Heiner Kallweit, Jonathan Corbet, Shuah Khan, Oleksij Rempel,
	Vladimir Oltean, Florian Fainelli, thomas.petazzoni, netdev,
	linux-kernel, linux-doc
In-Reply-To: <5cb8e2b4-8eb6-4446-9b90-1cd4c7964cd9@lunn.ch>

Hi Andrew,

On 5/27/26 04:47, Andrew Lunn wrote:
> On Tue, May 26, 2026 at 05:24:47PM -0700, Jakub Kicinski wrote:
>> On Fri, 22 May 2026 19:51:06 +0200 Maxime Chevallier (Netdev
>> Foundation) wrote:
>>>  Documentation/networking/pause_test_plan.rst | 556 +++++++++++++++++++
>>
>> It'd be great to hear from others but IMHO in the current form this is
>> not suitable for Documentation/networking/ We can commit the "knowledge"
>> part but enumerating the test cases seems odd for Documentation/.
> 
> Sorry, not looked too deeply at the actual content yet.
> 
> What i was thinking was a python file, which sphinx can ingest to
> produce documentation, and place holders were code would be added to
> implement the actual test during the next phase.
> 
> This is how i've done testing in the past. I would be the evil one who
> thought up the tests and described them in detail using sphinx markup
> in a python test template file. After some review they got passed off
> to a python developer for implementation. And when they got run and
> failed, sometimes the feature developer, the test developer and myself
> got together to figure who made the error.
> 
> I'm not sure we even need sphinx. What i find important is that the
> test is documented. What kAPI calls should be made with what
> parameters. What results we are expected and why? So that when a test
> fails, a developer has the information they need to fix their
> code. The Why? is important, and often missing from the kernel tests.

This isn't sphinx, but I've come up with something like this for a
test definition:


@ksft_ethtool_needs_supported_anyof([Pause, Asym_Pause])
def test_ethtool_pause_advertising(cfg, peer) -> None:
    """Pause advertisement

    Validate that changing pause params through the ETHTOOL_MSG_PAUSE command
    translates to a change in the advertised pause params, and that these
    parameters are correct w.r.t the supported pause params and requested pause
    params.
    
    This exercises the .set_pauseparam() ethtool ops for MAC configuration,
    as well as the reconfiguration of the PHY's advertising and negotiation.

    On non-phylink MACs, the MAC should call phy_set_sym_pause() to update the
    PHY's advertising, and restart a negotiation with phy_start_aneg() if
    need be. Failure to do so will result in the wrong advertising parameters.

    On phylink-enabled MACs, phylink deals with the PHY reconfiguration provided
    the MAC driver calls phylink_ethtool_set_pauseparam().

    Failing this test likely means that the PHY driver is not correctly advertising
    pause settings, either due to the MAC not triggering a PHY reconfiguration,
    a misconfiguration of the advertising registers by the PHY, or by
    mis-handling the phydev->advertising bitfield in the PHY driver directly.
    
    The validation is made by looking at the advertised modes locally, as well as
    what the peer's 'lp_advertising' values report.

    cfg -- local device's interface configuration
    peer -- peer device handle
    """

    # Initial conditions :
    # - Local interface is admin UP, and reports lowlayer link UP
    # - Remote interface is admin UP, and reports lowlayer link UP
    #
    # Test 1
    # - SKIP if supported doesn't contain "Pause"
    # - run 'ethtool -A ethX rx on tx on autoneg on'
    # - FAIL if the return isn't 0
    # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
    #   "Pause" or contains "Asym_Pause"
    # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
    #   "Asym_Pause"
    # - Succeed otherwise
    #
    # Test 2
    # - SKIP if supported doesn't contain both "Pause" and "Asym_Pause"
    # - run 'ethtool -A ethX rx on tx on autoneg on'
    # - FAIL if the return isn't 0
    # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values does not contain
    #   "Pause" or contains "Asym_Pause"
    # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
    #   "Asym_Pause"
    #
    # ...
   
The annotation defines the pre-requisites in terms of locally supported
linkmodes, and the docstring contains information for developers
to debug their drivers. What I'm unsure about is the commented-out part
below the docstring: either one big function testing multiple adjacent
scenarios, or individual functions.

We could also use annotations to enumerate the various combinations of
modes to test.

That's just an extract of the full test suite for Pause, but before
writing the whole thing down I figured it's better to iterate on a single
test's design.

What do you think ?

Maxime

^ permalink raw reply

* Re: [PATCH net v4 2/2] net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery
From: Jakub Kicinski @ 2026-06-25 15:23 UTC (permalink / raw)
  To: Aleksander Jan Bajkowski
  Cc: Petr Wozniak, Russell King, Andrew Lunn, Heiner Kallweit,
	David S . Miller, Eric Dumazet, Paolo Abeni, netdev, linux-kernel,
	linux-phy, Maxime Chevallier, Bjorn Mork, Marek Behun
In-Reply-To: <9f813a8e-8b9a-4708-b3b6-db4972adac35@wp.pl>

On Wed, 24 Jun 2026 23:44:19 +0200 Aleksander Jan Bajkowski wrote:
> > For genuine RollBall modules (e.g. FLYPRO SFP-10GT-CS-30M with Aquantia
> > AQR113C) the probe now runs after initialization is complete and
> > correctly returns 0, so PHY detection proceeds normally.  
> The FLPRO SFP module still fails to detect the PHY. It is necessary to
> increase `module_t_wait` to 20 seconds. Most likely, during this time
> the module loads the PHY firmware from SPI memory or from the
> microcontroller (rollball bridge) via MDIO. Same probably applies to
> most SFP modules with a PHY that load firmware at start-up (AQR113,
> RTL8261C etc.).

Just to clarify is FLPRO a typo or a knock off ?
Do you want something to be changed here or you're just flagging that
more follow ups are needed if we want to cover more modules?

^ permalink raw reply

* RE: [PATCH net v5 1/4] net: ethernet: oa_tc6: Interrupt is active low, level triggered.
From: Selvamani Rajagopal @ 2026-06-25 15:21 UTC (permalink / raw)
  To: Parthiban.Veerasooran@microchip.com, andrew+netdev@lunn.ch,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, robh@kernel.org, krzk+dt@kernel.org,
	conor+dt@kernel.org, Piergiorgio Beruto
  Cc: andrew@lunn.ch, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Conor.Dooley@microchip.com,
	devicetree@vger.kernel.org
In-Reply-To: <f127837f-e08f-48e0-a3a9-906e1d61d6bb@microchip.com>

> -----Original Message-----
> From: Parthiban.Veerasooran@microchip.com <Parthiban.Veerasooran@microchip.com>
> Subject: Re: [PATCH net v5 1/4] net: ethernet: oa_tc6: Interrupt is active low, level triggered.
> 
> 
> With your above patches, I did a quick test (Test case 2) with two
> Microchip MAC-PHYs and faced a similar issue reported before. Sharing
> the dmesg crash log for your reference.

Root cause seems to be same. When oa_tc6_update_rx_skb function is called, tc6->rx_skb 
seems to be NULL, which may mean, controller seems to be not getting start

I have a theory. Look at line #933. We have the following comment. I am sure this could be true
for the call to oa_tc6_prcs_rx_frame_end at line #926 or oa_tc6_prcs_ongoing_rx_frame at line #950.
               /* After rx buffer overflow error received, there might be a
                 * possibility of getting an end valid of a previously
                 * incomplete rx frame along with the new rx frame start valid.
                 */

Either we change the following line in the function oa_tc6_update_rx_skb
    if ((tc6->rx_skb->tail + length) > tc6->rx_skb->end) {
to
    if (tc6->rx_skb == NULL || (tc6->rx_skb->tail + length) > tc6->rx_skb->end) {

Or add a check
    if (tc6->rx_skb) before calling the two functions mentioned above from their caller.

I could do that, but I have no way of verifying it. I am sure it will fix the crash. I would like to confirm
whether traffic recovers.

> 
> [ 2863.182105] eth1: Receive buffer overflow error
> [ 2863.199905] eth1: Receive buffer overflow error
> [ 2867.669312] Unable to handle kernel NULL pointer dereference at
> virtual address 00000000000000b8


^ permalink raw reply
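
The first option proposed above (checking for a missing buffer before the bounds check) can be modeled in user space roughly as follows. This is only a sketch under stated assumptions: `struct model_rx_buf` and `model_update_rx_buf()` are hypothetical stand-ins, not the oa_tc6 code, and a NULL `rx` plays the role of a NULL `tc6->rx_skb`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the receive buffer; not the kernel's sk_buff. */
struct model_rx_buf {
	uint8_t *data;	/* backing storage */
	size_t tail;	/* bytes already stored */
	size_t end;	/* capacity of data */
};

/* Returns 0 on success, -1 if the chunk has to be dropped. */
static int model_update_rx_buf(struct model_rx_buf *rx,
			       const uint8_t *payload, size_t length)
{
	/* Test the pointer first so tail/end are never read through NULL. */
	if (!rx)
		return -1;
	/* Same intent as the tail + length > end check in the thread. */
	if (length > rx->end - rx->tail)
		return -1;
	memcpy(rx->data + rx->tail, payload, length);
	rx->tail += length;
	return 0;
}
```

The ordering is the point: with `||` in C the right-hand side is not evaluated when the NULL test is true, so the one-line change suggested in the message has the same effect as the two separate checks here. Whether traffic recovers afterwards is a separate question that this model does not answer.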

* Re: [PATCH net-next] selftests/xsk: preserve UMEM view in bidi test
From: Maciej Fijalkowski @ 2026-06-25 15:21 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, bpf, magnus.karlsson, stfomichev, pabeni, horms,
	tushar.vyavahare, kerneljasonxing
In-Reply-To: <20260625073636.449a28c0@kernel.org>

On Thu, Jun 25, 2026 at 07:36:36AM -0700, Jakub Kicinski wrote:
> On Thu, 25 Jun 2026 12:35:12 +0200 Maciej Fijalkowski wrote:
> > On Wed, Jun 24, 2026 at 07:33:26PM -0700, Jakub Kicinski wrote:
> > > On Tue, 23 Jun 2026 11:10:08 +0200 Maciej Fijalkowski wrote:  
> > > > Subject: [PATCH net-next] selftests/xsk: preserve UMEM view in bidi test  
> > > 
> > > Do you want it in net? Either way - we'll need a rebase  
> > 
> > I have not checked if this has been -net propagated already, but the rule
> > of thumb on bpf side was that all selftests related effort goes to -next.
> > Is it different on netdev side?
> 
> We prefer -next too, but during the merge window net-next is closed.
> 
> What we definitely don't want is a -next patch with a Fixes tag.
> So either net or drop the tag, please.

I have verified that offending commit is present in net tree. Could you
apply the v2 that I unfortunately sent already targeted at net-next, to
net tree?

^ permalink raw reply

* Re: [PATCH net v2] sctp: fix SCTP_RESET_STREAMS stream list length limit
From: Jakub Kicinski @ 2026-06-25 15:19 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Marcelo Ricardo Leitner, Xin Long, David S . Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, linux-sctp, netdev, linux-kernel
In-Reply-To: <20260625142354.2600-1-alhouseenyousef@gmail.com>

On Thu, 25 Jun 2026 16:23:54 +0200 Yousef Alhouseen wrote:
> Changes in v2:
> - Add Fixes and Acked-by tags from Xin Long.
> - v1: https://lore.kernel.org/r/20260624122213.4052-1-alhouseenyousef@gmail.com

You don't have to repost patches for networking just to add tags :/

^ permalink raw reply

* Re: [PATCH v2 net 2/3] net: udp_tunnel: convert state flags to atomic bitops
From: Eric Dumazet @ 2026-06-25 15:18 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, David S . Miller, Paolo Abeni, Simon Horman,
	Yue Sun, netdev, eric.dumazet
In-Reply-To: <20260625080854.06851faf@kernel.org>

On Thu, Jun 25, 2026 at 8:08 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 25 Jun 2026 06:59:37 +0000 Eric Dumazet wrote:
> > These flags can be modified concurrently from different contexts:
> > - RTNL-locked paths (like add_port/del_port) write to need_sync and
> >   work_pending.
>
> These should hold utn->lock. Not sure why udp_tunnel_nic_lock()
> is locking in the callers rather than directly in
> __udp_tunnel_nic_add_port() / __udp_tunnel_nic_del_port()..
>
> > - The RTNL-less reset path (reset_ntf, used by netdevsim) writes to
> >   need_sync and need_replay under utn->lock.
>
> I'd rather add asserts to confirm utn lock is held everywhere.
> This code is hard enough to follow as is, without having to
> think through potential concurrent accesses.

Ah ok, I will let you finish this, it seems I am wasting your time.

Thanks.

^ permalink raw reply

* Re: [PATCH net 1/4] net: turn the rx_mode work into a generic netdev_work facility
From: Jakub Kicinski @ 2026-06-25 15:17 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, nb, aleksandr.loktionov,
	dtatulea
In-Reply-To: <CAAVpQUCbT1Q9BPTLrVCjpt2vcJiWsYKa0onJ_vwnq86L73m8mw@mail.gmail.com>

On Wed, 24 Jun 2026 22:55:06 -0700 Kuniyuki Iwashima wrote:
> Oh very nice !
> 
> I was drafting almost the same change for dev_set_rx_mode()
> in mcast path and some ipvlan changes.

Glad to hear! I wasn't 100% convinced by the added complexity
in the core :S

^ permalink raw reply

* Re: [PATCH] xsk: fix memory corruptions in net/core/xdp.c
From: Alexander Lobakin @ 2026-06-25 15:14 UTC (permalink / raw)
  To: Clement Lecigne
  Cc: edumazet, netdev, bpf, linux-kernel, kuba, sdf, horms,
	john.fastabend, ast, daniel
In-Reply-To: <20260624084130.2382335-1-clecigne@google.com>

From: Clement Lecigne <clecigne@google.com>
Date: Wed, 24 Jun 2026 08:41:28 +0000

> From: Clément Lecigne <clecigne@google.com>
> 
> Commit 560d958c6c68 ("xsk: add generic XSk &xdp_buff -> skb conversion")
> introduced a vulnerability in the handling of XDP_PASS for AF_XDP zero-copy
> frames.
> 
> Note: Currently, this specific AF_XDP zero-copy conversion path is only
> reachable from the drivers/net/ethernet/intel/ice driver.

idpf uses this, too (every driver based on libeth_xdp in general,
currently these two).

> 
> When building an skb, xdp_build_skb_from_zc() uses the chunk size
> (xdp->frame_sz) for the allocation. However, napi_build_skb() automatically
> reserves space at the end of the allocation for the skb_shared_info
> structure. 
> 
> Most high performance UMEM applications use 4K chunks, where the
> corruption cannot happen. However, if the UMEM is configured with 2KB
> chunks (a very common configuration to maximize packet density in memory),
> a standard 1500 MTU packet will trigger the corruption because the required
> space exceeds the 2048 byte chunk size:
> 
> Headroom (256) + Packet (1514) + skb_shared_info (320) = 2090 bytes
> 
> Because 2090 bytes > 2048 bytes and __skb_put() does not perform bounds
> checking, the memcpy() writes past the available linear data area and
> corrupts the skb_shared_info structure. This can lead to arbitrary code
> execution if pointers like destructor_arg are overwritten.
> 
> Additionally, in xdp_copy_frags_from_zc(), the allocation size is set
> strictly to the fragment size (len), but the subsequent memcpy() uses
> LARGEST_ALIGN(len). This mismatch results in an out-of-bounds write of
> up to 7 bytes, which triggers KASAN warnings and is unsafe despite typical
> page pool allocator padding.
> 
> Fix the skb allocation in xdp_build_skb_from_zc() by dynamically
> calculating the exact truesize required: the sum of the headroom, the
> packet length, and the skb_shared_info overhead, properly aligned via
> SKB_DATA_ALIGN.
> 
> Fix the out-of-bounds write in xdp_copy_frags_from_zc() by rounding up
> the allocation request using LARGEST_ALIGN(len) to match the copy
> operation.
> 
> Fixes: 560d958c6c68 ("xsk: add generic XSk &xdp_buff -> skb conversion")
> CC: Alexander Lobakin <aleksander.lobakin@intel.com>
> CC: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Clément Lecigne <clecigne@google.com>
> ---
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 9890a30584ba..f36d1fb875ab 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -699,7 +699,7 @@ static noinline bool xdp_copy_frags_from_zc(struct sk_buff *skb,
>  	for (u32 i = 0; i < nr_frags; i++) {
>  		const skb_frag_t *frag = &xinfo->frags[i];
>  		u32 len = skb_frag_size(frag);
> -		u32 offset, truesize = len;
> +		u32 offset, truesize = LARGEST_ALIGN(len);

I think you need to re-sort this to keep RCT, now that the truesize
initialization is way longer than it was.

		const skb_frag_t *frag = &xinfo->frags[i];
		u32 offset, len = skb_frag_size(frag);
		u32 truesize = LARGEST_ALIGN(len);
		struct page *page;

>  		struct page *page;
>  
>  		page = page_pool_dev_alloc(pp, &offset, &truesize);

BTW usually LARGEST_ALIGN() aligns to 16, I've never seen a bigger one.
IIRC Page Pool never returns a truesize aligned to a smaller value. But
if you're really able to trigger this, it probably does?

> @@ -740,7 +740,9 @@ struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
>  {
>  	const struct xdp_rxq_info *rxq = xdp->rxq;
>  	u32 len = xdp->data_end - xdp->data_meta;
> -	u32 truesize = xdp->frame_sz;
> +	u32 headroom = xdp->data_meta - xdp->data_hard_start;
> +	u32 truesize = SKB_DATA_ALIGN(headroom + len) +
> +		       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

Ah now I get it: xdp->frame_sz doesn't account for the shinfo for
single-buffer frames, only for multi-buffer ones. The fix looks correct,
but I'd use SKB_HEAD_ALIGN() since it does exactly what you're
open-coding here and sort the declarations:

{
	u32 hr = xdp->data_meta - xdp->data_hard_start;
	const struct xdp_rxq_info *rxq = xdp->rxq;
	u32 len = xdp->data_end - xdp->data_meta;
	u32 truesize = SKB_HEAD_ALIGN(hr + len);
	struct sk_buff *skb = NULL;
	struct page_pool *pp;
	int metalen;
	void *data;

	if (!IS_ENABLED(CONFIG_PAGE_POOL))
		return NULL;

	...

>  	struct sk_buff *skb = NULL;
>  	struct page_pool *pp;
>  	int metalen;
> @@ -762,7 +764,7 @@ struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
>  	}
>  
>  	skb_mark_for_recycle(skb);
> -	skb_reserve(skb, xdp->data_meta - xdp->data_hard_start);
> +	skb_reserve(skb, headroom);
>  
>  	memcpy(__skb_put(skb, len), xdp->data_meta, LARGEST_ALIGN(len));

Thanks,
Olek

^ permalink raw reply
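
The size arithmetic discussed in this thread can be checked with a small user-space sketch. All names here are hypothetical, and the 64-byte alignment is an assumption standing in for the kernel's cache-line based data alignment; the 256/1514/320 figures are the ones quoted in the commit message, and the 16-byte copy alignment follows the remark in the review.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache-line size used for data alignment in this model. */
#define MODEL_CACHE_BYTES 64u

/* Round x up to a multiple of a; a must be a power of two. */
static uint32_t align_up(uint32_t x, uint32_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Bytes needed for headroom + packet, plus the shared info, each aligned. */
static uint32_t model_truesize(uint32_t headroom, uint32_t len, uint32_t shinfo)
{
	return align_up(headroom + len, MODEL_CACHE_BYTES) +
	       align_up(shinfo, MODEL_CACHE_BYTES);
}

/* Extra bytes touched when align_up(len, a) bytes go into a len-byte buffer. */
static uint32_t model_overcopy(uint32_t len, uint32_t a)
{
	return align_up(len, a) - len;
}
```

Under these assumptions `model_truesize(256, 1514, 320)` comes to 2112 bytes, slightly more than the unaligned 2090 in the commit message and still above a 2048-byte chunk, while a 1514-byte fragment copied with 16-byte rounding touches 6 bytes past its length.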

* [PATCH v4 net 3/3] i40e: keep q_vectors array in sync with channel count changes
From: Maciej Fijalkowski @ 2026-06-25 15:14 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, magnus.karlsson, kuba, pabeni, horms, przemyslaw.kitszel,
	jacob.e.keller, Maciej Fijalkowski
In-Reply-To: <20260625151431.1102838-1-maciej.fijalkowski@intel.com>

For the main VSI, i40e_set_num_rings_in_vsi() always derives
num_q_vectors from pf->num_lan_msix. At the same time, ethtool -L stores
the user requested channel count in vsi->req_queue_pairs and the queue
setup path uses that value for the effective number of queue pairs.

This leaves queue and vector counts out of sync after shrinking channel
count via ethtool -L. The active queue configuration is reduced, but the
VSI still keeps the full PF-sized q_vector topology.

That mismatch breaks reconfiguration flows which rely on vector/NAPI
state matching the effective channel configuration. In particular,
toggling /sys/class/net/<dev>/threaded after reducing the channel count
can hang, and later channel-count changes can fail because VSI reinit
does not rebuild q_vectors to match the new vector count.

Fix this by making the main VSI num_q_vectors follow the effective
requested channel count, capped by the available MSI-X vectors. Update
i40e_vsi_reinit_setup() to rebuild q_vectors during VSI reinit so the
vector topology is refreshed together with the ring arrays when channel
count changes.

Keep alloc_queue_pairs unchanged and based on pf->num_lan_qps so the VSI
retains its full queue capacity.

Selftest napi_threaded.py was originally used when Jakub reported a hang
on the /sys/class/net/<dev>/threaded toggle. In order to make it pass on
i40e, use persistent NAPI configuration for q_vector NAPIs so NAPI
identity and threaded settings survive q_vector reallocation across
channel-count changes. This is achieved by using netif_napi_add_config()
when configuring q_vectors.

$ export NETIF=ens259f1np1
$ sudo -E env PATH="$PATH" ./tools/testing/selftests/drivers/net/napi_threaded.py
TAP version 13
1..3
ok 1 napi_threaded.napi_init
ok 2 napi_threaded.change_num_queues
ok 3 napi_threaded.enable_dev_threaded_disable_napi_threaded
Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0

Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/intel-wired-lan/20260316133100.6054a11f@kernel.org/
Fixes: d2a69fefd756 ("i40e: Fix changing previously set num_queue_pairs for PFs")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 60 +++++++++++++--------
 1 file changed, 37 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 4adc7b0fb2f4..c017217a1bc3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11406,10 +11406,14 @@ static void i40e_service_timer(struct timer_list *t)
 static int i40e_set_num_rings_in_vsi(struct i40e_vsi *vsi)
 {
 	struct i40e_pf *pf = vsi->back;
+	u16 qps;
 
 	switch (vsi->type) {
 	case I40E_VSI_MAIN:
 		vsi->alloc_queue_pairs = pf->num_lan_qps;
+		qps = vsi->req_queue_pairs ?
+		      min(vsi->req_queue_pairs, pf->num_lan_qps) :
+		      pf->num_lan_qps;
 		if (!vsi->num_tx_desc)
 			vsi->num_tx_desc = ALIGN(I40E_DEFAULT_NUM_DESCRIPTORS,
 						 I40E_REQ_DESCRIPTOR_MULTIPLE);
@@ -11417,7 +11421,7 @@ static int i40e_set_num_rings_in_vsi(struct i40e_vsi *vsi)
 			vsi->num_rx_desc = ALIGN(I40E_DEFAULT_NUM_DESCRIPTORS,
 						 I40E_REQ_DESCRIPTOR_MULTIPLE);
 		if (test_bit(I40E_FLAG_MSIX_ENA, pf->flags))
-			vsi->num_q_vectors = pf->num_lan_msix;
+			vsi->num_q_vectors = clamp(qps, 1, pf->num_lan_msix);
 		else
 			vsi->num_q_vectors = 1;
 
@@ -11469,12 +11473,11 @@ static int i40e_set_num_rings_in_vsi(struct i40e_vsi *vsi)
 /**
  * i40e_vsi_alloc_arrays - Allocate queue and vector pointer arrays for the vsi
  * @vsi: VSI pointer
- * @alloc_qvectors: a bool to specify if q_vectors need to be allocated.
  *
  * On error: returns error code (negative)
  * On success: returns 0
  **/
-static int i40e_vsi_alloc_arrays(struct i40e_vsi *vsi, bool alloc_qvectors)
+static int i40e_vsi_alloc_arrays(struct i40e_vsi *vsi)
 {
 	struct i40e_ring **next_rings;
 	int size;
@@ -11493,19 +11496,18 @@ static int i40e_vsi_alloc_arrays(struct i40e_vsi *vsi, bool alloc_qvectors)
 	}
 	vsi->rx_rings = next_rings;
 
-	if (alloc_qvectors) {
-		/* allocate memory for q_vector pointers */
-		size = sizeof(struct i40e_q_vector *) * vsi->num_q_vectors;
-		vsi->q_vectors = kzalloc(size, GFP_KERNEL);
-		if (!vsi->q_vectors) {
-			ret = -ENOMEM;
-			goto err_vectors;
-		}
+	/* allocate memory for q_vector pointers */
+	size = sizeof(struct i40e_q_vector *) * vsi->num_q_vectors;
+	vsi->q_vectors = kzalloc(size, GFP_KERNEL);
+	if (!vsi->q_vectors) {
+		ret = -ENOMEM;
+		goto err_vectors;
 	}
 	return ret;
 
 err_vectors:
 	kfree(vsi->tx_rings);
+	vsi->tx_rings = NULL;
 	return ret;
 }
 
@@ -11578,7 +11580,7 @@ static int i40e_vsi_mem_alloc(struct i40e_pf *pf, enum i40e_vsi_type type)
 	if (ret)
 		goto err_rings;
 
-	ret = i40e_vsi_alloc_arrays(vsi, true);
+	ret = i40e_vsi_alloc_arrays(vsi);
 	if (ret)
 		goto err_rings;
 
@@ -11603,18 +11605,15 @@ static int i40e_vsi_mem_alloc(struct i40e_pf *pf, enum i40e_vsi_type type)
 /**
  * i40e_vsi_free_arrays - Free queue and vector pointer arrays for the VSI
  * @vsi: VSI pointer
- * @free_qvectors: a bool to specify if q_vectors need to be freed.
  *
  * On error: returns error code (negative)
  * On success: returns 0
  **/
-static void i40e_vsi_free_arrays(struct i40e_vsi *vsi, bool free_qvectors)
+static void i40e_vsi_free_arrays(struct i40e_vsi *vsi)
 {
 	/* free the ring and vector containers */
-	if (free_qvectors) {
-		kfree(vsi->q_vectors);
-		vsi->q_vectors = NULL;
-	}
+	kfree(vsi->q_vectors);
+	vsi->q_vectors = NULL;
 	kfree(vsi->tx_rings);
 	vsi->tx_rings = NULL;
 	vsi->rx_rings = NULL;
@@ -11674,7 +11673,7 @@ static int i40e_vsi_clear(struct i40e_vsi *vsi)
 	i40e_put_lump(pf->irq_pile, vsi->base_vector, vsi->idx);
 
 	bitmap_free(vsi->af_xdp_zc_qps);
-	i40e_vsi_free_arrays(vsi, true);
+	i40e_vsi_free_arrays(vsi);
 	i40e_clear_rss_config_user(vsi);
 
 	pf->vsi[vsi->idx] = NULL;
@@ -12046,7 +12045,8 @@ static int i40e_vsi_alloc_q_vector(struct i40e_vsi *vsi, int v_idx)
 	cpumask_copy(&q_vector->affinity_mask, cpu_possible_mask);
 
 	if (vsi->netdev)
-		netif_napi_add(vsi->netdev, &q_vector->napi, i40e_napi_poll);
+		netif_napi_add_config(vsi->netdev, &q_vector->napi,
+				      i40e_napi_poll, v_idx);
 
 	/* tie q_vector and vsi together */
 	vsi->q_vectors[v_idx] = q_vector;
@@ -14267,12 +14267,26 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 
 	pf = vsi->back;
 
+	if (test_bit(I40E_FLAG_MSIX_ENA, pf->flags)) {
+		i40e_put_lump(pf->irq_pile, vsi->base_vector, vsi->idx);
+		vsi->base_vector = 0;
+	}
+
 	i40e_put_lump(pf->qp_pile, vsi->base_queue, vsi->idx);
+	i40e_vsi_free_q_vectors(vsi);
 	i40e_vsi_clear_rings(vsi);
+	i40e_vsi_free_arrays(vsi);
 
-	i40e_vsi_free_arrays(vsi, false);
 	i40e_set_num_rings_in_vsi(vsi);
-	ret = i40e_vsi_alloc_arrays(vsi, false);
+	ret = i40e_vsi_alloc_arrays(vsi);
+	if (ret)
+		goto err_netdev;
+
+	/* Rebuild q_vectors during VSI reinit because the effective channel
+	 * count may change num_q_vectors. Keep vector topology aligned with the
+	 * queue configuration after ethtool's .set_channels() callback.
+	 */
+	ret = i40e_vsi_setup_vectors(vsi);
 	if (ret)
 		goto err_netdev;
 
@@ -14284,7 +14298,7 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 		dev_info(&pf->pdev->dev,
 			 "failed to get tracking for %d queues for VSI %d err %d\n",
 			 alloc_queue_pairs, vsi->seid, ret);
-		goto err_netdev;
+		goto err_rings;
 	}
 	vsi->base_queue = ret;
 
-- 
2.43.0


^ permalink raw reply related
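
The vector-count selection introduced for the main VSI in the patch above can be restated as a standalone function. This is a sketch, not the driver code: `model_num_q_vectors()` is a hypothetical name, the MSI-X flag is reduced to an `int`, and `num_lan_msix` is assumed to be at least 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of how many q_vectors the main VSI ends up with. */
static uint16_t model_num_q_vectors(uint16_t req_queue_pairs,
				    uint16_t num_lan_qps,
				    uint16_t num_lan_msix,
				    int msix_enabled)
{
	uint16_t qps;

	/* Without MSI-X there is a single vector regardless of the request. */
	if (!msix_enabled)
		return 1;

	/* A request of 0 means "no explicit request": use the PF queue count. */
	if (req_queue_pairs)
		qps = req_queue_pairs < num_lan_qps ? req_queue_pairs : num_lan_qps;
	else
		qps = num_lan_qps;

	/* Equivalent of clamp(qps, 1, num_lan_msix). */
	if (qps < 1)
		return 1;
	if (qps > num_lan_msix)
		return num_lan_msix;
	return qps;
}
```

For example, with 64 LAN queue pairs and 32 MSI-X vectors, a request for 4 channels yields 4 vectors in this model, whereas the pre-patch behaviour described in the commit message would have kept all 32.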

* [PATCH v4 net 2/3] i40e: fix potential UAF in i40e_vsi_setup()'s error path
From: Maciej Fijalkowski @ 2026-06-25 15:14 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, magnus.karlsson, kuba, pabeni, horms, przemyslaw.kitszel,
	jacob.e.keller, Maciej Fijalkowski
In-Reply-To: <20260625151431.1102838-1-maciej.fijalkowski@intel.com>

Sashiko pointed out an issue where the error path in i40e_vsi_reinit_setup()
released ring memory and then, when freeing q_vectors, touched the rings
mapped to those q_vectors, which is a regular use-after-free bug.

Apparently i40e_vsi_setup() has the same problem, so swap the allocation
and freeing order and fix the 13-year-old bug.

Fixes: 41c445ff0f48 ("i40e: main driver core")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 471fa7f7b643..4adc7b0fb2f4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -14460,14 +14460,14 @@ struct i40e_vsi *i40e_vsi_setup(struct i40e_pf *pf, u8 type,
 		fallthrough;
 	case I40E_VSI_FDIR:
 		/* set up vectors and rings if needed */
-		ret = i40e_vsi_setup_vectors(vsi);
-		if (ret)
-			goto err_msix;
-
 		ret = i40e_alloc_rings(vsi);
 		if (ret)
 			goto err_rings;
 
+		ret = i40e_vsi_setup_vectors(vsi);
+		if (ret)
+			goto err_qvec;
+
 		/* map all of the rings to the q_vectors */
 		i40e_vsi_map_rings_to_vectors(vsi);
 
@@ -14487,10 +14487,10 @@ struct i40e_vsi *i40e_vsi_setup(struct i40e_pf *pf, u8 type,
 	return vsi;
 
 err_config:
+	i40e_vsi_free_q_vectors(vsi);
+err_qvec:
 	i40e_vsi_clear_rings(vsi);
 err_rings:
-	i40e_vsi_free_q_vectors(vsi);
-err_msix:
 	if (vsi->netdev_registered) {
 		vsi->netdev_registered = false;
 		unregister_netdev(vsi->netdev);
-- 
2.43.0


^ permalink raw reply related
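
The ordering rule behind the patch above (allocate rings before vectors, release vectors before rings) can be illustrated with a user-space model. Everything here is hypothetical: the `model_*` types and functions only mimic the relationship described in the commit message, where freeing a vector writes to the rings mapped to it.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_NUM_QUEUES 4

struct model_vector;

struct model_ring {
	struct model_vector *q_vector;	/* back-pointer set when rings are mapped */
};

struct model_vector {
	struct model_ring *ring;	/* borrowed; owned by the rings array */
};

struct model_vsi {
	struct model_ring *rings;
	struct model_vector *vectors;
};

/* Freeing vectors writes to the rings, so the rings must still be allocated. */
static void model_free_vectors(struct model_vsi *vsi)
{
	for (int i = 0; i < MODEL_NUM_QUEUES; i++)
		if (vsi->vectors[i].ring)
			vsi->vectors[i].ring->q_vector = NULL;
	free(vsi->vectors);
	vsi->vectors = NULL;
}

static void model_free_rings(struct model_vsi *vsi)
{
	free(vsi->rings);
	vsi->rings = NULL;
}

/* Returns 0 on success, -1 on failure; fail_config forces the late error path. */
static int model_vsi_setup(struct model_vsi *vsi, int fail_config)
{
	vsi->rings = calloc(MODEL_NUM_QUEUES, sizeof(*vsi->rings));
	if (!vsi->rings)
		goto err_rings;

	vsi->vectors = calloc(MODEL_NUM_QUEUES, sizeof(*vsi->vectors));
	if (!vsi->vectors)
		goto err_qvec;

	for (int i = 0; i < MODEL_NUM_QUEUES; i++) {
		vsi->vectors[i].ring = &vsi->rings[i];
		vsi->rings[i].q_vector = &vsi->vectors[i];
	}

	if (fail_config)
		goto err_config;
	return 0;

err_config:
	model_free_vectors(vsi);	/* allocated last, released first */
err_qvec:
	model_free_rings(vsi);
err_rings:
	return -1;
}
```

If the two labels were swapped so that rings were freed first, `model_free_vectors()` would write through `ring->q_vector` into freed memory, which is the class of bug the commit message describes.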

* [PATCH v4 net 1/3] i40e: unregister netdev before clearing VSI on reinit failure
From: Maciej Fijalkowski @ 2026-06-25 15:14 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, magnus.karlsson, kuba, pabeni, horms, przemyslaw.kitszel,
	jacob.e.keller, Maciej Fijalkowski
In-Reply-To: <20260625151431.1102838-1-maciej.fijalkowski@intel.com>

i40e_vsi_reinit_setup() tears down the existing VSI queue/ring backing
state before allocating replacement arrays and queue tracking. If one of
these early allocations fails, the function jumps directly to err_vsi
and calls i40e_vsi_clear().

For a registered netdev, this frees the VSI while
netdev_priv(netdev)->vsi can still point at it, leaving the registered
netdev with dangling private driver state.

Split the error path so failures after destructive reinit teardown first
unregister and free the netdev before clearing the VSI.

Fixes: d2a69fefd756 ("i40e: Fix changing previously set num_queue_pairs for PFs")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index a04683004a56..471fa7f7b643 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -14274,7 +14274,7 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 	i40e_set_num_rings_in_vsi(vsi);
 	ret = i40e_vsi_alloc_arrays(vsi, false);
 	if (ret)
-		goto err_vsi;
+		goto err_netdev;
 
 	alloc_queue_pairs = vsi->alloc_queue_pairs *
 			    (i40e_enabled_xdp_vsi(vsi) ? 2 : 1);
@@ -14284,7 +14284,7 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 		dev_info(&pf->pdev->dev,
 			 "failed to get tracking for %d queues for VSI %d err %d\n",
 			 alloc_queue_pairs, vsi->seid, ret);
-		goto err_vsi;
+		goto err_netdev;
 	}
 	vsi->base_queue = ret;
 
@@ -14309,6 +14309,7 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 
 err_rings:
 	i40e_vsi_free_q_vectors(vsi);
+err_netdev:
 	if (vsi->netdev_registered) {
 		vsi->netdev_registered = false;
 		unregister_netdev(vsi->netdev);
@@ -14318,7 +14319,6 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 	if (vsi->type == I40E_VSI_MAIN)
 		i40e_devlink_destroy_port(pf);
 	i40e_aq_delete_element(&pf->hw, vsi->seid, NULL);
-err_vsi:
 	i40e_vsi_clear(vsi);
 	return NULL;
 }
-- 
2.43.0


^ permalink raw reply related

* [PATCH v4 net 0/3] i40e: re-init and UAF fixes
From: Maciej Fijalkowski @ 2026-06-25 15:14 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, magnus.karlsson, kuba, pabeni, horms, przemyslaw.kitszel,
	jacob.e.keller, Maciej Fijalkowski

v4:
- add preceding patch that fixes a case when some of re-init allocations
  failed and we missed de-registering netdev at failure path
- pull out i40e_vsi_setup() changes onto separate patch
v3:
- address UAF when ring arrays were freed before q_vector's ring
  containers (Sashiko, Jacob)
- remove bool params from alloc/free array routines (Simon)
v2:
- NULL vsi->tx_rings in i40e_vsi_alloc_arrays() (Sashiko)

Maciej Fijalkowski (3):
  i40e: unregister netdev before clearing VSI on reinit failure
  i40e: fix potential UAF in i40e_vsi_setup()'s error path
  i40e: keep q_vectors array in sync with channel count changes

 drivers/net/ethernet/intel/i40e/i40e_main.c | 76 ++++++++++++---------
 1 file changed, 45 insertions(+), 31 deletions(-)

-- 
2.43.0


^ permalink raw reply
