Netdev List
* Re: [PATCH 3/4] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
From: Andrey Drobyshev @ 2026-06-16 15:58 UTC
  To: Stefano Garzarella
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <ajFUk7quPhbI7Te-@sgarzare-redhat>

On 6/16/26 5:18 PM, Stefano Garzarella wrote:
> On Fri, Jun 12, 2026 at 07:57:17PM +0300, Andrey Drobyshev wrote:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> Earlier commit ("ms/vhost/vsock: Refuse the connection immediately when
> 
> Please follow 
> https://docs.kernel.org/process/submitting-patches.html#describe-your-changes 
> on how to refer to a commit.
>

I omitted the hash on purpose as the commit is not yet in the mainline
tree, although our series is based and depends on it, as I mentioned:

https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?id=bb26ed5f3a8b

So it's a different (Michael's) repo and the commit is about to get
merged (but not yet there).  But maybe usual reference style + repo link
would be better.
>> guest isn't ready") added a fast-fail in vhost_transport_send_pkt().  It
>> rejects every host send with -EHOSTUNREACH until the destination calls
>> SET_RUNNING(1).  The fast-fail condition checks whether device's backends
>> are dropped, and if they are, the guest is considered to be not ready.
> 
> Okay, so it's not a regression, I mean without this series that patch is 
> not adding any regression, no?
> 
> If it's the case, I'll change the wording in the cover letter.
>

Agreed.

>>
>> However, there might be other reasons for backends to be nulled.  In
>> particular, when QEMU is performing CPR (checkpoint-restore) migration,
>> device ownership is being RESET and SET again, which leads to backends
>> drop and reattach.  If we end up connecting during this window, an
>> AF_VSOCK client gets -EHOSTUNREACH, which is wrong.
> 
> Please add this change before starting to support VHOST_RESET_OWNER 
> ioctl in vhost-vsock, otherwise we are breaking the bisectability.
>

Agreed.

>>
>> Add a cpr_paused flag set inside vhost_vsock_drop_backends() when the
>> backend was previously live, cleared by vhost_vsock_start(). When set,
>> vhost_transport_send_pkt() queues the skb instead of fast-failing; the
>> existing kick of send_pkt_work in vhost_vsock_start() drains it on
>> resume. A device that has never run keeps cpr_paused == false and the
>> boot-time fast-fail behaviour is preserved.
>>
>> Pair the cpr_paused store with the backend store using an
>> smp_wmb()/smp_rmb() pair so a concurrent sender on a weakly-ordered
>> architecture never observes (NULL backend, !paused):
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> ---
>> drivers/vhost/vsock.c | 22 +++++++++++++++++++---
>> 1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index e629886e5cf8..bcaba36becd7 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -61,6 +61,7 @@ struct vhost_vsock {
>>
>> 	u32 guest_cid;
>> 	bool seqpacket_allow;
>> +	bool cpr_paused;	/* between stop and next start */
>> };
>>
>> static u32 vhost_transport_get_local_cid(void)
>> @@ -311,11 +312,17 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
>> 	 * the mutex would be too expensive in this hot path, and we already have
>> 	 * all the outcomes covered: if the backend becomes NULL right after the check,
>> 	 * vhost_transport_do_send_pkt() will check it under the mutex anyway.
>> +	 *
>> +	 * Don't fast-fail if cpr_paused is set, keep queueing skbs instead.
>> +	 * The kick in vhost_vsock_start() will drain them on resume.
>> 	 */
>> 	if (unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
>> -		rcu_read_unlock();
>> -		kfree_skb(skb);
>> -		return -EHOSTUNREACH;
>> +		smp_rmb();	/* pairs with smp_wmb() in start/drop_backends */
>> +		if (!READ_ONCE(vsock->cpr_paused)) {
> 
> Can we avoid this which is not really readable and maybe add a single 
> variable to control the fast-fail at all?
> 
> I mean replacing both cpr_paused + backend-pointer with a single 
> `started` flag: set it to false at open, true on start via 
> smp_store_release(), back to false on normal stop, and leave it true 
> during CPR pause.
> 
> The reader in send_pkt can do just:
> 
>      if (!smp_load_acquire(&vsock->started))
>          return -EHOSTUNREACH;
> 
> WDYT?
>

I don't think it's gonna work as suggested.  As I understand, the order
during CPR migration is:

1) SET_RUNNING(0)
       -> vhost_vsock_stop()
           -> vhost_vsock_drop_backends()
2) RESET_OWNER
       -> vhost_vsock_drop_backends()
3) SET_OWNER
4) SET_RUNNING(1)
       -> vhost_vsock_start
           -> for (...) vhost_vq_set_backend()

(Btw I just noticed backends are already NULL at step 2), but that's
just our CPR case, for any potential RESET_OWNER users it might not be
the case).

So the race window starts from 1) (not from 2)).  We have no way of
differentiating whether device is actually being stopped for good, or
we're in the middle of CPR.  If we set the flag to false on stop as you
suggested, we'll still hit the -EHOSTUNREACH case eventually, and
avoiding it is the whole purpose of this patch.

The fast-fail with -EHOSTUNREACH relies on the presence of backends.
IIUC the backend will only become set after initial SET_RUNNING(1),
which will only happen once the guest driver writes smth to virtio
config register, QEMU catches it and calls SET_RUNNING(1).  So we have
ordering with the guest's actions here, which is logical.  But for our
issue that means that the only true marker of paused/not paused is the
presence of backends - and that's why the flag is set in
vhost_vsock_drop_backends().
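
For readers following the thread, the intended behaviour of the flag can be
modelled in a few lines of userspace C.  This is only a rough sketch under
stated assumptions: the struct and the model_* helpers are made up for
illustration, and locking, RCU and the smp_wmb()/smp_rmb() pairing from the
patch are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy userspace model, not kernel code.  Field names mirror the patch. */
struct vsock_model {
	bool backend;     /* stands in for a non-NULL RX vq backend */
	bool cpr_paused;  /* set when a previously live backend is dropped */
	int queued;       /* packets waiting for the send worker */
};

/* Called for both SET_RUNNING(0) and RESET_OWNER in this model. */
static void model_drop_backends(struct vsock_model *v)
{
	if (v->backend)            /* only a device that was running is "paused" */
		v->cpr_paused = true;
	v->backend = false;
}

/* Models SET_RUNNING(1): attach backends, clear the flag, drain the queue. */
static void model_start(struct vsock_model *v)
{
	v->backend = true;
	v->cpr_paused = false;
	v->queued = 0;             /* the kick lets the worker drain the queue */
}

/* Models the fast-fail decision in the send path. */
static int model_send_pkt(struct vsock_model *v)
{
	if (!v->backend && !v->cpr_paused)
		return -EHOSTUNREACH;  /* device never started: fast-fail */
	v->queued++;               /* otherwise queue for the worker */
	return 0;
}
```

Walking the CPR sequence through the model (one start, then two drops for
SET_RUNNING(0) and RESET_OWNER), model_send_pkt() keeps returning 0, whereas
on a freshly opened device it returns -EHOSTUNREACH.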

>> +			rcu_read_unlock();
>> +			kfree_skb(skb);
>> +			return -EHOSTUNREACH;
>> +		}
> 
> 
> That said claude here is reporting a potential issue that I think we 
> should consider:
>      After VHOST_RESET_OWNER, the guest CID stays in the hash, so 
>      vhost_transport_send_pkt() can still find the vsock, skip the 
>      fast-fail (cpr_paused=true), and call vhost_vq_work_queue() while 
>      vhost_workers_free() is freeing workers without a synchronize_rcu() 
>      — risking a use-after-free. Also, any send_pkt_work queued between 
>      the last flush and worker teardown gets its VHOST_WORK_QUEUED bit 
>      stuck (the vhost task exits without draining), deadlocking 
>      host→guest traffic after restart.
> 
>      A synchronize_rcu() in vhost_workers_free() between the 
>      rcu_assign_pointer(NULL) loop and the destroy loop would close the 
>      use-after-free, and reinitializing send_pkt_work via 
>      vhost_work_init() after vhost_dev_reset_owner() returns would clear 
>      the stuck QUEUED bit.
> 
> 

Yes, this looks real indeed.  Though I couldn't hit the UAF issue while
testing host->guest transfer under KASAN.

>> 	}
>>
>> 	if (virtio_vsock_skb_reply(skb))
>> @@ -640,6 +647,9 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
>> 		mutex_unlock(&vq->mutex);
>> 	}
>>
>> +	smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>> +	WRITE_ONCE(vsock->cpr_paused, false);
>> +
>> 	/* Some packets may have been queued before the device was started,
>> 	 * let's kick the send worker to send them.
>> 	 */
>> @@ -671,6 +681,11 @@ static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
>>
>> 	lockdep_assert_held(&vsock->dev.mutex);
>>
>> +	if (vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) {
>> +		WRITE_ONCE(vsock->cpr_paused, true);
>> +		smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>> +	}
> 
> Why here and not in vhost_vsock_reset_owner()?
> 
> Also having this here will set it to true also with 
> VHOST_VSOCK_SET_RUNNING(0), is that right?
>

That was added here precisely to cover the vhost_vsock_stop() case (see
above).

> Thanks,
> Stefano
> 
>> +
>> 	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
>> 		vq = &vsock->vqs[i];
>>
>> @@ -728,6 +743,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
>>
>> 	vsock->guest_cid = 0; /* no CID assigned yet */
>> 	vsock->seqpacket_allow = false;
>> +	vsock->cpr_paused = false;
>>
>> 	atomic_set(&vsock->queued_replies, 0);
>>
>> -- 
>> 2.47.1
>>
> 



* Re: [PATCH 4/4] vhost/vsock: re-scan TX virtqueue on device start
From: Andrey Drobyshev @ 2026-06-16 15:58 UTC
  To: Stefano Garzarella
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <ajFbT6sDESh9FDOl@sgarzare-redhat>

On 6/16/26 5:23 PM, Stefano Garzarella wrote:
> On Fri, Jun 12, 2026 at 07:57:18PM +0300, Andrey Drobyshev wrote:
>> During QEMU CPR live-update (and VHOST_RESET_OWNER in general) the guest
>> keeps running while the host drops and later re-attaches vhost backends.
>> If the guest adds a buffer to the TX virtqueue (guest->host) and kicks
>> while the backend is temporarily NULL (between vhost_vsock_drop_backends()
>> and the next vhost_vsock_start()), then the kick is delivered to the
>> vhost worker, handle_tx_kick() sees a NULL backend and returns, and the
>> kick signal is consumed.  The buffer is then left in the ring.
>>
>> Then upon device start vhost_vsock_start() only re-kicks the RX send
>> worker, never the TX VQ, so the buffer is processed only if the guest
>> happens to kick again.  But if the guest itself is now waiting for data
>> from the host, it will never kick TX VQ again, and we end up in a
>> deadlock.
>>
>> The deadlock is reproduced during active host->guest socat data transfer
>> under multiple consecutive CPR live-updates.
>>
>> To fix this, in vhost_vsock_start(), after kicking the RX send worker, also
>> queue the TX vq poll so any buffers the guest enqueued while we were paused
>> get scanned.
> 
> Again, it seems like we're fixing an issue that existed before this 
> series, but IIUC without support for VHOST_RESET_OWNER, this could never 
> have happened, so the wording should be changed to make it clear that 
> this can happen only with the new VHOST_RESET_OWNER support.
> 
> In addition, this patch must also be applied before the 
> VHOST_RESET_OWNER support or merged into it.
>

Agreed.
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>> drivers/vhost/vsock.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index bcaba36becd7..1fcfe71d18be 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -655,6 +655,12 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
>> 	 */
>> 	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
>>
>> +	/*
>> +	 * Some packets might've also been queued in TX VQ.  Re-scan it here,
>> +	 * mirroring the RX send-worker kick above.
>> +	 */
> 
> Can we also mention that this is related to VHOST_RESET_OWNER?
>

Agreed.
> Thanks,
> Stefano
> 
>> +	vhost_poll_queue(&vsock->vqs[VSOCK_VQ_TX].poll);
>> +
>> 	mutex_unlock(&vsock->dev.mutex);
>> 	return 0;
>>
>> -- 
>> 2.47.1
>>
> 



* Re: [PATCH net-next v2 0/9] atm: remove more dead code
From: patchwork-bot+netdevbpf @ 2026-06-16 16:00 UTC
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, 3chas3,
	mitch, linux-atm-general, dwmw2
In-Reply-To: <20260615194416.752559-1-kuba@kernel.org>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 15 Jun 2026 12:44:07 -0700 you wrote:
> Commit 6deb53595092 ("net: remove unused ATM protocols and legacy
> ATM device drivers") removed a good chunk of old ATM drivers.
> Our goal going forward is to limit the ATM support to PPPoATM
> used in ADSL deployments.
> 
> A recent burst of AI generated fixes for net/atm/signaling.c and
> net/atm/svc.c made me look closer at the remaining code. PPPoATM runs
> over permanent virtual circuits (PF_ATMPVC) with a statically
> configured VPI/VCI. We can drop switched virtual circuits (SVCs)
> and user-space signaling (atmsigd) support. While digging around
> I noticed a few more obviously dead pieces of code.
> 
> [...]

Here is the summary with links:
  - [net-next,v2,1/9] atm: remove AAL3/4 transport support
    https://git.kernel.org/netdev/net-next/c/c1468145ce75
  - [net-next,v2,2/9] atm: remove the unused send_oam / push_oam callbacks
    https://git.kernel.org/netdev/net-next/c/b20aa9eded10
  - [net-next,v2,3/9] atm: remove dead SONET PHY ioctls
    https://git.kernel.org/netdev/net-next/c/277fb497d101
  - [net-next,v2,4/9] atm: remove the local ATM (NSAP) address registry
    https://git.kernel.org/netdev/net-next/c/a5a12d76d2cb
  - [net-next,v2,5/9] atm: remove SVC socket support and the signaling daemon interface
    https://git.kernel.org/netdev/net-next/c/aa582dc25ace
  - [net-next,v2,6/9] atm: remove the unused change_qos device operation
    https://git.kernel.org/netdev/net-next/c/6719d57ee047
  - [net-next,v2,7/9] atm: remove the unused pre_send and send_bh device operations
    https://git.kernel.org/netdev/net-next/c/ae6e653514d1
  - [net-next,v2,8/9] atm: remove unused ATM PHY operations
    https://git.kernel.org/netdev/net-next/c/e44e224e2f44
  - [net-next,v2,9/9] atm: remove orphaned uAPI for deleted drivers, protocols and SVCs
    https://git.kernel.org/netdev/net-next/c/8f9616500c59

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* [PATCH net v2] ice: fix memory leak in ice_lbtest_prepare_rings()
From: Dawei Feng @ 2026-06-16 15:57 UTC
  To: Tony Nguyen
  Cc: Przemek Kitszel, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, intel-wired-lan, netdev,
	linux-kernel, jianhao.xu, Dawei Feng, stable

ice_lbtest_prepare_rings() frees Rx rings only when
ice_vsi_start_all_rx_rings() fails. If ice_vsi_setup_rx_rings() fails
after allocating some descriptors, or if ice_vsi_cfg_lan() fails after
the Rx rings were prepared, the function reaches the Tx cleanup path
without releasing the initialized Rx resources.

Fix this by adding separate unwind paths for Rx setup failure and LAN
configuration failure. The Rx setup failure path releases the partially
prepared Rx rings before freeing Tx rings, while later failures first
undo the LAN Tx configuration and then release the Rx rings in reverse
setup order.
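
The reverse-order unwind described above can be sketched in plain C.  This is
a toy model rather than driver code: struct res_state, prepare() and the
fail_at parameter are hypothetical, and each step is assumed to leave partial
state behind when it fails, which is why every label also releases the
resource of the step that jumped to it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags standing in for Tx rings, Rx rings and LAN Tx config. */
struct res_state {
	bool tx;
	bool rx;
	bool lan;
};

/*
 * fail_at selects the step that fails:
 * 0 = none, 1 = Tx setup, 2 = Rx setup, 3 = LAN config, 4 = Rx start.
 */
static int prepare(struct res_state *s, int fail_at)
{
	s->tx = true;              /* Tx setup, possibly partial on failure */
	if (fail_at == 1)
		goto err_setup_tx;

	s->rx = true;              /* Rx setup, possibly partial on failure */
	if (fail_at == 2)
		goto err_setup_rx;

	s->lan = true;             /* LAN Tx configuration */
	if (fail_at == 3)
		goto err_cfg_lan;

	if (fail_at == 4)          /* starting the Rx rings failed */
		goto err_cfg_lan;

	return 0;

	/* Labels fall through, so cleanup runs in reverse setup order. */
err_cfg_lan:
	s->lan = false;
err_setup_rx:
	s->rx = false;
err_setup_tx:
	s->tx = false;
	return -1;
}
```

Whichever step fails, no flag stays set on return, which is the property the
two new unwind paths are meant to restore.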

The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available. Manual inspection confirms that the bug is still
present in v7.1-rc7.

An x86_64 allyesconfig build showed no new warnings. As we do not have an
Intel E800 Series adapter available to run the ethtool offline loopback
selftest, no runtime testing could be performed.

Fixes: 0e674aeb0b77 ("ice: Add handler for ethtool selftest")
Cc: stable@vger.kernel.org
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
---
Changes in v2:
- Fix cleanup order

 drivers/net/ethernet/intel/ice/ice_ethtool.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index f28416a707d7..10a4abc66974 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -1069,18 +1069,18 @@ static int ice_lbtest_prepare_rings(struct ice_vsi *vsi)
 
 	status = ice_vsi_cfg_lan(vsi);
 	if (status)
-		goto err_setup_rx_ring;
+		goto err_cfg_lan;
 
 	status = ice_vsi_start_all_rx_rings(vsi);
 	if (status)
-		goto err_start_rx_ring;
+		goto err_cfg_lan;
 
 	return 0;
 
-err_start_rx_ring:
-	ice_vsi_free_rx_rings(vsi);
-err_setup_rx_ring:
+err_cfg_lan:
 	ice_vsi_stop_lan_tx_rings(vsi, ICE_NO_RESET, 0);
+err_setup_rx_ring:
+	ice_vsi_free_rx_rings(vsi);
 err_setup_tx_ring:
 	ice_vsi_free_tx_rings(vsi);
 
-- 
2.34.1


* [PATCH 6.18 266/325] rxrpc: Fix the ACK parser to extract the SACK table for parsing
From: Greg Kroah-Hartman @ 2026-06-16 15:01 UTC
  To: stable
  Cc: Greg Kroah-Hartman, patches, Michael Bommarito, David Howells,
	Marc Dionne, Jeffrey Altman, Eric Dumazet, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, linux-afs, netdev,
	stable
In-Reply-To: <20260616145057.827196531@linuxfoundation.org>

6.18-stable review patch.  If anyone has any objections, please let me know.

------------------

From: David Howells <dhowells@redhat.com>

commit 333b6d5bb9f87827ac2639c737bf9613dbae7253 upstream.

Fix modification of the received skbuff in rxrpc_input_soft_acks() and a
potential incorrect access of the buffer in a fragmented UDP packet (the
packet would probably have to be deliberately pre-generated as fragmented)
when AF_RXRPC tries to extract the contents of the SACK table by copying
out the contents of the SACK table into a buffer before attempting to parse it.

AF_RXRPC assumes that it can just call skb_condense() and then validly
access the SACK table from skb->data and that it will be a flat buffer -
but skb_condense() can silently fail to do anything under some
circumstances.

Note that whilst rxrpc_input_soft_acks() should be able to parse extended
ACKs, the rest of AF_RXRPC doesn't currently support that.

Further, there's then no need to call skb_condense() in rxrpc_input_ack(),
so don't.
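
As a userspace illustration of why copying out is safer than assuming a flat
buffer, the sketch below gathers bytes from a packet split over several
fragments.  It is only loosely modelled on what skb_copy_bits() provides for
an skb; struct frag and frag_copy_bits() are hypothetical names, not kernel
APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a non-linear packet: data split over fragments. */
struct frag {
	const unsigned char *data;
	size_t len;
};

/*
 * Copy len bytes starting at offset into dst, walking the fragments in order.
 * Returns 0 on success, or -1 if the packet is too short to satisfy the copy
 * (dst may then hold a partial copy).
 */
static int frag_copy_bits(const struct frag *frags, size_t nfrags,
			  size_t offset, unsigned char *dst, size_t len)
{
	for (size_t i = 0; i < nfrags && len > 0; i++) {
		size_t n;

		if (offset >= frags[i].len) {
			/* Requested range starts after this fragment. */
			offset -= frags[i].len;
			continue;
		}

		n = frags[i].len - offset;
		if (n > len)
			n = len;
		memcpy(dst, frags[i].data + offset, n);
		dst += n;
		len -= n;
		offset = 0;    /* later fragments are read from their start */
	}

	return len == 0 ? 0 : -1;
}
```

A parser that works on the flat destination buffer never has to care where
the fragment boundaries fall, and a short packet is reported as an error
instead of being read past its end.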

Fixes: d57a3a151660 ("rxrpc: Save last ACK's SACK table rather than marking txbufs")
Reported-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://lore.kernel.org/r/20260513180907.2061972-1-michael.bommarito@gmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: stable@kernel.org
Link: https://patch.msgid.link/105362.1780573560@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/rxrpc/input.c |   26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -963,23 +963,34 @@ static void rxrpc_input_soft_acks(struct
 	struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
 	struct rxrpc_txqueue *tq = call->tx_queue;
 	unsigned long extracted = ~0UL;
-	unsigned int nr = 0;
+	unsigned int nr = 0, nsack;
 	rxrpc_seq_t seq = call->acks_hard_ack + 1;
 	rxrpc_seq_t lowest_nak = seq + sp->ack.nr_acks;
-	u8 *acks = skb->data + sizeof(struct rxrpc_wire_header) + sizeof(struct rxrpc_ackpacket);
+	u8 sack[256] __aligned(sizeof(unsigned long));
+	u8 *acks = sack;
 
 	_enter("%x,%x,%u", tq->qbase, seq, sp->ack.nr_acks);
 
 	while (after(seq, tq->qbase + RXRPC_NR_TXQUEUE - 1))
 		tq = tq->next;
 
+	/* Extract an individual SACK table.  A normal SACK table is up to 255
+	 * bytes with 1 ACK flag per byte, but an extended SACK table can be up
+	 * to 256 bytes with up to 8 ACK/NACK flags per byte.  The ACK flags go
+	 * across all bit 0's then all bit 1's, then all bit 2's, ...
+	 */
+	memset(sack, 0, sizeof(sack));
+	nsack = umin(sp->ack.nr_acks, 256);
+	if (skb_copy_bits(skb,
+			  sizeof(struct rxrpc_wire_header) + sizeof(struct rxrpc_ackpacket),
+			  sack, nsack) < 0)
+		return;
+
 	for (unsigned int i = 0; i < sp->ack.nr_acks; i++) {
 		/* Decant ACKs until we hit a txqueue boundary. */
+		if ((i & 255) == 0)
+			acks = sack;
 		shiftr_adv_rotr(acks, extracted);
-		if (i == 256) {
-			acks -= i;
-			i = 0;
-		}
 		seq++;
 		nr++;
 		if ((seq & RXRPC_TXQ_MASK) != 0)
@@ -1117,9 +1128,6 @@ static void rxrpc_input_ack(struct rxrpc
 	    skb_copy_bits(skb, ioffset, &trailer, sizeof(trailer)) < 0)
 		return rxrpc_proto_abort(call, 0, rxrpc_badmsg_short_ack_trailer);
 
-	if (nr_acks > 0)
-		skb_condense(skb);
-
 	call->acks_latest_ts = ktime_get_real();
 	call->acks_hard_ack = hard_ack;
 	call->acks_prev_seq = prev_pkt;




* Re: [PATCH net-next 0/5] tls: reject the combination of TLS and sockmap
From: patchwork-bot+netdevbpf @ 2026-06-16 16:10 UTC
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, bpf, jakub,
	john.fastabend, sd
In-Reply-To: <20260614014102.461064-1-kuba@kernel.org>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Sat, 13 Jun 2026 18:40:55 -0700 you wrote:
> There are no known TLS+sockmap users and it has some known
> hard to solve bugs. Let's reject this configuration as we
> discussed a number of times.
> 
> Jakub Kicinski (5):
>   tls: reject the combination of TLS and sockmap
>   tls: remove dead sockmap (psock) handling from the SW path
>   selftests/bpf: remove sockmap + ktls tests
>   selftests/bpf: drop the unused kTLS program from test_sockmap
>   selftests/bpf: test that TLS crypto is rejected on a sockmap socket
> 
> [...]

Here is the summary with links:
  - [net-next,1/5] tls: reject the combination of TLS and sockmap
    https://git.kernel.org/netdev/net-next/c/460e6486617c
  - [net-next,2/5] tls: remove dead sockmap (psock) handling from the SW path
    https://git.kernel.org/netdev/net-next/c/79511603a65b
  - [net-next,3/5] selftests/bpf: remove sockmap + ktls tests
    https://git.kernel.org/netdev/net-next/c/faf89584e436
  - [net-next,4/5] selftests/bpf: drop the unused kTLS program from test_sockmap
    https://git.kernel.org/netdev/net-next/c/6af8971d910e
  - [net-next,5/5] selftests/bpf: test that TLS crypto is rejected on a sockmap socket
    https://git.kernel.org/netdev/net-next/c/5949a7cf11e6

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* [PATCH net] netconsole: don't drop the last byte of a full-sized message
From: Breno Leitao @ 2026-06-16 16:09 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, linux-kernel, asantostc, gustavold, kernel-team,
	Breno Leitao

nt->buf is exactly MAX_PRINT_CHUNK bytes, but scnprintf() reserves one
byte for its NUL terminator, so a non-fragmented payload of exactly
MAX_PRINT_CHUNK loses its last byte (emitted as a stray NUL in the
release path). Grow nt->buf to MAX_PRINT_CHUNK + 1 and bound the
scnprintf() calls with sizeof(nt->buf); the transmitted length stays
capped at MAX_PRINT_CHUNK.

Alternatively, nt->buf could be left at MAX_PRINT_CHUNK and the NUL byte
reserved by routing exactly-MAX_PRINT_CHUNK payloads to fragmentation
('len < MAX_PRINT_CHUNK'), at the cost of fragmenting those messages.
But it would look less sane, thus the current approach.
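
The off-by-one can be reproduced in userspace with a small approximation of
scnprintf().  model_scnprintf() and CHUNK below are made-up names for this
sketch (the kernel helper itself is not available outside the kernel); the
assumption is only that, like scnprintf(), it returns the number of
characters actually stored, excluding the NUL terminator.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 8  /* small stand-in for MAX_PRINT_CHUNK */

/*
 * Userspace approximation of the kernel's scnprintf(): format into buf and
 * return how many characters were stored, not counting the NUL terminator.
 */
static int model_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (size == 0)
		return 0;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0)
		return 0;
	/* One byte of the buffer is always reserved for the terminator. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}
```

With a CHUNK-byte buffer an 8-character payload is cut to 7 characters, while
a buffer of CHUNK + 1 bytes holds all 8, which mirrors the sizing change made
to nt->buf.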

Fixes: c62c0a17f9b7 ("netconsole: Append kernel version to message")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/net/netconsole.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 57dd6821a8aa9..bfab0a47678c9 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -184,8 +184,10 @@ struct netconsole_target {
 	bool			extended;
 	bool			release;
 	struct netpoll		np;
-	/* protected by target_list_lock */
-	char			buf[MAX_PRINT_CHUNK];
+	/* protected by target_list_lock; +1 gives scnprintf() room for its
+	 * NUL terminator so a full MAX_PRINT_CHUNK payload is not truncated
+	 */
+	char			buf[MAX_PRINT_CHUNK + 1];
 	struct work_struct	resume_wq;
 };
 
@@ -1692,7 +1694,7 @@ static void send_msg_no_fragmentation(struct netconsole_target *nt,
 	if (release_len) {
 		release = init_utsname()->release;
 
-		scnprintf(nt->buf, MAX_PRINT_CHUNK, "%s,%.*s", release,
+		scnprintf(nt->buf, sizeof(nt->buf), "%s,%.*s", release,
 			  msg_len, msg);
 		msg_len += release_len;
 	} else {
@@ -1701,12 +1703,12 @@ static void send_msg_no_fragmentation(struct netconsole_target *nt,
 
 	if (userdata)
 		msg_len += scnprintf(&nt->buf[msg_len],
-				     MAX_PRINT_CHUNK - msg_len, "%s",
+				     sizeof(nt->buf) - msg_len, "%s",
 				     userdata);
 
 	if (sysdata)
 		msg_len += scnprintf(&nt->buf[msg_len],
-				     MAX_PRINT_CHUNK - msg_len, "%s",
+				     sizeof(nt->buf) - msg_len, "%s",
 				     sysdata);
 
 	send_udp(nt, nt->buf, msg_len);

---
base-commit: fbc6a80cb5d3fd4ac4b56e8c9d791dd17be890c4
change-id: 20260616-max_print_chunk-0a8cea1b1ed7

Best regards,
-- 
Breno Leitao <leitao@debian.org>



* Re: [PATCH 3/4] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
From: Stefano Garzarella @ 2026-06-16 16:13 UTC
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <021a6604-289c-4dd8-a0be-33c7812c0105@virtuozzo.com>

On Tue, Jun 16, 2026 at 06:58:40PM +0300, Andrey Drobyshev wrote:
>On 6/16/26 5:18 PM, Stefano Garzarella wrote:
>> On Fri, Jun 12, 2026 at 07:57:17PM +0300, Andrey Drobyshev wrote:

[...]

>>> static u32 vhost_transport_get_local_cid(void)
>>> @@ -311,11 +312,17 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
>>> 	 * the mutex would be too expensive in this hot path, and we already have
>>> 	 * all the outcomes covered: if the backend becomes NULL right after the check,
>>> 	 * vhost_transport_do_send_pkt() will check it under the mutex anyway.
>>> +	 *
>>> +	 * Don't fast-fail if cpr_paused is set, keep queueing skbs instead.
>>> +	 * The kick in vhost_vsock_start() will drain them on resume.
>>> 	 */
>>> 	if (unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
>>> -		rcu_read_unlock();
>>> -		kfree_skb(skb);
>>> -		return -EHOSTUNREACH;
>>> +		smp_rmb();	/* pairs with smp_wmb() in start/drop_backends */
>>> +		if (!READ_ONCE(vsock->cpr_paused)) {
>>
>> Can we avoid this which is not really readable and maybe add a single
>> variable to control the fast-fail at all?
>>
>> I mean replacing both cpr_paused + backend-pointer with a single
>> `started` flag: set it to false at open, true on start via
>> smp_store_release(), back to false on normal stop, and leave it true
>> during CPR pause.
>>
>> The reader in send_pkt can do just:
>>
>>      if (!smp_load_acquire(&vsock->started))
>>          return -EHOSTUNREACH;
>>
>> WDYT?
>>
>
>I don't think it's gonna work as suggested.  As I understand, the order
>during CPR migration is:
>
>1) SET_RUNNING(0)
>       -> vhost_vsock_stop()
>           -> vhost_vsock_drop_backends()
>2) RESET_OWNER
>       -> vhost_vsock_drop_backends()
>3) SET_OWNER
>4) SET_RUNNING(1)
>       -> vhost_vsock_start
>           -> for (...) vhost_vq_set_backend()
>
>(Btw I just noticed backends are already NULL at step 2), but that's
>just our CPR case, for any potential RESET_OWNER users it might not be
>the case).
>
>So the race window starts from 1) (not from 2)).  We have no way of
>differentiating whether device is actually being stopped for good, or
>we're in the middle of CPR.  If we set the flag to false on stop as you
>suggested, we'll still hit the -EHOSTUNREACH case eventually, and
>avoiding it is the whole purpose of this patch.
>
>The fast-fail with -EHOSTUNREACH relies on the presence of backends.
>IIUC the backend will only become set after initial SET_RUNNING(1),
>which will only happen once the guest driver writes smth to virtio
>config register, QEMU catches it and calls SET_RUNNING(1).  So we have
>ordering with the guest's actions here, which is logical.  But for our
>issue that means that the only true marker of paused/not paused is the
>presence of backends - and that's why the flag is set in
>vhost_vsock_drop_backends().

Okay, so what about not setting `started` to false in
SET_RUNNING(0)? I mean, use it just to track the first SET_RUNNING(1)
(and maybe rename the variable).

Apart from CPR, when can SET_RUNNING(0) occur?

In the end that was just an optimization; if we queue the packet instead,
it is not a big issue IMO.
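
A rough userspace sketch of that idea, using C11 atomics where the kernel
would use smp_store_release()/smp_load_acquire().  The name `ever_started`
and the dev_* helpers are hypothetical, and the actual queueing of the packet
is left out:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy device with a single flag recording the first SET_RUNNING(1). */
struct dev_model {
	atomic_bool ever_started;
};

static void dev_open(struct dev_model *d)
{
	atomic_init(&d->ever_started, false);
}

static void dev_set_running(struct dev_model *d, bool on)
{
	/* Only a start sets the flag; a stop deliberately leaves it alone. */
	if (on)
		atomic_store_explicit(&d->ever_started, true,
				      memory_order_release);
}

static int dev_send_pkt(struct dev_model *d)
{
	if (!atomic_load_explicit(&d->ever_started, memory_order_acquire))
		return -EHOSTUNREACH;  /* guest has never been ready */
	return 0;                      /* would queue; drained on next start */
}
```

In this model a stop after the first start no longer triggers the fast-fail,
so a sender during the CPR window simply has its packet queued.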

>
>>> +			rcu_read_unlock();
>>> +			kfree_skb(skb);
>>> +			return -EHOSTUNREACH;
>>> +		}
>>
>>
>> That said claude here is reporting a potential issue that I think we
>> should consider:
>>      After VHOST_RESET_OWNER, the guest CID stays in the hash, so
>>      vhost_transport_send_pkt() can still find the vsock, skip the
>>      fast-fail (cpr_paused=true), and call vhost_vq_work_queue() while
>>      vhost_workers_free() is freeing workers without a synchronize_rcu()
>>      — risking a use-after-free. Also, any send_pkt_work queued between
>>      the last flush and worker teardown gets its VHOST_WORK_QUEUED bit
>>      stuck (the vhost task exits without draining), deadlocking
>>      host→guest traffic after restart.
>>
>>      A synchronize_rcu() in vhost_workers_free() between the
>>      rcu_assign_pointer(NULL) loop and the destroy loop would close the
>>      use-after-free, and reinitializing send_pkt_work via
>>      vhost_work_init() after vhost_dev_reset_owner() returns would clear
>>      the stuck QUEUED bit.
>>
>>
>
>Yes, this looks real indeed.  Though I couldn't hit the UAF issue while
>testing host->guest transfer under KASAN.
>
>>> 	}
>>>
>>> 	if (virtio_vsock_skb_reply(skb))
>>> @@ -640,6 +647,9 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
>>> 		mutex_unlock(&vq->mutex);
>>> 	}
>>>
>>> +	smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>>> +	WRITE_ONCE(vsock->cpr_paused, false);
>>> +
>>> 	/* Some packets may have been queued before the device was started,
>>> 	 * let's kick the send worker to send them.
>>> 	 */
>>> @@ -671,6 +681,11 @@ static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
>>>
>>> 	lockdep_assert_held(&vsock->dev.mutex);
>>>
>>> +	if (vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) {
>>> +		WRITE_ONCE(vsock->cpr_paused, true);
>>> +		smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>>> +	}
>>
>> Why here and not in vhost_vsock_reset_owner()?
>>
>> Also having this here will set it to true also with
>> VHOST_VSOCK_SET_RUNNING(0), is that right?
>>
>
>That was added here precisely to cover the vhost_vsock_stop() case (see
>above).

I see now, a comment or something in the commit would have helped.

Thanks,
Stefano



* Re: Ethtool : PRBS feature
From: Alexander H Duyck @ 2026-06-16 16:14 UTC
  To: Das, Shubham, Andrew Lunn
  Cc: netdev@vger.kernel.org, mkubecek@suse.cz, D H, Siddaraju,
	Chintalapalle, Balaji
In-Reply-To: <SN7PR11MB81099B4885C10E16D52A2EB9FFE52@SN7PR11MB8109.namprd11.prod.outlook.com>

On Tue, 2026-06-16 at 12:14 +0000, Das, Shubham wrote:
> Hi Andrew,
> 
> Thanks for the feedback.
> 
> Yes, for multi-lane ports we can accept the lane number as an argument like:
> 
> ethtool --phy-test eth1 lane 0 tx-prbs prbs7
> ethtool --phy-test eth2 lane 0 rx-prbs prbs7
> 
> We referred to "Lee Trager's" "Open-Source Tooling for PHY Management and Testing" session:
> https://netdevconf.info/0x19/sessions/talk/open-source-tooling-for-phy-management-and-testing.html?.
> We have been trying to reach "Lee Trager" to seek more input and the latest update on the approach, and to understand whether there is a parallel effort in progress so we can collaborate.
> If you can, please help me connect with "Lee Trager" and others who expressed interest in Ethernet PRBS. We are happy to align and start implementation.
> 

You aren't going to have much luck if you are trying to reach out via
his Meta address as he has moved on to Nvidia so he is no longer working
on the fbnic driver.

As far as the work done, most of it was internal and made use of
debugfs. I don't believe any of the work for fbnic began to approach
the suggested methods for upstreaming the feature, as Lee had been
pulled into other efforts.

> About standardizing across other buses like PCIe and USB, I had a quick discussion with our internal designers, but I didn't observe any interest in such SW-level config knobs.
> Looks like Ethernet has clear interest and we are joining that Ethernet PRBS community too.

I think it largely depends on what your implementation looks like. The
point being made was that many of the SerDes PHYs out there are capable
of use in multiple applications. So instead of being a networking
device you would be looking at a SerDes PHY such as those in
"/drivers/phy/".

Also do you know what layer in the PHY you are injecting this PRBS at?
I would be curious if this is PCS or at the PMD level?

If you are referring to the PCS level then yes, it would make sense to
have it in the networking subsystem as the PCS at this point is more a
netdev specific set of drivers, see "/drivers/net/pcs/".

In the case of the PMD that is where things get a bit more interesting.
There is an IEEE c45 register definition that includes PRBS testing
registers, however in the case of our implementation the PMD doesn't
follow that specification and follows more the "/drivers/phy/" model.

> Ethernet PRBS configuration and diagnostics support is well established and already widely used in existing Ethernet SERDES deployments.
> We think Ethernet is the most natural starting point within netdev, as it aligns with current driver practice and existing validation workflows. 

The problem is many of these parts used as an Ethernet Serdes PMD are
really a multiuse part. So for example in the case of the hardware in
FBNIC we use the same part on the Ethernet PHY as we do for the PCIe
Gen5 PHY.

The complication in our case is that both are buried behind our FW due
to the fact that both are shared between slices. However for testing
purposes and such we could look at disabling the odd slices to
essentially unshare the hardware if you need another platform to test
something like this with.

^ permalink raw reply

* Re: [PATCH RFC 4/9] net: stmmac: qcom-ethqos: add per-platform NOC clock voting
From: Mohd Ayaan Anwar @ 2026-06-16 16:17 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Richard Cochran, Bjorn Andersson, Konrad Dybcio, Maxime Coquelin,
	Alexandre Torgue, Russell King, linux-arm-msm, netdev, devicetree,
	linux-kernel, linux-stm32, linux-arm-kernel
In-Reply-To: <45d7faac-7c0f-4f89-808e-06129e8420e4@oss.qualcomm.com>

Hi Konrad,
On Mon, Jun 15, 2026 at 02:13:05PM +0200, Konrad Dybcio wrote:
> On 6/11/26 8:37 PM, Mohd Ayaan Anwar wrote:
> > Some SoCs gate the EMAC's path to the System NOC behind dedicated clocks
> > that must be enabled before the DMA can reach memory.  Add
> > ethqos_noc_clk_cfg and the corresponding fields in the driver-data and
> > runtime structs so each compatible can declare its own set with per-clock
> > rates.  The clocks are acquired during probe and enabled/disabled
> > alongside the existing link clock in ethqos_clks_config().
> 
> Sounds like we should use an OPP table instead, we can't just do 
> set_rate() on qcom, as that will not propagate the required perf
> state to the clock controller's supplier power domain (i.e. VDDCX)
> 

Understood, I will test this out for v2.

	Ayaan

^ permalink raw reply

* Re: [Intel-wired-lan] e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Ruinskiy, Dima @ 2026-06-16 16:20 UTC (permalink / raw)
  To: Helge Deller, Andrew Lunn, Helge Deller
  Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev
In-Reply-To: <9d80ed59-5483-4c33-9d27-52fdf24aac6e@gmx.de>

On 15/06/2026 23:36, Helge Deller wrote:
> On 6/15/26 18:41, Andrew Lunn wrote:
>> On Sun, Jun 14, 2026 at 11:48:08PM +0200, Helge Deller wrote:
>>> I'm regularly facing the known "eno1: Detected Hardware Unit Hang:"
>>> with my on-board intel e1000e NIC hardware.
>>> Since none of the various tips on the internet helped, I had the idea
>>> to setup a master/slave bond networking to fail over to another NIC when
>>> the Intel chip hangs.
>>>
>>> Sadly this doesn't work as intended, because the link of the intel NIC
>>> isn't reported "down", so the failover never happens, unless I manually
>>> start "ifconfig eno1 down".
>>>
>>> My question: Shouldn't the intel NIC ideally report Link Down if we know
>>> it hangs? That way a fail-over should at least happen, right?
>>>
>>> Below is a completely untested patch.
>>> Does it make sense that I try to test and/or develop such a patch, or
>>> are there things I miss?
>>
>> If the interface is dead, then setting the carrier down makes a lot of
>> sense. 
> 
> That's what I think as well. Thanks for confirming.
> 
>> One question i have is, what do you need to do to recover the
>> hardware? Will it correctly set the carrier up when you do the
>> recovery?
> 
> The only way I could recover was to unplug the network cable and
> re-insert it.
> I have not tested bringing the NIC down.
> But in both cases the driver will need to re-detect the media & link
> 
>> Also, just looking at your proposed change, it is not clear to me why
>> such an assignment will result in carrier down. It would be good to
>> explain it in the commit message.
> 
> Sure. The patch I attached was completely untested and just based on
> the analysis of the flow and how to make the Link possibly report to be 
> down.
> Maybe someone knowledgeable of the driver has a better suggestion how to
> report the link down situation in a clean way?
> 
> Helge
This does not seem like the right direction to me.

The "Detected Hardware Unit Hang" print does not indicate that the 
interface is dead, but that the transmitter is stalled.

This can be due to an unusually high load, or a HW fault / race 
condition with another component, etc.

When a hang is detected, the transmitter is stopped with 
netif_stop_queue() and eventually ndo_tx_timeout triggers a full reset 
to the device, which in many cases recovers it from the hang.

If the hang is persistent, we try to understand the cause and debug it. 
Permanently marking the device as 'down' because it hung once is not 
going to be the optimal solution.


^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: convert udp_lib_getsockopt to sockopt_t
From: Breno Leitao @ 2026-06-16 16:22 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Willem de Bruijn, Shuah Khan, netdev, linux-kernel,
	linux-kselftest, kernel-team
In-Reply-To: <aiy7ZR7Yz2Z4Ioyd@devvm7509.cco0.facebook.com>

On Fri, Jun 12, 2026 at 07:10:15PM -0700, Stanislav Fomichev wrote:
> On 06/12, Breno Leitao wrote:

> >  int udp_lib_getsockopt(struct sock *sk, int level, int optname,
> > -		       char __user *optval, int __user *optlen)
> > +		       sockopt_t *opt)
> >  {
> >  	struct udp_sock *up = udp_sk(sk);
> >  	int val, len;
> >  
> > -	if (get_user(len, optlen))
> > -		return -EFAULT;
> 
> [..]
> 
> > -	if (len < 0)
> > -		return -EINVAL;
> 
> I see this part now in sockopt_init_user, but you mention that it's a
> transitional helper. When we drop it, will we lose this <0 check?
> Maybe keep `if ((int)opt->optlen < 0)` here for backwards
> compatibility?

Good idea. I will do it and respin (once net-next reopens).

Thanks for the review,
--breno
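
The check being discussed can be sketched in plain userspace C. This is a
hedged illustration, not the planned respin: struct sockopt_sketch and
clamp_getsockopt_len() are hypothetical, the real sockopt_t layout is not shown
in this thread, and the unsigned optlen field is an assumption inferred from
the suggested (int) cast.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for sockopt_t; only the length matters here. */
struct sockopt_sketch {
	unsigned int optlen;	/* assumed unsigned, hence the (int) cast below */
};

/*
 * Reject lengths that would have been negative with the old
 * "int __user *optlen" interface, then clamp to the value size.
 */
static int clamp_getsockopt_len(const struct sockopt_sketch *opt,
				size_t val_size, unsigned int *out_len)
{
	assert(opt != NULL && out_len != NULL);

	/* Conversion of out-of-range values wraps on common compilers. */
	if ((int)opt->optlen < 0)
		return -EINVAL;

	*out_len = opt->optlen < val_size ? opt->optlen : (unsigned int)val_size;
	return 0;
}
```

Keeping the check next to the clamp means it would survive even if the
transitional helper that currently performs it goes away.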

^ permalink raw reply

* [TEST] vlan-bridge-binding-sh flaky
From: Jakub Kicinski @ 2026-06-16 16:27 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, Petr Machata

Hi Ido!

tools/testing/selftests/net/vlan_bridge_binding.sh got flaky recently.

https://netdev.bots.linux.dev/contest.html?test=vlan-bridge-binding-sh

Is it due to Eric's change or something else? Used to be pretty solid.
First flake a week ago and 8 more since; before that there were
0 flakes for 3 weeks.

^ permalink raw reply

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Breno Leitao @ 2026-06-16 16:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, john.ogness, pmladek
  Cc: Jakub Kicinski, Petr Mladek, John Ogness, Sergey Senozhatsky,
	Peter Zijlstra, Vlad Poenaru, Thomas Gleixner, netdev,
	David S . Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260616103529.Yh9Dxsjp@linutronix.de>

On Tue, Jun 16, 2026 at 12:35:29PM +0200, Sebastian Andrzej Siewior wrote:
> On 2026-06-11 19:11:14 [-0700], Jakub Kicinski wrote:
> > On Wed, 10 Jun 2026 11:36:21 -0700 Vlad Poenaru wrote:
> > > @@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
> > > +	local_bh_disable();
> > > + 	poll_napi(dev);
> > > +	_local_bh_enable();
> >
> > tglx, Sebastian, are you okay with using _local_bh_enable() to trick
> > softirq into not waking ksoftirqd? The problematic path is:
> >
> >   scheduler -> printk -> netconsole -> raise softirq -> scheduler (deadlock)
> >
> > so the softirq may never get serviced.
> >
> > In netcons we try to avoid touching the network driver if the Tx path
> > locks are already held. Ideally we'd do something similar with the
> > scheduler. Try to do bare minimum if we may be in the scheduler.
> > Failing that - don't poll the driver if we were called with irqs
> > already disabled.
> >
> > Or maybe we only poll from console->write_thread ?
>
> So this is not an issue since commit 7eab73b18630e ("netconsole: convert
> to NBCON console infrastructure"), because from then on writes are
> deferred to the nbcon thread. So this is purely about -stable in this case.

Does nbcon defer writes to the thread even for consoles that support
atomic operations?

netconsole is marked with CON_NBCON_ATOMIC_UNSAFE, which means it rarely
performs inline/direct printk and instead pushes to the thread, which
flushes in a safe context.

For drivers that behave correctly, I'd like to be able to drop
CON_NBCON_ATOMIC_UNSAFE, potentially setting it at runtime based on the
underlying driver capabilities. If netconsole is backed by a well-behaving
network driver, we could eventually remove the flag (!?)

Would that approach cause any issues?

^ permalink raw reply

* Re: [PATCH RFC 3/9] net: stmmac: qcom-ethqos: fix RGMII_ID mode to use DLL bypass
From: Mohd Ayaan Anwar @ 2026-06-16 16:32 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Richard Cochran, Bjorn Andersson, Konrad Dybcio, Maxime Coquelin,
	Alexandre Torgue, Russell King, linux-arm-msm, netdev, devicetree,
	linux-kernel, linux-stm32, linux-arm-kernel
In-Reply-To: <82705420-771d-41bf-a4d9-ed94dff86ff0@lunn.ch>

On Mon, Jun 15, 2026 at 06:48:55PM +0200, Andrew Lunn wrote:
> > > I'm curious how this works at the moment? Do no boards make use of
> > > RGMII ID? Are all current boards broken?
> > 
> > Searching through the DTS, I found that we have two boards using "rgmii"
> > (qcs404-evb-4000.dts and sa8155-adp.dts) and another board using
> > "rgmii-txid" (sa8540p-ride.dts). No board which uses RGMII ID.
> 
> So this causes problems. We cannot break existing boards, yet it would
> be good to fix the current broken behaviour.

I am trying to track down the sa8155-adp and sa8540p-ride boards. The
EMAC on QCS404 is extremely similar to QCS615 Ride [0], and I got that
board to work with this series (with RGMII ID mode). So I am fairly
confident that QCS404 would not break (if it's even booting up with the
upstream kernel currently). Also, I think we could change the phy-mode
for QCS404 to "rgmii-id" from "rgmii" if these fixes go in.

> It could be the best way forward is that you issue a warning when
> "rgmii" is found and pass rgmii-id to the PHY. And you also change the
> two boards to use rgmii-id. Let's think about the rgmii-txid case once
> we better understand it.
> 

As Konrad mentioned, it would be great to know if we can test out these
boards. Looking at the different versions of the ETHQOS programming
guide, stopping MAC side delay should be as simple as what we are doing
in this commit. But whether the two boards work directly with the
default PHY delays is unknown.

	Ayaan

[0] The proposed RGMII fixes would help enable ethernet on QCS615 Ride
as well. I see that the original series had a lot of issues:
https://lore.kernel.org/all/20250121-dts_qcs615-v3-0-fa4496950d8a@quicinc.com/

^ permalink raw reply

* Re: [PATCH net] appletalk: fix use-after-free in atalk_find_primary()
From: Simon Horman @ 2026-06-16 16:34 UTC (permalink / raw)
  To: Yizhou Zhao
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Kees Cook, Kito Xu, linux-kernel, Yuxiang Yang,
	Ao Wang, Xuewei Feng, Qi Li, Ke Xu, stable
In-Reply-To: <20260615103930.1484-1-zhaoyz24@mails.tsinghua.edu.cn>

On Mon, Jun 15, 2026 at 06:39:28PM +0800, Yizhou Zhao wrote:
> atalk_find_primary() walks the global AppleTalk interface list under
> atalk_interfaces_lock, but returns a pointer to iface->address after
> dropping that lock.  Both atalk_autobind() and atalk_bind() then
> dereference the returned pointer without any lifetime protection.
> 
> The interface can be removed concurrently through the normal AppleTalk
> interface ioctl path.  SIOCATALKDIFADDR calls atalk_dev_down(), which
> eventually reaches atif_drop_device() and frees the same struct
> atalk_iface that owns the returned address field.  A racing bind can
> therefore read from freed memory.
> 
> This is reachable with a configured AppleTalk interface; reproducing the
> race does not require a malicious device or driver.  The configuration
> ioctls require CAP_NET_ADMIN in the initial user namespace, and
> AF_APPLETALK sockets are limited to init_net.
> 
> Fix the lifetime issue without changing the returned address pointer
> type.  Rename the helper to atalk_find_primary_locked() and keep
> atalk_interfaces_lock held across the return.  The callers now copy
> s_net and s_node while the lock is still held, then immediately release
> the lock before doing any further work.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: stable@vger.kernel.org
> Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
> Reported-by: Ao Wang <wangao@seu.edu.cn>
> Reported-by: Xuewei Feng <fengxw06@126.com>
> Reported-by: Qi Li <qli01@tsinghua.edu.cn>
> Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
> Assisted-by: GLM:GLM-5.1
> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
> ---
> diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
> index 30a6dc06291c..4d6576cd0ae8 100644
> --- a/net/appletalk/ddp.c
> +++ b/net/appletalk/ddp.c
> @@ -351,7 +351,7 @@ struct atalk_addr *atalk_find_dev_addr(struct net_device *dev)
>  	return iface ? &iface->address : NULL;
>  }
>  

A kernel doc for atalk_find_dev_addr which describes the locking
expectations is probably warranted here.

> -static struct atalk_addr *atalk_find_primary(void)
> +static struct atalk_addr *atalk_find_primary_locked(void)
>  {
>  	struct atalk_iface *fiface = NULL;
>  	struct atalk_addr *retval;
> @@ -378,7 +378,6 @@ static struct atalk_addr *atalk_find_primary(void)
>  	else
>  		retval = NULL;
>  out:
> -	read_unlock_bh(&atalk_interfaces_lock);

This function still acquires atalk_interfaces_lock but I don't think that
asymmetry is justified. If the critical section needs to be expanded then I
think it would be best to both acquire and release the lock in the caller.

>  	return retval;
>  }
>  
> @@ -1132,20 +1131,24 @@ static int atalk_autobind(struct sock *sk)
>  {
>  	struct atalk_sock *at = at_sk(sk);
>  	struct sockaddr_at sat;
> -	struct atalk_addr *ap = atalk_find_primary();
> +	struct atalk_addr *ap = atalk_find_primary_locked();
>  	int n = -EADDRNOTAVAIL;

We could take this opportunity to move towards reverse xmas tree here.

>  
>  	if (!ap || ap->s_net == htons(ATADDR_ANYNET))
> -		goto out;
> +		goto unlock_and_out;
>  
>  	at->src_net  = sat.sat_addr.s_net  = ap->s_net;
>  	at->src_node = sat.sat_addr.s_node = ap->s_node;
> +	read_unlock_bh(&atalk_interfaces_lock);

The unlock_and_out label applies to the critical section which ends here.
But in my mind the goto construct is best used for handling errors 
that apply to, and in general accumulate during, the flow of a function.

Combining that with my earlier comments, I would go for something like the
following (completely untested!). Similarly in atalk_bind().

	struct atalk_sock *at = at_sk(sk);
	struct sockaddr_at sat;
	int n = -EADDRNOTAVAIL;
	struct atalk_addr *ap;

	read_lock_bh(&atalk_interfaces_lock);
	ap = atalk_find_primary_locked();

	if (ap && ap->s_net != htons(ATADDR_ANYNET)) {
		at->src_net  = sat.sat_addr.s_net  = ap->s_net;
		at->src_node = sat.sat_addr.s_node = ap->s_node;
	}

	read_unlock_bh(&atalk_interfaces_lock);

>  
>  	n = atalk_pick_and_bind_port(sk, &sat);
>  	if (!n)
>  		sock_reset_flag(sk, SOCK_ZAPPED);
>  out:
>  	return n;
> +unlock_and_out:
> +	read_unlock_bh(&atalk_interfaces_lock);
> +	goto out;
>  }
>  
>  /* Set the address 'our end' of the connection */

...

-- 
pw-bot: changes-requested

^ permalink raw reply

* Re: [PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
From: Kory Maincent @ 2026-06-16 16:42 UTC (permalink / raw)
  To: Carlo Szelinsky
  Cc: Corey Leavitt, Jakub Kicinski, Russell King, Oleksij Rempel,
	Andrew Lunn, Heiner Kallweit, David S . Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, netdev, linux-kernel
In-Reply-To: <20260615180812.829678-1-github@szelinsky.de>

Hello Carlo,

On Mon, 15 Jun 2026 20:08:12 +0200
Carlo Szelinsky <github@szelinsky.de> wrote:

> Hi Corey,
> 
> just checking in on this one. Did you get a chance to continue with the
> series, or is there anything I can help with to move it forward? I'm
> happy to test a v2, and I can still run the SFP path on the
> S600WP-5GT-2SX-SE once it's back on my desk.
> 
> Kory, Jakub, Russell :-) it would be great to hear your view on the
> approach so Corey can plan the next version. The series fixed the probe
> loop in my testing and I'd really like to see it land.

I haven't heard from Corey since this patch series, but I am in favor of this
notifier design.
Corey, do you have time to continue this work? If not, would it be okay for
Carlo to continue it for you?

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [PATCH RFC 8/9] arm64: dts: qcom: shikra-cqs-evk: Enable ethernet0
From: Mohd Ayaan Anwar @ 2026-06-16 16:50 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Richard Cochran, Bjorn Andersson, Konrad Dybcio, Maxime Coquelin,
	Alexandre Torgue, Russell King, linux-arm-msm, netdev, devicetree,
	linux-kernel, linux-stm32, linux-arm-kernel
In-Reply-To: <2cb658f3-f564-4396-884d-d025eaa674a1@oss.qualcomm.com>

On Tue, Jun 16, 2026 at 11:50:26AM +0200, Konrad Dybcio wrote:
> On 6/11/26 8:37 PM, Mohd Ayaan Anwar wrote:
> 
> > +&tlmm {
> > +	ethernet0_defaults: ethernet0-defaults-state {
> 
> s/defaults/default
> 
> Please move this definition to shikra.dtsi
> 

The CQM and CQS variants have identical GPIO mapping but the IQS is
different. So should I keep this in shikra.dtsi and overwrite for IQS in
shikra-iqs-evk.dts?


> > +
> > +	emac0_phy_en_hog: emac0-phy-en-hog {
> > +		gpio-hog;
> > +		gpios = <149 GPIO_ACTIVE_HIGH>;
> > +		output-high;
> > +		line-name = "emac0-phy-en";
> > +	};
> 
> This looks like a hack - what does this pin actually do?
> 

The power supply to both PHYs on Shikra is gated by a GPIO pin. I am
unsure whether they should be modelled as a fixed, enable-on-boot
regulator or just like this. They need to be powered on early so that
MDIO can detect them.

Thank you for the review. I will fix the stylistic issues in v2.

	Ayaan

^ permalink raw reply

* Re: [Intel-wired-lan] e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Helge Deller @ 2026-06-16 16:55 UTC (permalink / raw)
  To: Ruinskiy, Dima, Andrew Lunn, Helge Deller
  Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev
In-Reply-To: <51828156-e859-44db-9926-c076796d0f75@intel.com>

Hello Dima,

On 6/16/26 18:20, Ruinskiy, Dima wrote:
> On 15/06/2026 23:36, Helge Deller wrote:
>> On 6/15/26 18:41, Andrew Lunn wrote:
>>> On Sun, Jun 14, 2026 at 11:48:08PM +0200, Helge Deller wrote:
>>>> I'm regularly facing the known "eno1: Detected Hardware Unit Hang:"
>>>> with my on-board intel e1000e NIC hardware.
>>>> Since none of the various tips on the internet helped, I had the idea
>>>> to setup a master/slave bond networking to fail over to another NIC when
>>>> the Intel chip hangs.
>>>>
>>>> Sadly this doesn't work as intended, because the link of the intel NIC
>>>> isn't reported "down", so the failover never happens, unless I manually
>>>> start "ifconfig eno1 down".
>>>>
>>>> My question: Shouldn't the intel NIC ideally report Link Down if we know
>>>> it hangs? That way a fail-over should at least happen, right?
>>>>
>>>> Below is a completely untested patch.
>>>> Does it make sense that I try to test and/or develop such a patch, or
>>>> are there things I miss?
>>>
>>> If the interface is dead, then setting the carrier down makes a lot of
>>> sense. 
>>
>> That's what I think as well. Thanks for confirming.
>>
>>> One question i have is, what do you need to do to recover the
>>> hardware? Will it correctly set the carrier up when you do the
>>> recovery?
>>
>> The only way I could recover was to unplug the network cable and re-insert it.
>> I have not tested bringing the NIC down.
>> But in both cases the driver will need to re-detect the media & link
>>
>>> Also, just looking at your proposed change, it is not clear to me why
>>> such an assignment will result in carrier down. It would be good to
>>> explain it in the commit message.
>>
>> Sure. The patch I attached was completely untested and just based on
>> the analysis of the flow and how to make the Link possibly report to be down.
>> Maybe someone knowledgeable of the driver has a better suggestion how to
>> report the link down situation in a clean way?
>>
>> Helge
> This does not seem like the right direction to me.
> 
> The "Detected Hardware Unit Hang" print does not indicate that the
> interface is dead, but that the transmitter is stalled.

Ok. But effectively it means nothing can be transmitted at this stage,
which is much the same as if the link were down.

> This can be due to an unusually high load, or a HW fault / race condition with another component, etc.
>
> When a hang is detected, the transmitter is stopped with
> netif_stop_queue() and eventually ndo_tx_timeout triggers a full
> reset to the device, which in many cases recovers it from the hang.

That would be optimal, but I have not seen it recover from such stalls in years.
Also looking at the many reports on the internet, people say it just
hangs and does not recover until the cable is unplugged (I might be wrong!).

> If the hang is persistent, we try to understand the cause and debug
> it. Permanently marking the device as 'down' because it hung once is
> not going to be the optimal solution.

Of course debugging this situation is preferred, but it does not help when
the remote production system stays unreachable forever.
Right now it just fills the syslog with the same stuck message.
Even a module option like "report_link_down_on_hang after 5 automatic re-tries"
would be a good compromise. You should still be able to get the necessary
debug info then.

Helge
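
The "report link down after N automatic re-tries" idea can be sketched as a
small counter. This is a userspace illustration under stated assumptions, not
e1000e code: struct hang_tracker, on_tx_hang(), on_tx_progress() and the
HANG_* actions are hypothetical, and no such module option exists in the
driver as far as this thread shows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-adapter state for the proposed policy. */
struct hang_tracker {
	unsigned int consecutive_hangs;	/* hangs since the last TX progress */
	unsigned int max_retries;	/* e.g. 5, from a module option */
};

enum hang_action { HANG_RESET, HANG_REPORT_LINK_DOWN };

/* Called each time a TX hang is detected. */
static enum hang_action on_tx_hang(struct hang_tracker *t)
{
	assert(t != NULL);

	/* The first max_retries hangs get a reset; the next one gives up. */
	if (++t->consecutive_hangs > t->max_retries)
		return HANG_REPORT_LINK_DOWN;
	return HANG_RESET;
}

/* Called when the transmitter completes work again after a reset. */
static void on_tx_progress(struct hang_tracker *t)
{
	assert(t != NULL);
	t->consecutive_hangs = 0;
}
```

Resetting the counter on progress keeps a single transient hang from ever
marking the link down, which is the concern raised against the original patch.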

^ permalink raw reply

* Re: [PATCH net v3] tipc: fix slab-use-after-free Read in tipc_aead_decrypt_done
From: Simon Horman @ 2026-06-16 16:58 UTC (permalink / raw)
  To: doruk
  Cc: Simon Horman, jmaloy, davem, edumazet, kuba, pabeni,
	aleksander.lobakin, tipc-discussion, netdev, linux-kernel, stable
In-Reply-To: <20260615114618.71249-1-doruk@0sec.ai>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
tipc: fix slab-use-after-free Read in tipc_aead_decrypt_done

This commit attempts to fix a slab-use-after-free read in
tipc_aead_decrypt_done by taking a reference on the network namespace
before starting asynchronous decryption. It mirrors the encrypt-side fix
by keeping the reference across the async return and dropping it when done.

> diff --git a/net/tipc/crypto.c b/net/tipc/crypto.c
> index 6d3b6b89b1d17..84a6489da0360 100644
> --- a/net/tipc/crypto.c
> +++ b/net/tipc/crypto.c
> @@ -941,12 +941,20 @@ static int tipc_aead_decrypt(struct net *net, struct tipc_aead *aead,
>  		goto exit;
>  	}
>  
> +	/* Get net to avoid freed tipc_crypto when delete namespace */
> +	if (!maybe_get_net(aead->crypto->net)) {

[Severity: High]
Could accessing aead->crypto->net read freed memory if the namespace
teardown races with packet decryption?

If aead is a TX cluster key, its crypto pointer points to the per-netns
crypto_tx structure. During a concurrent namespace teardown, tipc_exit_net()
synchronously frees crypto_tx without waiting for an RCU grace period.

Since packet reception (tipc_rcv) executes in softirq under rcu_read_lock()
but without holding a namespace reference, dereferencing aead->crypto to get
the net pointer could result in a use-after-free.

Because tipc_aead_decrypt() already receives the struct net *net directly as
its first parameter, would it be safer to use maybe_get_net(net) instead to
avoid accessing the potentially freed aead->crypto structure entirely?
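
The "take a reference only if the object is still alive" pattern behind
maybe_get_net() can be sketched with C11 atomics. This is a userspace
illustration, not the kernel implementation: struct ns_sketch,
ns_get_unless_zero() and ns_put() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted namespace object. */
struct ns_sketch {
	atomic_int refs;
};

/* Take a reference unless the count has already dropped to zero. */
static bool ns_get_unless_zero(struct ns_sketch *ns)
{
	int old;

	assert(ns != NULL);

	old = atomic_load_explicit(&ns->refs, memory_order_relaxed);
	while (old != 0) {
		/* On failure, 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak_explicit(&ns->refs, &old,
							  old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when this was the last one. */
static bool ns_put(struct ns_sketch *ns)
{
	assert(ns != NULL);
	return atomic_fetch_sub_explicit(&ns->refs, 1,
					 memory_order_acq_rel) == 1;
}
```

The pattern only helps if the memory holding the counter is itself still
valid when it is read, which is why the pointer used to reach it matters as
much as the counter logic.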

> +		tipc_bearer_put(b);
> +		rc = -ENODEV;
> +		goto exit;
> +	}
> +
>  	/* Now, do decrypt */
>  	rc = crypto_aead_decrypt(req);
>  	if (rc == -EINPROGRESS || rc == -EBUSY)
>  		return rc;
>  
>  	tipc_bearer_put(b);
> +	put_net(aead->crypto->net);

[Severity: High]
Does this put_net() call have the same use-after-free risk? Could this also
use the net parameter directly instead of dereferencing aead->crypto->net?

[ ... ]

^ permalink raw reply

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Peter Zijlstra @ 2026-06-16 17:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Jakub Kicinski, Petr Mladek, John Ogness, Sergey Senozhatsky,
	Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260616103529.Yh9Dxsjp@linutronix.de>

On Tue, Jun 16, 2026 at 12:35:29PM +0200, Sebastian Andrzej Siewior wrote:

> So this is not an issue since commit 7eab73b18630e ("netconsole: convert
> to NBCON console infrastructure"), because from then on writes are
> deferred to the nbcon thread. So this is purely about -stable in this case.

Hmm, I thought netconsole had some reserved skbs and could do writes
'atomic'-like? That said, it was the 2.6 era the last time I looked at
netconsole.

> Now. The scheduler usually does printk_deferred() because of the rq lock
> so it does not deadlock for various reasons. It is kind of a pity that
> the various WARN macros don't do that.

People have tried, last time was here:

  https://lkml.kernel.org/r/20260611074344.GG48970@noisy.programming.kicks-ass.net

and I hate deferred with a passion. It means you'll never see the
message when you wreck the machine.

> We could add printk_deferred_enter/exit() to all the rq_lock() variants.
> I think PeterZ loves this the most. And Greg will appreciate it too
> while backporting because of all the context changes.

No, not going to happen, ever, sorry. Instead printk should delete
console sem and have printk() itself be atomic safe.

As stated, printk deferred is an abomination and needs to die a horrible
painful death.

As described here:

  https://lkml.kernel.org/r/20260611191922.GK187714@noisy.programming.kicks-ass.net

"So printk should:

 - stick msg in buffer (lockless)
 - print to atomic consoles (lockless)
 - use irq_work to wake console kthreads (lockless)
 - each kthread then tries to flush buffer to its own non-atomic console
   in non-atomic context."




^ permalink raw reply

* Re: [syzbot] [net?] KASAN: slab-use-after-free Read in fib_rules_lookup
From: Kuniyuki Iwashima @ 2026-06-16 17:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ido Schimmel, syzbot, davem, dsahern, horms, kuba, linux-kernel,
	netdev, pabeni, syzkaller-bugs
In-Reply-To: <CANn89iJ7S1op9FJeaEqdR0KDiPu08PbFP7CqJ8NLVRgcPt370A@mail.gmail.com>

On Tue, Jun 16, 2026 at 8:55 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Jun 16, 2026 at 8:31 AM Ido Schimmel <idosch@nvidia.com> wrote:
> >
> > On Tue, Jun 16, 2026 at 07:05:24AM -0700, syzbot wrote:
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit:    72dfa4700f78 net: dsa: sja1105: fix lastused timestamp in ..
> >
> > This includes commit 759923cf03b0 ("ipv4: fib: Convert
> > fib_net_exit_batch() to ->exit_rtnl().") that moved ip_fib_net_exit()
> > (and therefore fib4_rules_exit()) earlier in the netns dismantle path.
> >
> > Kuniyuki, can you please take a look?
> >
> > You can use this to reproduce:
> >
> > #!/bin/bash
> >
> > while true; do
> >         ip netns add ns1
> >         ip -n ns1 link set dev lo up
> >         ip -n ns1 address add 192.0.2.1/24 dev lo
> >         ip -n ns1 link add name dummy1 up type dummy
> >         ip -n ns1 address add 198.51.100.1/24 dev dummy1
> >         ip -n ns1 rule add ipproto tcp sport 12345 table 12345
> >         ip -n ns1 fou add port 5555 ipproto 47 local 192.0.2.1 peer 198.51.100.2 peer_port 54321
> >         ip netns del ns1
> > done
> >
>
> Oh right.
>
> While looking at this syzbot report I also found an old issue.
>
> https://lore.kernel.org/netdev/20260616141317.407791-1-edumazet@google.com/T/#u
>
> I guess adding some delays in enqueue_to_backlog() could trigger a
> similar bug even if we revert Kuniyuki's patch.

I'll look into it, thank you both !

>
>
>
>
> > Thanks
> >
> > > git tree:       net-next
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=15794bd2580000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=a0842261b62cdea8
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=965506b59a2de0b6905c
> > > compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
> > >
> > > Unfortunately, I don't have any reproducer for this issue yet.
> > >
> > > Downloadable assets:
> > > disk image: https://storage.googleapis.com/syzbot-assets/d4e16f50a97c/disk-72dfa470.raw.xz
> > > vmlinux: https://storage.googleapis.com/syzbot-assets/6cd4a736e796/vmlinux-72dfa470.xz
> > > kernel image: https://storage.googleapis.com/syzbot-assets/548b0011c8e8/bzImage-72dfa470.xz
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: syzbot+965506b59a2de0b6905c@syzkaller.appspotmail.com
> > >
> > > bond0 (unregistering): Released all slaves
> > > bond1 (unregistering): Released all slaves
> > > bond2 (unregistering): (slave dummy0): Releasing active interface
> > > bond2 (unregistering): Released all slaves
> > > ==================================================================
> > > BUG: KASAN: slab-use-after-free in fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
> > > Read of size 8 at addr ffff88804ec4c680 by task kworker/u8:21/12641
> > >
> > > CPU: 0 UID: 0 PID: 12641 Comm: kworker/u8:21 Not tainted syzkaller #0 PREEMPT(full)
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
> > > Workqueue: netns cleanup_net
> > > Call Trace:
> > >  <TASK>
> > >  dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
> > >  print_address_description+0x55/0x1e0 mm/kasan/report.c:378
> > >  print_report+0x58/0x70 mm/kasan/report.c:482
> > >  kasan_report+0x117/0x150 mm/kasan/report.c:595
> > >  fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
> > >  __fib_lookup+0x106/0x210 net/ipv4/fib_rules.c:96
> > >  ip_route_output_key_hash_rcu+0x294/0x2720 net/ipv4/route.c:2811
> > >  ip_route_output_key_hash+0x18d/0x2a0 net/ipv4/route.c:2702
> > >  __ip_route_output_key include/net/route.h:169 [inline]
> > >  ip_route_output_flow+0x2a/0x150 net/ipv4/route.c:2929
> > >  ip4_datagram_release_cb+0x89d/0xbe0 net/ipv4/datagram.c:118
> > >  release_sock+0x206/0x260 net/core/sock.c:3861
> > >  inet_shutdown+0x2b1/0x390 net/ipv4/af_inet.c:950
> > >  udp_tunnel_sock_release+0x6d/0x80 net/ipv4/udp_tunnel_core.c:197
> > >  fou_release net/ipv4/fou_core.c:562 [inline]
> > >  fou_exit_net+0x17d/0x1f0 net/ipv4/fou_core.c:1230
> > >  ops_exit_list net/core/net_namespace.c:199 [inline]
> > >  ops_undo_list+0x43d/0x8d0 net/core/net_namespace.c:252
> > >  cleanup_net+0x572/0x810 net/core/net_namespace.c:702
> > >  process_one_work kernel/workqueue.c:3314 [inline]
> > >  process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397
> > >  worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478
> > >  kthread+0x389/0x470 kernel/kthread.c:436
> > >  ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
> > >  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> > >  </TASK>
> > >
> > > Allocated by task 19121:
> > >  kasan_save_stack mm/kasan/common.c:57 [inline]
> > >  kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
> > >  poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
> > >  __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
> > >  kasan_kmalloc include/linux/kasan.h:263 [inline]
> > >  __do_kmalloc_node mm/slub.c:5296 [inline]
> > >  __kmalloc_node_track_caller_noprof+0x4d7/0x7b0 mm/slub.c:5408
> > >  kmemdup_noprof+0x2b/0x70 mm/util.c:138
> > >  kmemdup_noprof include/linux/fortify-string.h:763 [inline]
> > >  fib_rules_register+0x2f/0x400 net/core/fib_rules.c:170
> > >  fib4_rules_init+0x21/0x160 net/ipv4/fib_rules.c:508
> > >  ip_fib_net_init net/ipv4/fib_frontend.c:1578 [inline]
> > >  fib_net_init+0x17a/0x3e0 net/ipv4/fib_frontend.c:1628
> > >  ops_init+0x35d/0x5d0 net/core/net_namespace.c:137
> > >  setup_net+0x118/0x350 net/core/net_namespace.c:446
> > >  copy_net_ns+0x4f9/0x720 net/core/net_namespace.c:579
> > >  create_new_namespaces+0x3f0/0x6b0 kernel/nsproxy.c:132
> > >  unshare_nsproxy_namespaces+0x149/0x190 kernel/nsproxy.c:234
> > >  ksys_unshare+0x57d/0xa00 kernel/fork.c:3242
> > >  __do_sys_unshare kernel/fork.c:3316 [inline]
> > >  __se_sys_unshare kernel/fork.c:3314 [inline]
> > >  __x64_sys_unshare+0x38/0x50 kernel/fork.c:3314
> > >  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> > >  do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
> > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > >
> > > Freed by task 12641:
> > >  kasan_save_stack mm/kasan/common.c:57 [inline]
> > >  kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
> > >  kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:584
> > >  poison_slab_object mm/kasan/common.c:253 [inline]
> > >  __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
> > >  kasan_slab_free include/linux/kasan.h:235 [inline]
> > >  slab_free_hook mm/slub.c:2689 [inline]
> > >  __rcu_free_sheaf_prepare+0x12d/0x2a0 mm/slub.c:2940
> > >  rcu_free_sheaf+0x31/0x200 mm/slub.c:5850
> > >  rcu_do_batch kernel/rcu/tree.c:2617 [inline]
> > >  rcu_core+0x78b/0x10a0 kernel/rcu/tree.c:2869
> > >  handle_softirqs+0x225/0x840 kernel/softirq.c:622
> > >  do_softirq+0x76/0xd0 kernel/softirq.c:523
> > >  __local_bh_enable_ip+0xf8/0x130 kernel/softirq.c:450
> > >  unregister_netdevice_many_notify+0x1874/0x2150 net/core/dev.c:12445
> > >  ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
> > >  ops_undo_list+0x391/0x8d0 net/core/net_namespace.c:248
> > >  cleanup_net+0x572/0x810 net/core/net_namespace.c:702
> > >  process_one_work kernel/workqueue.c:3314 [inline]
> > >  process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397
> > >  worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478
> > >  kthread+0x389/0x470 kernel/kthread.c:436
> > >  ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
> > >  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> > >
> > > The buggy address belongs to the object at ffff88804ec4c600
> > >  which belongs to the cache kmalloc-192 of size 192
> > > The buggy address is located 128 bytes inside of
> > >  freed 192-byte region [ffff88804ec4c600, ffff88804ec4c6c0)
> > >
> > > The buggy address belongs to the physical page:
> > > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4ec4c
> > > flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
> > > page_type: f5(slab)
> > > raw: 00fff00000000000 ffff88813fe163c0 dead000000000100 dead000000000122
> > > raw: 0000000000000000 0000000800100010 00000000f5000000 0000000000000000
> > > page dumped because: kasan: bad access detected
> > > page_owner tracks the page as allocated
> > > page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 13856, tgid 13853 (syz.3.2144), ts 351172300879, free_ts 351133053454
> > >  set_page_owner include/linux/page_owner.h:32 [inline]
> > >  post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
> > >  prep_new_page mm/page_alloc.c:1861 [inline]
> > >  get_page_from_freelist+0x24ae/0x2530 mm/page_alloc.c:3941
> > >  __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
> > >  alloc_slab_page mm/slub.c:3278 [inline]
> > >  allocate_slab+0x77/0x660 mm/slub.c:3467
> > >  new_slab mm/slub.c:3525 [inline]
> > >  refill_objects+0x336/0x3d0 mm/slub.c:7272
> > >  refill_sheaf mm/slub.c:2816 [inline]
> > >  __pcs_replace_empty_main+0x320/0x720 mm/slub.c:4652
> > >  alloc_from_pcs mm/slub.c:4750 [inline]
> > >  slab_alloc_node mm/slub.c:4884 [inline]
> > >  __do_kmalloc_node mm/slub.c:5295 [inline]
> > >  __kmalloc_noprof+0x464/0x750 mm/slub.c:5308
> > >  kmalloc_noprof include/linux/slab.h:954 [inline]
> > >  kzalloc_noprof include/linux/slab.h:1188 [inline]
> > >  new_dir fs/proc/proc_sysctl.c:966 [inline]
> > >  get_subdir fs/proc/proc_sysctl.c:1010 [inline]
> > >  sysctl_mkdir_p fs/proc/proc_sysctl.c:1320 [inline]
> > >  __register_sysctl_table+0xc02/0x1370 fs/proc/proc_sysctl.c:1395
> > >  neigh_sysctl_register+0x9b1/0xa90 net/core/neighbour.c:3915
> > >  addrconf_sysctl_register+0xb3/0x1c0 net/ipv6/addrconf.c:7396
> > >  ipv6_add_dev+0xd26/0x13a0 net/ipv6/addrconf.c:460
> > >  addrconf_notify+0x771/0x1050 net/ipv6/addrconf.c:3679
> > >  notifier_call_chain+0x1a5/0x3d0 kernel/notifier.c:85
> > >  call_netdevice_notifiers_extack net/core/dev.c:2288 [inline]
> > >  call_netdevice_notifiers net/core/dev.c:2302 [inline]
> > >  register_netdevice+0x18db/0x1f00 net/core/dev.c:11474
> > >  macsec_newlink+0x706/0x1200 drivers/net/macsec.c:4218
> > >  rtnl_newlink_create+0x310/0xb00 net/core/rtnetlink.c:3905
> > > page last free pid 12657 tgid 12657 stack trace:
> > >  reset_page_owner include/linux/page_owner.h:25 [inline]
> > >  __free_pages_prepare mm/page_alloc.c:1397 [inline]
> > >  __free_frozen_pages+0xc0d/0xd20 mm/page_alloc.c:2938
> > >  __tlb_remove_table_free mm/mmu_gather.c:228 [inline]
> > >  tlb_remove_table_rcu+0x85/0x100 mm/mmu_gather.c:291
> > >  rcu_do_batch kernel/rcu/tree.c:2617 [inline]
> > >  rcu_core+0x78b/0x10a0 kernel/rcu/tree.c:2869
> > >  handle_softirqs+0x225/0x840 kernel/softirq.c:622
> > >  __do_softirq kernel/softirq.c:656 [inline]
> > >  invoke_softirq kernel/softirq.c:496 [inline]
> > >  __irq_exit_rcu+0xca/0x220 kernel/softirq.c:735
> > >  irq_exit_rcu+0x9/0x30 kernel/softirq.c:752
> > >  instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1061 [inline]
> > >  sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1061
> > >  asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697
> > >
> > > Memory state around the buggy address:
> > >  ffff88804ec4c580: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
> > >  ffff88804ec4c600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > >ffff88804ec4c680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> > >                    ^
> > >  ffff88804ec4c700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >  ffff88804ec4c780: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
> > > ==================================================================
> > >
> > >
> > > ---
> > > This report is generated by a bot. It may contain errors.
> > > See https://goo.gl/tpsmEJ for more information about syzbot.
> > > syzbot engineers can be reached at syzkaller@googlegroups.com.
> > >
> > > syzbot will keep track of this issue. See:
> > > https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> > >
> > > If the report is already addressed, let syzbot know by replying with:
> > > #syz fix: exact-commit-title
> > >
> > > If you want to overwrite report's subsystems, reply with:
> > > #syz set subsystems: new-subsystem
> > > (See the list of subsystem names on the web dashboard)
> > >
> > > If the report is a duplicate of another one, reply with:
> > > #syz dup: exact-subject-of-another-report
> > >
> > > If you want to undo deduplication, reply with:
> > > #syz undup

^ permalink raw reply

* Re: [PATCH bpf v2 1/2] bpf: Fix partial copy of non-linear test_run output
From: sun jian @ 2026-06-16 17:16 UTC (permalink / raw)
  To: Paul Chaignon
  Cc: bpf, netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo
In-Reply-To: <ajFQvedGURQuKqbX@mail.gmail.com>

On Tue, Jun 16, 2026 at 9:33 PM Paul Chaignon <paul.chaignon@gmail.com> wrote:
>
> On Tue, Jun 16, 2026 at 05:31:02PM +0800, Sun Jian wrote:
> > For non-linear test_run output, bpf_test_finish() derives the linear
> > data copy length from copy_size - frag_size. This only matches the
> > linear data length when copy_size is the full packet size.
> >
> > When userspace provides a short data_out buffer, copy_size is clamped to
> > that buffer size. If copy_size is smaller than frag_size, the computed
> > length becomes negative and bpf_test_finish() returns -ENOSPC before
> > copying the packet prefix or updating data_size_out.
> >
> > Compute the linear data length from the packet layout instead, and clamp
> > the linear copy length to copy_size. This preserves the expected
> > partial-copy semantics: return -ENOSPC, copy the packet prefix that fits
> > in data_out, and report the full packet length through data_size_out.
> >
> > Fixes: 7855e0db150ad ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature")
> > Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
> > ---
> >  net/bpf/test_run.c | 11 ++++-------
> >  1 file changed, 4 insertions(+), 7 deletions(-)
> >
> > diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> > index 2bc04feadfab..976e8fa31bc9 100644
> > --- a/net/bpf/test_run.c
> > +++ b/net/bpf/test_run.c
> > @@ -453,19 +453,16 @@ static int bpf_test_finish(const union bpf_attr *kattr,
> >       }
> >
> >       if (data_out) {
> > -             int len = sinfo ? copy_size - frag_size : copy_size;
> > -
> > -             if (len < 0) {
> > -                     err = -ENOSPC;
> > -                     goto out;
> > -             }
> > +             u32 head_len = size - frag_size;
> > +             u32 len = min(copy_size, head_len);
> >
> >               if (copy_to_user(data_out, data, len))
> >                       goto out;
> >
> >               if (sinfo) {
> > -                     int i, offset = len;
> > +                     u32 offset = len;
> >                       u32 data_len;
> > +                     int i;
>
> That doesn't look needed.
>
> >
> >                       for (i = 0; i < sinfo->nr_frags; i++) {
> >                               skb_frag_t *frag = &sinfo->frags[i];
> > --
> > 2.43.0
> >

Hi Paul,

Thanks for taking another look.

Agreed, I'll keep the fix patch minimal and leave offset as-is.

For the selftest patch, I'll try to reuse pkt_v4 and the existing TC
program where possible, and keep only the minimal XDP frags program for the
XDP case.

Thanks,
Sun Jian

^ permalink raw reply

* Re: [PATCH net-next V3 2/7] netdevsim: Register devlink after device init
From: Mark Bloch @ 2026-06-16 17:29 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Pirko
  Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller,
	Jonathan Corbet, Shuah Khan, Jiri Pirko, Simon Horman,
	Sunil Goutham, Linu Cherian, Geetha sowjanya, hariprasad,
	Subbaraya Sundeep, Bharat Bhushan, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Ethan Nelson-Moore, linux-doc,
	netdev, linux-rdma
In-Reply-To: <f266dfa5-0c6c-4be0-b73e-b2185dadd6a7@nvidia.com>



On 11/06/2026 20:43, Mark Bloch wrote:
> 
> 
> On 11/06/2026 18:54, Jakub Kicinski wrote:
>> On Thu, 11 Jun 2026 09:02:03 +0300 Mark Bloch wrote:
>>> On 11/06/2026 2:50, Jakub Kicinski wrote:
>>>> On Fri, 5 Jun 2026 21:10:25 +0300 Mark Bloch wrote:  
>>>>> devl_register() makes the devlink instance visible to userspace. A later
>>>>> patch also makes registration the point where devlink core may call
>>>>> eswitch_mode_set() to apply a boot-time default eswitch mode.
>>>>>
>>>>> Move netdevsim registration after all objects (resources, params, regions,
>>>>> traps, debugfs etc) are initialized, and after the initial eswitch mode is
>>>>> set to legacy.
>>>>>
>>>>> Move devl_unregister() to the beginning of nsim_drv_remove(), before those
>>>>> devlink objects are torn down. This keeps devlink register/unregister as
>>>>> the notification barrier and makes the later object teardown paths run
>>>>> after devlink is no longer registered, so they do not emit their own
>>>>> netlink DEL notifications.  
>>>>
>>>> This is going backwards. At some point someone from nVidia thought that
>>>> we can order our way out of locking, so mlx5 is likely ordered this way,
>>>> but this must not be required, or in any way normalized.
>>>> We (syzbot) quickly discovered that it doesn't cover all corner cases.
>>>> devl_lock() is exposed specifically to allow the driver to finish
>>>> whatever init it needs without letting user space invoke callbacks, yet.
>>>> Almost (?) all driver callbacks hold devl_lock(), so maybe the devlink
>>>> instance is "visible" to user space but that should not matter.  
>>>
>>> Let me clarify.
>>>
>>> No locking is changed here, and I don't want to make register/unregister
>>> ordering a substitute for devl_lock().
>>>
>>> The only requirement I have for this series is that devl_register() is called
>>> only once the driver is ready for devlink core to call eswitch_mode_set().
>>> That follows from the earlier direction to have the core apply the default
>>> mode from devl_register() instead of adding an explicit driver call.
>>
>> This is exactly what I'm objecting to. AFAIU we are trading off
>> explicit call to get the default value for an implicit behavior
>> depending on order of calls. We want to optimize for how easy it
>> is to get the API wrong, not for LoC.
> 
> Right, the reason I moved in this direction is that in v1 I had
> the explicit driver call, and Jiri asked to make this transparent
> from devlink core instead.
> 
>>
>> If we don't have a clean way to implement this without driver
>> changes let's add the explicit API to get the default value.
>> If driver doesn't call it schedule a work to go via the callback
>> once devl_lock() is dropped. That way drivers which care can optimize
>> themselves by reading the default value upfront. Drivers which don't 
>> care will work correctly, and there's no API call order trap.
> 
> The workqueue fallback is possible, but I think it makes the semantics
> more complicated.
> 
> We would need to track devlink instances which still need the default
> applied, and the worker would have to skip/remove them once handled.
> 
> More importantly, the worker can race with userspace setting the
> eswitch mode, so we would also need some state to tell whether the user
> already changed the mode. That feels more fragile than an explicit
> driver call.
> 
>>
>> Not ideal, but isn't that best we can do here?
>> I still have flashbacks of the fallout from the call ordering games, 
>> we have too many drivers to keep this straight...
> 
> That's why I started with the explicit call in the first place.
> 
> I can switch back to this model: drivers which support boot time eswitch
> defaults will opt in and call the helper once they are ready. This keeps
> the support explicit per driver and avoids making it depend on where
> devl_register() happens in the init path.
> 
> With that, devlink can tell at register time whether the instance supports
> boot time eswitch defaults. If the user configured a default for an instance
> whose driver did not opt in, devlink can write to dmesg from
> devl_register().
> 
> Not perfect, but at least the user gets a visible failure instead of the
> config being silently ignored.
> 
> Mark

Jakub, Jiri, any thoughts?

I think the explicit helper is the cleanest option here, without any
workqueue fallback inside devlink. It avoids depending on devl_register()
ordering, and makes the support explicit per driver.

Does that sound like an acceptable direction?

Mark

> 
>>
>>> So if the objection is to the commit message wording, I can fix that and drop
>>> the "notification barrier" language.
>>>
>>> For unregister, I can probably leave the old ordering as-is. I moved it only
>>> to mirror the register path, which felt cleaner, but it is not required for
>>> the default-mode change and as the lock is held I see no issue with doing
>>> that.
> 
> 


^ permalink raw reply

* [PATCH net] net: thunderbolt: Fix frags[] overflow by bounding frame_count
From: Maoyi Xie @ 2026-06-16 17:38 UTC (permalink / raw)
  To: Mika Westerberg, Yehezkel Bernat, Andrew Lunn, Jakub Kicinski,
	Paolo Abeni
  Cc: David S. Miller, Eric Dumazet, netdev, linux-kernel

tbnet_poll() assembles a multi-frame ThunderboltIP packet into one skb. The
first frame goes into the skb linear area and every further frame is added as
a page fragment.

	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
			page, hdr_size, frame_size,
			TBNET_RX_PAGE_SIZE - hdr_size);

A packet of frame_count frames therefore ends up with frame_count - 1
fragments. tbnet_check_frame() only bounds the peer-supplied frame_count to
TBNET_RING_SIZE / 4 (64), which is far above MAX_SKB_FRAGS (17 by default). A
peer that sends a packet of 19 or more small frames pushes nr_frags past
MAX_SKB_FRAGS, so skb_add_rx_frag() writes past skb_shinfo()->frags[] and
corrupts memory after the shared info.

Tighten the start-of-packet bound to MAX_SKB_FRAGS + 1 so a packet can never
produce more fragments than frags[] can hold. This matches the recent skb
frags overflow fixes in other receive paths, for example f0813bcd2d9d ("net:
wwan: t7xx: fix potential skb->frags overflow in RX path") and 600dc40554dc
("net: usb: cdc-phonet: fix skb frags[] overflow in rx_complete()").

Fixes: e69b6c02b4c3 ("net: Add support for networking over Thunderbolt cable")
Cc: stable@vger.kernel.org
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
---
Mika preferred the bound in tbnet_check_frame() over the nr_frags <
MAX_SKB_FRAGS guard in tbnet_poll() that I first floated on the list, so this
rejects the oversized packet up front. Reproduced under KASAN with a harness
that mirrors the per-frame skb_add_rx_frag() loop.

 drivers/net/thunderbolt/main.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/thunderbolt/main.c b/drivers/net/thunderbolt/main.c
index 7aae5d915a1e..ac016890646c 100644
--- a/drivers/net/thunderbolt/main.c
+++ b/drivers/net/thunderbolt/main.c
@@ -787,8 +787,12 @@ static bool tbnet_check_frame(struct tbnet *net, const struct tbnet_frame *tf,
 		return true;
 	}
 
-	/* Start of packet, validate the frame header */
-	if (frame_count == 0 || frame_count > TBNET_RING_SIZE / 4) {
+	/* Start of packet, validate the frame header. tbnet_poll() puts the
+	 * first frame in the skb linear area and every further frame in a page
+	 * fragment, so a packet may not span more than MAX_SKB_FRAGS + 1 frames
+	 * without overflowing skb_shinfo()->frags[].
+	 */
+	if (frame_count == 0 || frame_count > MAX_SKB_FRAGS + 1) {
 		net->stats.rx_length_errors++;
 		return false;
 	}
-- 
2.34.1


^ permalink raw reply related

