Netdev List
* Re: [PATCH net-next v7 0/4] net: rnpgbe: Add TX/RX and link status support
From: Simon Horman @ 2026-06-12 18:44 UTC (permalink / raw)
  To: Dong Yibo
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, danishanwar,
	vadim.fedorenko, u.kleine-koenig, linux-kernel, netdev, yaojun
In-Reply-To: <20260611100036.36370-1-dong100@mucse.com>

On Thu, Jun 11, 2026 at 06:00:32PM +0800, Dong Yibo wrote:
> Hi maintainers,
> 
> This patch series adds the packet transmission, reception, and link status
> management features to the RNPGBE driver, building upon the previously
> introduced mailbox communication and basic driver infrastructure.
> 
> The series introduces:
> - Msix/msi interrupt handling with NAPI support
> - TX path with scatter-gather DMA and completion handling
> - RX path with page pool buffer management
> - Link status monitoring and carrier management
> 
> These changes enable the RNPGBE driver to support basic tx/rx
> network operations.
> 
> Changelog:
> v6 -> v7:
> [patch 2/4]:
> 1. Fix 'frag_idx' error in rnpgbe_tx_map. (Sashiko-gemini)
> [patch 3/4]:
> 1. Fix skb leak in invalid size path in rnpgbe_clean_rx_irq.
>    (Sashiko-gemini)
> 2. Fix invalid size range check for rxdesc. (Sashiko-gemini)
> [patch 4/4]:
> 1. Fix 'data race on the reply payload'. (Sashiko-gemini)
> 2. Fix 'asymmetric behaviour' when report up/down. (andrew)
> 
> links:
> ---
> v1: https://lore.kernel.org/netdev/20260325091204.94015-1-dong100@mucse.com/
> v2: https://lore.kernel.org/netdev/20260403025713.527841-1-dong100@mucse.com/
> v3: https://lore.kernel.org/netdev/20260507081539.171844-1-dong100@mucse.com/
> v4: https://lore.kernel.org/netdev/20260526033539.164061-1-dong100@mucse.com/
> v5: https://lore.kernel.org/netdev/20260528023150.239532-1-dong100@mucse.com/
> v6: https://lore.kernel.org/netdev/20260604112750.769215-1-dong100@mucse.com/
> 
> Additional Notes:

Thanks for the update and the notes.

There is another round of AI-generated review of this patch-set available
on both https://sashiko.dev and https://netdev-ai.bots.linux.dev/sashiko/

I would appreciate it if you could look over that too, with a view to
addressing any issues that directly affect this patch.

...


* Re: [PATCH v2 2/5] binder: Make shrinker rely solely on per-VMA lock
From: Dave Hansen @ 2026-06-12 18:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka (SUSE), Alice Ryhl, Dave Hansen, linux-kernel,
	Andrew Morton, Arve Hjønnevåg, Carlos Llamas,
	Christian Brauner, David Ahern, David S. Miller,
	Greg Kroah-Hartman, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
	netdev, Shakeel Butt, Todd Kjos
In-Reply-To: <CAJuCfpFo_avdhpOviX7EsPqLgDJ3DfeGpth+yu1-ahfawqaSzw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 728 bytes --]

On 6/12/26 10:44, Suren Baghdasaryan wrote:
>> It's not impossible, but I do think it is irrelevant. Or at least that
>> the *VMA* is irrelevant in this case. binder_alloc_is_mapped()==false
>> means that the binder VMA is gone. It's not in the maple tree, and it's
>> not coming back. If a VMA is found, it's an impostor.
> Right, but before your change we were bailing out early. With your
> change we would be generating the traces and freeing the page. I think
> that's a functional change. Was that your intention?

Yeah, it was intentional.

I think the existing behavior is buggy. It also complicates the goal of
removing the mmap lock fallback. I've broken that behavior change out
into a separate patch. (attached here)

[-- Attachment #2: binder-impostor-fix.patch --]
[-- Type: text/x-patch, Size: 2462 bytes --]


tl;dr: Stop relying on VMA lookups to determine when to reclaim
pages. Instead, use binder-internal metadata.

== Background ==

Each 'struct binder_alloc' has one and only one place where it is
recorded as having been mapped. It can be munmap()'d. But after that,
binder_alloc_mmap_handler() will return errors for it being "already
mapped". So, binder mmap()s are a one-shot thing.

But, the original mmap() location is special even after munmap(). It
is still recorded in alloc->vm_start and never cleared out.
binder_alloc_free_page() continues to look up VMAs at that address.

== Problem ==

That leads to some suboptimal behavior. The moment an "impostor" VMA
is created at the old binder address, the shrinker function will
always hit the:

	if (vma && !binder_alloc_is_mapped(alloc))

case and LRU_SKIP all pages.

== Solution ==

Stop using the VMA to drive zapping decisions. Instead, use
binder_alloc_is_mapped().

== Discussion ==

Here's some pseudocode for how this behavior could be triggered:

	addr = mmap(..., len, binder_fd);
	// pages can be reclaimed
	munmap(addr, len);
	// pages can still be reclaimed
	mmap(addr, len, MAP_ANONYMOUS|MAP_PRIVATE, -1, ...);
	// Pages can no longer be reclaimed

There are plenty of ways the code could be restructured now
that it is less dependent on VMAs. But I've left that for future
patches.

---

 b/drivers/android/binder_alloc.c |   10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff -puN drivers/android/binder_alloc.c~binder-impostor-fix drivers/android/binder_alloc.c
--- a/drivers/android/binder_alloc.c~binder-impostor-fix	2026-06-12 10:46:06.704707233 -0700
+++ b/drivers/android/binder_alloc.c	2026-06-12 11:34:15.304460520 -0700
@@ -1164,14 +1164,6 @@ enum lru_status binder_alloc_free_page(s
 	if (!mutex_trylock(&alloc->mutex))
 		goto err_get_alloc_mutex_failed;
 
-	/*
-	 * Since a binder_alloc can only be mapped once, we ensure
-	 * the vma corresponds to this mapping by checking whether
-	 * the binder_alloc is still mapped.
-	 */
-	if (vma && !binder_alloc_is_mapped(alloc))
-		goto err_invalid_vma;
-
 	trace_binder_unmap_kernel_start(alloc, index);
 
 	page_to_free = alloc->pages[index];
@@ -1182,7 +1174,7 @@ enum lru_status binder_alloc_free_page(s
 	list_lru_isolate(lru, item);
 	spin_unlock(&lru->lock);
 
-	if (vma) {
+	if (binder_alloc_is_mapped(alloc)) {
 		trace_binder_unmap_user_start(alloc, index);
 
 		zap_vma_range(vma, page_addr, PAGE_SIZE);
_


* Re: [net-next 6/9] net: ethernet: ravb: Add callback for gPTP probe
From: Sergey Shtylyov @ 2026-06-12 18:48 UTC (permalink / raw)
  To: Niklas Söderlund, Paul Barker, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Richard Cochran,
	Geert Uytterhoeven, Magnus Damm, netdev, linux-renesas-soc,
	devicetree, linux-kernel
In-Reply-To: <20260610102432.3538432-7-niklas.soderlund+renesas@ragnatech.se>

On 6/10/26 1:24 PM, Niklas Söderlund wrote:

> Different generations of the RAVB IP have different needs when it probes
> the gPTP timer clock. Add a callback in the PTP information to allow
> each generation to probe its own way.
> 
> With this the last gPTP specific flag (gptp_ref_clk) can be removed.
> However the primary motivation for the change is to prepare for Gen4
> support, which compared to other generations with gPTP support does not
> have the clock as part of the IP itself.
> 
> Gen4 will not need to compute GTI value as it have no where to write it,

   Nowhere.

> as the gPTP clock is external. For this reason move the computation of
> it into the newly gPTP probe specific callbacks for the RAVB IP's that
> support it.
> 
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>

Reviewed-by: Sergey Shtylyov <sergei.shtylyov@gmail.com>

[...]

MBR, Sergey



* Re: [RFC] connectat()/bindat() or an alternative design
From: Cong Wang @ 2026-06-12 18:50 UTC (permalink / raw)
  To: John Ericson; +Cc: network dev, Li Chen
In-Reply-To: <455281ec-3ee1-4f27-989b-c239f0690d8b@app.fastmail.com>

On Wed, Jun 10, 2026 at 10:08:57PM -0400, John Ericson wrote:
> Hi Cong,
> 
> On Mon, Jun 8, 2026, at 3:45 PM, Cong Wang wrote:
> > Hi John,
> >
> > [...]
> >
> > Thanks for bringing this up.
> 
> Sure, thanks for replying to me!
> 
> > I have no doubt connectat()/bindat() helps closing TOCTOU for Unix
> > sockets. However, it would be nicer to describe your use case here,
> > especially what the problems are without it. This would help more to
> > justify your proposal here than just getting aligned with openat() or
> > BSD.
> >
> > Hope this helps.
> >
> > Regards,
> > Cong
> 
> Yeah, happy to talk about that. Hope this is not too long a reply!
> 
> First, for some background context, I am a developer of the Nix package
> manager. And this, plus my own personal taste, always has me thinking
> about ways we can run processes with fewer privileges. The
> no-ambient-authority capsicum/cloudabi/wasi/whatever dream has lived in
> my head rent-free for many years :). Now these days, with LLMs, it feels
> like these nice-to-have yak shaves of mine are finally worth dusting off
> and striking off the bucket list.
> 
> Also in recent months, we Nix developers have been putting a bunch of
> work into using more `openat2` and friends, and I have no doubt that we
> will continue down this path (even on Windows!). We aim to be an
> exemplar program for following the "always work relative to a file
> descriptor" discipline. It's good for security, but also makes for code
> that --- I believe --- is just more elegant and nicer to read.
> 
> ----
> 
> Nearer term use case: slightly less ugly long path socket opening in
> Nix:

"Nix needs it" is a much better justification than "BSD already has it".
:) So please add this to your patch description/cover letter.

> 
> If you look at [1] you can see a PR I've asked my coworker to draft to
> improve binding and connecting code to cope with longer file paths,
> something which does come up in practice when we are running multiple
> tests with multiple daemons in parallel.
> 
> Now, I think it is safe to say that this code was already quite complex,
> and in this patch only gets *more* complex. The current interfaces make
> supporting longer paths quite annoying. (Though, once we remove the
> `open` and switch to an `*at`-style interface in the wrapper (if macOS
> lets us), it will get less bad.)
> 
> So the first use case would be getting something nicer than the
> `/proc/self/fd/<N>` dance the linked code falls back to. It is good that
> `/proc/self/fd/<N>` exists for legacy code, but it is an unergonomic way
> to do file-descriptor-relative paths, and should be a fallback, never
> the first choice. A real fd parameter along with a regular path pointer
> would buy two concrete wins:
> 
> 1. A clean, separate file descriptor parameter, the way `openat` has one
>    --- rather than assembling a `/proc` path by hand.
> 
> 2. Normal `PATH_MAX` room for the real pathname, rather than cramming
>    `/proc/self/fd/<N>` (plus any residual path after it) into the small
>    `sun_path` field of `struct sockaddr_un`.
> 
> ----
> 
> Longer term use case: anonymous listening sockets, avoiding advertising
> sockets to potential clients using ambient authority mechanisms
> altogether:
> 
> Some more background: I think this whole business of listening
> unix sockets necessarily living in the file system is a bit silly, since
> there is nothing to put on disk --- it's just a mechanism to communicate
> to clients where they should connect. Now ostensibly, Linux agrees ---
> that is why Linux's *abstract* Unix domain sockets were created. But I
> really don't like this because we have just replaced one ambient
> authority contraption (the root filesystem) with another (the abstract
> socket name space in the network namespace). The problems with ambient
> authority remain all the same (and indeed, our experience with Nix has
> been that network namespace unsharing when you do want to do some
> outside world network access is much more work than filesystem namespace
> unsharing).

Indeed, it would be very hard to change, since it has been part of the UDS
API probably since day 1.

Just curious: any reason not to use TCP loopback here?

> 
> What I would really like to do is go further than what I proposed, and
> separate the binding of a unix socket from the placing in the file
> system.
> 
> Today, with only existing UAPIs, the closest you can get is a scratch
> path you pin with `O_PATH` and immediately unlink:
> 
>     /* server */
>     int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
>     struct sockaddr_un a = { .sun_family = AF_UNIX };
>     strcpy(a.sun_path, "/tmp/scratchXXXXXX");
>     bind(lfd, (struct sockaddr *)&a, sizeof a);

Any reason not to use an abstract socket?

       abstract
              an abstract socket address is distinguished (from a
              pathname socket) by the fact that sun_path[0] is a null
              byte ('\0').  The socket's address in this namespace is
              given by the additional bytes in sun_path that are covered
              by the specified length of the address structure.  (Null
              bytes in the name have no special significance.)  The name
              has no connection with filesystem pathnames.  When the
              address of an abstract socket is returned, the returned
              addrlen is greater than sizeof(sa_family_t) (i.e., greater
              than 2), and the name of the socket is contained in the
              first (addrlen - sizeof(sa_family_t)) bytes of sun_path.
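
For readers unfamiliar with the addressing rule quoted above, here is a minimal sketch of binding an abstract socket on Linux. The helper name `bind_abstract()` is hypothetical and chosen only for illustration; the calls themselves (`bind()`, `offsetof()`) are the standard ones. The address length passed to `bind()` covers the family field, the leading null byte, and the name bytes, with no trailing terminator.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper (illustration only): bind fd to the abstract
 * address "\0<name>". Returns 0 on success, -1 on error.
 */
static int bind_abstract(int fd, const char *name)
{
	struct sockaddr_un addr;
	size_t len = strlen(name);

	/* One byte of sun_path is taken by the leading null byte. */
	if (len > sizeof(addr.sun_path) - 1)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	/* sun_path[0] stays '\0', which selects the abstract namespace. */
	memcpy(addr.sun_path + 1, name, len);

	/* Length = family field + leading null byte + name bytes. */
	return bind(fd, (struct sockaddr *)&addr,
		    (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + len));
}
```

Nothing is created in the filesystem, so there is nothing to unlink. The name is, however, still reachable by any process in the same network namespace, which is the ambient-authority concern raised earlier in this thread.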


>     int addrfd = open(a.sun_path, O_PATH | O_CLOEXEC); /* pin the socket inode */
>     unlink(a.sun_path);                                /* nameless now */
>     listen(lfd, 64);
> 
>     /* client, handed `addrfd` -- but still has to *name* it, via /proc magic */
>     struct sockaddr_un c = { .sun_family = AF_UNIX };
>     sprintf(c.sun_path, "/proc/self/fd/%d", addrfd);
>     connect(cfd, (struct sockaddr *)&c, sizeof c);
> 
> So even though I hold the socket by descriptor, I still route a pathname
> (`/proc/self/fd/...`) to reach it, and I have to deal with the
> `/tmp/scratchXXXXXX` proper temp file usage.
> 
> What I'd actually want is to sidestep all those nuisances entirely.
> 
> The important piece is a `bind` variation: like binding an abstract unix
> socket, except that it publishes no abstract socket name, so the *only*
> way to connect to the socket is to be given an fd referring to it.
> 
> A matching `connect` variation is more of a nice-to-have: it lets a
> client connect straight through that fd, rather than having to name it
> via `/proc/self/fd` as above.
> 
> Put together:
> 
>     /* server */
>     int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
>     int addrfd = bind_anon(lfd, /*flags, for the future*/0);  /* proposed: no filesystem or abstract name */
>     listen(lfd, 64);
> 
>     /* client, handed `addrfd` -- connect straight to the descriptor */
>     connectat(addrfd, cfd, NULL, 0, AT_EMPTY_PATH);   /* proposed */
> 
> I would use this *a lot*! First of all, in our testing code, I would use
> this, and not even bother (on Linux at least) putting the test daemon
> socket on a (probably quite long) path; I would just rig up the test
> harness to pass the fd to the client process with an environment
> variable (local not global naming!) indicating to the process which file
> descriptor it should connect to.
> 
> If that sounds vaguely like systemd socket activation, yes it should.
> Socket activating *servers*, as we do today, is great, but I would also
> modify my init system to pass these listening sockets to *client*
> services. At that point, servers should ditch any sort of `getsockopt`
> authentication (which they are likely to implement incorrectly or in an
> ad-hoc manner), and instead rely on the init system to make sure only
> services/users which are authorized to connect to a given server have
> been given its listening socket file descriptor.
> 

Thanks,
Cong


* [PATCH net-next 0/2] selftests/vsock: improve vng version and quirk handling
From: Bobby Eshleman @ 2026-06-12 19:08 UTC (permalink / raw)
  To: Stefano Garzarella, Shuah Khan
  Cc: virtualization, netdev, linux-kselftest, linux-kernel,
	Bobby Eshleman

As vng has continued updating, there have been two things in our
selftests that have been affected. One is that newer versions always
emit the vng version warning, and two is that we have a workaround that
is not needed in newer versions.

This series just updates the version handling to allow all newer
versions without warning and version-gates the workaround to only those
versions that don't have the commit that fixed the root cause.

Additionally, we add a function for comparing major.minor versions, which
is used in both patches.
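
As an illustration of why the comparison has to be numeric rather than textual, here is a small C restatement of a MAJOR.MINOR "less than" check. It is only a sketch: the series implements its helper in bash inside vmtest.sh, and the parsing and error handling below (returning false for unparseable input) are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative only: returns true if MAJOR.MINOR version a is strictly
 * lower than b. Unparseable input is treated as "not lower".
 */
static bool version_lt(const char *a, const char *b)
{
	int amaj, amin, bmaj, bmin;

	if (sscanf(a, "%d.%d", &amaj, &amin) != 2 ||
	    sscanf(b, "%d.%d", &bmaj, &bmin) != 2)
		return false;

	/* The major number decides unless both are equal. */
	if (amaj != bmaj)
		return amaj < bmaj;

	/* Minor numbers compare as integers, so 4 < 36. */
	return amin < bmin;
}
```

With this, "1.4" sorts below "1.36", whereas a plain string comparison would order them the other way round because '4' sorts after '3'.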

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Bobby Eshleman (2):
      selftests/vsock: accept vng 1.33 or >= 1.36
      selftests/vsock: skip vng setsid workaround on >= 1.41

 tools/testing/selftests/vsock/vmtest.sh | 47 +++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 17 deletions(-)
---
base-commit: dfcc2ff12925d99e858eaf539eaa4aaaf81fe2a6
change-id: 20260612-vsock-test-update-fcae9ffced52

Best regards,
-- 
Bobby Eshleman <bobbyeshleman@meta.com>



* [PATCH net-next 1/2] selftests/vsock: accept vng 1.33 or >= 1.36
From: Bobby Eshleman @ 2026-06-12 19:08 UTC (permalink / raw)
  To: Stefano Garzarella, Shuah Khan
  Cc: virtualization, netdev, linux-kselftest, linux-kernel,
	Bobby Eshleman
In-Reply-To: <20260612-vsock-test-update-v1-0-7d7eeed3ac8f@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

The current vng version check uses a discrete allowlist of "1.33",
"1.36", and "1.37", which forces a script update on every new release
even though all post-1.36 releases work.

Replace the discrete list with: "1.33", or any version >= 1.36. 1.34
and 1.35 are skipped because they were not tested. Add a version_lt()
helper that compares MAJOR.MINOR numerically, so the check reads as a
straightforward version comparison.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 tools/testing/selftests/vsock/vmtest.sh | 39 +++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index d97913a6bdc7..ee69ac9dd3dc 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -330,27 +330,34 @@ check_netns() {
 	return 0
 }
 
+# Compare MAJOR.MINOR versions numerically. Returns 0 (true) if $1 < $2.
+version_lt() {
+	local -a a=(${1//./ })
+	local -a b=(${2//./ })
+
+	if [[ "${a[0]}" -lt "${b[0]}" ]]; then
+		return 0
+	elif [[ "${a[0]}" -gt "${b[0]}" ]]; then
+		return 1
+	elif [[ "${a[1]}" -lt "${b[1]}" ]]; then
+		return 0
+	fi
+
+	return 1
+}
+
 check_vng() {
-	local tested_versions
 	local version
-	local ok
 
-	tested_versions=("1.33" "1.36" "1.37")
-	version="$(vng --version)"
+	version="$(vng --version | awk '{print $2}')"
 
-	ok=0
-	for tv in "${tested_versions[@]}"; do
-		if [[ "${version}" == *"${tv}"* ]]; then
-			ok=1
-			break
-		fi
-	done
-
-	if [[ ! "${ok}" -eq 1 ]]; then
-		printf "warning: vng version '%s' has not been tested and may " "${version}" >&2
-		printf "not function properly.\n\tThe following versions have been tested: " >&2
-		echo "${tested_versions[@]}" >&2
+	# Supported: 1.33, or any version >= 1.36. 1.34 and 1.35 are untested.
+	if [[ "${version}" == "1.33" ]] || ! version_lt "${version}" "1.36"; then
+		return
 	fi
+
+	printf "warning: vng version '%s' has not been tested and may " "${version}" >&2
+	printf "not function properly.\n\tSupported: 1.33 or >= 1.36\n" >&2
 }
 
 check_socat() {

-- 
2.53.0-Meta



* [PATCH net-next 2/2] selftests/vsock: skip vng setsid workaround on >= 1.41
From: Bobby Eshleman @ 2026-06-12 19:08 UTC (permalink / raw)
  To: Stefano Garzarella, Shuah Khan
  Cc: virtualization, netdev, linux-kselftest, linux-kernel,
	Bobby Eshleman
In-Reply-To: <20260612-vsock-test-update-v1-0-7d7eeed3ac8f@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

virtme-ng 1.41 ships the upstream fix for the SIGTTOU hang
(https://github.com/arighi/virtme-ng/pull/453), so the setsid wrapper in
vng_dry_run() is no longer needed there. Gate the workaround on the vng
version: setsid is used for vng < 1.41, and vng is invoked directly on
>= 1.41.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 tools/testing/selftests/vsock/vmtest.sh | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index ee69ac9dd3dc..310dfc2a39ad 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -445,8 +445,14 @@ vng_dry_run() {
 	# stopped with SIGTTOU and hangs until kselftest's timer expires.
 	# setsid works around this by launching vng in a new session that has
 	# no controlling terminal, so tcsetattr() succeeds.
+	#
+	# Fixed in 1.41 (https://github.com/arighi/virtme-ng/pull/453).
 
-	setsid -w vng --run "$@" --dry-run &>/dev/null
+	if version_lt "$(vng --version | awk '{print $2}')" "1.41"; then
+		setsid -w vng --run "$@" --dry-run &>/dev/null
+	else
+		vng --run "$@" --dry-run &>/dev/null
+	fi
 }
 
 vm_start() {

-- 
2.53.0-Meta



* Re: [PATCH net 0/2] net/stmmac: Fixes for maximum TX/RX queues to use by driver
From: Simon Horman @ 2026-06-12 19:14 UTC (permalink / raw)
  To: Jakub Raczynski
  Cc: netdev, andrew+netdev, davem, edumazet, kuba, pabeni,
	mcoquelin.stm32, alexandre.torgue, linux-kernel, linux-arm-kernel,
	k.domagalski, k.tegowski
In-Reply-To: <20260611113358.3379518-1-j.raczynski@samsung.com>

On Thu, Jun 11, 2026 at 01:33:56PM +0200, Jakub Raczynski wrote:
> When contributing other changes preparing functions for new XGMAC hardware
> https://lore.kernel.org/netdev/20260601162537.553512-1-j.raczynski@samsung.com/
> there have been reports by Sashiko AI review about pre-existing issues
> in the code. These problems are not insignificant and are 'net' material fixes,
> rather than net-next features.
> One issue in this patchset was reported by Sashiko AI, while the other is
> technically part of the new patchset but is a reasonable related fix.
> All of the issues stem from wrong DTS configuration, but the kernel needs to
> handle it.
> 
> Jakub Raczynski (2):
>   net/stmmac: Apply TBS config only to used queues
>   net/stmmac: Apply MTL_MAX queue limit if config missing

For the series;

Reviewed-by: Simon Horman <horms@kernel.org>

FTR, there is an AI-generated review of this patch-set available on sashiko.dev.
However, I believe that feedback can be viewed in the context of possible
follow-up and should not impede the progress of this patchset.


* Re: [PATCH net-next v13 09/15] quic: add congestion control
From: Xin Long @ 2026-06-12 19:37 UTC (permalink / raw)
  To: network dev, quic
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
	Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
	Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
	Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
	Steve Dickson, Hannes Reinecke, Alexander Aring, David Howells,
	Matthieu Baerts, John Ericson, Cong Wang, D . Wythe, Jason Baron,
	illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
	Daniel Stenberg, Andy Gospodarek, mef, paul
In-Reply-To: <465489e5fafa9326b4a21c4851e420d344bbfdfe.1780855297.git.lucien.xin@gmail.com>

[sashiko-gemini]

> + case QUIC_CONG_CONGESTION_AVOIDANCE:
> + /* cong->window is never zero; it is initialized by
> + * quic_packet_route() during connect/accept.
> + */
> + cong->window += cong->mss * bytes / cong->window;
Can this arithmetic permanently stall congestion window growth or cause an
overflow? Since all variables are 32-bit integers, the division could
truncate to 0 as the window grows larger than cong->mss * bytes. Also,
if bytes is large due to coalesced ACKs, could the multiplication wrap
around U32_MAX?

This looks like a legitimate one; I will confirm and fix it.
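
For reference, both effects described in the review can be reproduced in userspace with plain 32-bit arithmetic. The sketch below is not the driver's code: the function names are hypothetical, the window is assumed to be non-zero (as the quoted comment states), and the 64-bit variant is just one possible way to avoid the wrap, not necessarily the fix that will be posted.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the quoted update, using 32-bit arithmetic
 * throughout. Assumes window != 0.
 */
static uint32_t cwnd_step_u32(uint32_t window, uint32_t mss, uint32_t bytes)
{
	/* mss * bytes is computed modulo 2^32 and may wrap. */
	return window + mss * bytes / window;
}

/* Same update with the product widened to 64 bits before dividing. */
static uint32_t cwnd_step_u64(uint32_t window, uint32_t mss, uint32_t bytes)
{
	return window + (uint32_t)((uint64_t)mss * bytes / window);
}
```

With mss = 1200 and bytes = 1200 the product is 1440000, so any window above that value gets an increment of 0 from the integer division. With bytes = 4000000 the 32-bit product wraps, and the two variants return different results for the same inputs. Widening the product only addresses the wrap; the truncation to zero would need a separate remedy, such as accumulating acknowledged bytes across calls.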

Thanks.


* Re: [PATCH net-next v1] e1000: Initialize phy_data to avoid unexpected values
From: Andrew Lunn @ 2026-06-12 19:39 UTC (permalink / raw)
  To: Rongguang Wei
  Cc: przemyslaw.kitszel, anthony.l.nguyen, netdev, intel-wired-lan,
	Rongguang Wei
In-Reply-To: <20260612080331.120096-1-clementwei90@163.com>

On Fri, Jun 12, 2026 at 04:03:31PM +0800, Rongguang Wei wrote:
> From: Rongguang Wei <weirongguang@kylinos.cn>
> 
> The phy_data variable is not initialized. If e1000_read_phy_reg
> returns an error, phy_data will not point to a valid value from
> the PHY register, which may cause the regs_buff array to be populated
> with unexpected values.
> 
> Signed-off-by: Rongguang Wei <weirongguang@kylinos.cn>
> Change-Id: I46071b3b21a566f8da650168d38d6968251b077d

What does this Change-Id mean?

> ---
>  drivers/net/ethernet/intel/e1000/e1000_ethtool.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_ethtool.c b/drivers/net/ethernet/intel/e1000/e1000_ethtool.c
> index 4dcbeabb3ad2..f068108c5004 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_ethtool.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_ethtool.c
> @@ -327,7 +327,7 @@ static void e1000_get_regs(struct net_device *netdev, struct ethtool_regs *regs,
>  	struct e1000_adapter *adapter = netdev_priv(netdev);
>  	struct e1000_hw *hw = &adapter->hw;
>  	u32 *regs_buff = p;
> -	u16 phy_data;
> +	u16 phy_data = 0;

	if (hw->phy_type == e1000_phy_igp) {
		e1000_write_phy_reg(hw, IGP01E1000_PHY_PAGE_SELECT,
				    IGP01E1000_PHY_AGC_A);
		e1000_read_phy_reg(hw, IGP01E1000_PHY_AGC_A &
				   IGP01E1000_PHY_PAGE_SELECT, &phy_data);
		regs_buff[13] = (u32)phy_data; /* cable length */

Isn't a cable length of 0 also unexpected?

How does this patch actually make the situation better?

    
    Andrew

---
pw-bot: cr


* Re: [PATCH net-next v13 04/15] quic: provide family ops for address and protocol
From: Xin Long @ 2026-06-12 19:41 UTC (permalink / raw)
  To: network dev, quic
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
	Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
	Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
	Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
	Steve Dickson, Hannes Reinecke, Alexander Aring, David Howells,
	Matthieu Baerts, John Ericson, Cong Wang, D . Wythe, Jason Baron,
	illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
	Daniel Stenberg, Andy Gospodarek, mef, paul
In-Reply-To: <e73d8f69b67fdeeec076fbe772ca465d7e1af4cb.1780855297.git.lucien.xin@gmail.com>

[sashiko-claude]

> +static void quic_v4_lower_xmit(struct sock *sk, struct sk_buff *skb,
> +        struct flowi *fl)
> +{
> + struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
> + u8 tos = (inet_sk(sk)->tos | cb->ecn), ttl;
> + struct flowi4 *fl4 = &fl->u.ip4;
> + struct dst_entry *dst;
> + __be16 df = 0;
> +
> + pr_debug("%s: skb: %p, len: %d, num: %llu, %pI4:%d -> %pI4:%d\n",
> + __func__, skb, skb->len, cb->number, &fl4->saddr,
> + ntohs(fl4->fl4_sport), &fl4->daddr, ntohs(fl4->fl4_dport));
Should cb->number use %lld here rather than %llu? Looking at struct
quic_skb_cb in net/quic/common.h:
    s64 number;  /* Parsed packet number, or the largest previously seen */
Since number is s64, a negative initial value (such as -1 for "largest
previously seen") would be printed as 18446744073709551615 with %llu, and
compilers with -Wformat-signedness flag the mismatch.
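
The difference is easy to see in userspace. The following sketch uses the C99 `PRIu64`/`PRId64` macros in place of the kernel's `%llu`/`%lld`, and the two helper names are hypothetical; the explicit cast in the unsigned variant only makes the reinterpretation well-defined for this demonstration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Formats a signed 64-bit value as if it were unsigned (the %llu case). */
static int fmt_as_unsigned(char *buf, size_t len, int64_t number)
{
	return snprintf(buf, len, "%" PRIu64, (uint64_t)number);
}

/* Formats the same value with the matching signed conversion (%lld). */
static int fmt_as_signed(char *buf, size_t len, int64_t number)
{
	return snprintf(buf, len, "%" PRId64, number);
}
```

For a value of -1, the unsigned variant produces 18446744073709551615 while the signed variant produces -1; non-negative values format identically with both.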
[ ... ]
> +static void quic_v6_lower_xmit(struct sock *sk, struct sk_buff *skb,
> +        struct flowi *fl)
> +{
> + struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
> + u8 tc = (inet6_sk(sk)->tclass | cb->ecn), ttl;
> + struct flowi6 *fl6 = &fl->u.ip6;
> + struct dst_entry *dst;
> + __be32 label;
> +
> + pr_debug("%s: skb: %p, len: %d, num: %llu, %pI6c:%d -> %pI6c:%d\n",
> + __func__, skb, skb->len, cb->number, &fl6->saddr,
> + ntohs(fl6->fl6_sport), &fl6->daddr, ntohs(fl6->fl6_dport));
Same question about cb->number being printed with %llu while the field is
declared s64.

A minor one but legit, will fix it.

Thanks.


* Re: [PATCH net-next v13 06/15] quic: add stream management
From: Xin Long @ 2026-06-12 19:43 UTC (permalink / raw)
  To: network dev, quic
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
	Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
	Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
	Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
	Steve Dickson, Hannes Reinecke, Alexander Aring, David Howells,
	Matthieu Baerts, John Ericson, Cong Wang, D . Wythe, Jason Baron,
	illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
	Daniel Stenberg, Andy Gospodarek, mef, paul
In-Reply-To: <1b5480358209643e574188ccba83bc9ef0a84aee.1780855297.git.lucien.xin@gmail.com>

[sashiko-claude]

> +/* Get or create a send or recv stream by ID. Requires sock lock held. */
> +struct quic_stream *quic_stream_get(struct quic_stream_table *streams,
> + s64 stream_id, u32 flags, bool is_serv,
> + bool send)
> +{
> + struct quic_stream *stream;
> +
> + if (!quic_stream_id_valid(stream_id, is_serv, send))
> + return ERR_PTR(-EINVAL);
> +
> + stream = quic_stream_find(streams, stream_id);
> + if (stream)
> + return stream;
> +
> + if (!send && quic_stream_id_local(stream_id, is_serv)) {
> + if (quic_stream_id_closed(streams, stream_id, !send))
> + return ERR_PTR(-ENOSTR);
> + return ERR_PTR(-EINVAL);
> + }
The recv path here rejects stream IDs whose initiator does not match the
caller's role (a server cannot accept a SERVER_BIDI on the recv path
without it being closed/already-open). Should the send path have the
symmetric guard for peer-initiated bidi stream IDs?
Walking through quic_stream_get() with is_serv=true, send=true,
stream_id=4 (CLIENT_BIDI), flags=MSG_QUIC_STREAM_NEW:
- quic_stream_id_valid(4, true, true) only rejects CLIENT_UNI on the
server send path, so CLIENT_BIDI passes.
- The !send && local block is skipped because send is true.
- quic_stream_id_closed() compares 4 < next_bidi_stream_id (initially
SERVER_BIDI = 1) and returns false.
- quic_stream_id_exceeds() can pass when the peer's max_streams_bidi
is non-trivial.
Control then reaches quic_stream_create():
> + stream_id = limits->next_bidi_stream_id;
> + if (quic_stream_id_uni(max_stream_id))
> + stream_id = limits->next_uni_stream_id;
For a non-uni max_stream_id, stream_id is initialised from
streams->send.next_bidi_stream_id (SERVER_BIDI = 1), so the loop creates
a stream with id=1 instead of the requested id=4, increments
streams->send.streams_bidi as if locally initiated, and advances
streams->send.next_bidi_stream_id past 4.
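
The walk-through relies on the stream ID encoding from RFC 9000, section 2.1: the lowest bit identifies the initiator (0 for client, 1 for server) and the next bit distinguishes bidirectional (0) from unidirectional (1) streams. A minimal sketch of that classification follows, with a send-side guard of the kind discussed here. All names are hypothetical and do not correspond to the helpers in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stream type is carried in the two low bits of the stream ID. */
enum {
	STREAM_CLIENT_BIDI = 0,
	STREAM_SERVER_BIDI = 1,
	STREAM_CLIENT_UNI  = 2,
	STREAM_SERVER_UNI  = 3,
};

static int stream_type(int64_t id)
{
	return (int)(id & 0x3);
}

/* True when the ID was initiated by the endpoint described by is_serv. */
static bool stream_is_local(int64_t id, bool is_serv)
{
	return (id & 0x1) == (is_serv ? 1 : 0);
}

/*
 * Hypothetical send-side guard: a stream that does not exist yet may
 * only be created for sending on a locally initiated ID.
 */
static bool may_create_for_send(int64_t id, bool is_serv)
{
	return stream_is_local(id, is_serv);
}
```

Under this encoding, stream 4 is a client-initiated bidirectional stream, so a server-side request to create it for sending would be rejected by such a guard, while stream 1 would be accepted.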


This is a good catch, will add quic_stream_id_local() for send.

Thanks.


* Re: [PATCH v2 2/5] binder: Make shrinker rely solely on per-VMA lock
From: Alice Ryhl @ 2026-06-12 19:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Suren Baghdasaryan, Vlastimil Babka (SUSE), Dave Hansen,
	linux-kernel, Andrew Morton, Arve Hjønnevåg,
	Carlos Llamas, Christian Brauner, David Ahern, David S. Miller,
	Greg Kroah-Hartman, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
	netdev, Shakeel Butt, Todd Kjos
In-Reply-To: <2da031dd-4442-45b7-9515-72ffc60e8d8c@intel.com>

On Fri, Jun 12, 2026 at 11:47:59AM -0700, Dave Hansen wrote:
> On 6/12/26 10:44, Suren Baghdasaryan wrote:
> >> It's not impossible, but I do think it is irrelevant. Or at least that
> >> the *VMA* is irrelevant in this case. binder_alloc_is_mapped()==false
> >> means that the binder VMA is gone. It's not in the maple tree, and it's
> >> not coming back. If a VMA is found, it's an impostor.
> > Right, but before your change we were bailing out early. With your
> > change we would be generating the traces and freeing the page. I think
> > that's a functional change. Was that your intention?
> 
> Yeah, it was intentional.
> 
> I think the existing behavior is buggy. It also complicates the goal of
> removing the mmap lock fallback. I've broken that behavior change out
> into a separate patch. (attached here)

I think you can just:

1. do a lock_vma_under_rcu().
2. if it fails, check binder_alloc_is_mapped().
3. if still mapped, return LRU_SKIP, otherwise behave like a failed
   vma_lookup() does today under the mmap read lock.

Or you can even skip steps 2 and 3 and treat a failed lock_vma_under_rcu()
as LRU_SKIP, because processes that unmap their Binder vma without
immediately closing the fd (freeing all the pages) do not really exist
in practice.
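
To make the two options above concrete, here is a minimal userspace sketch
of the decision flow. It is only a model: the struct, the enum and both
helper names are hypothetical stand-ins, and the two booleans merely
represent the results that lock_vma_under_rcu() and
binder_alloc_is_mapped() would give in the real shrinker callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes; only MODEL_SKIP maps directly to LRU_SKIP. */
enum model_outcome {
	MODEL_SKIP,			/* leave the page on the LRU for now */
	MODEL_RECLAIM_WITH_VMA,		/* per-VMA lock held, reclaim normally */
	MODEL_RECLAIM_NO_VMA,		/* behave as a failed vma_lookup() does */
};

struct alloc_model {
	bool vma_lockable;	/* stand-in: lock_vma_under_rcu() succeeded */
	bool mapped;		/* stand-in: binder_alloc_is_mapped() is true */
};

/* Steps 1-3: try the per-VMA lock, then fall back to the mapped check. */
static enum model_outcome shrink_decide(const struct alloc_model *a)
{
	assert(a != NULL);

	if (a->vma_lockable)
		return MODEL_RECLAIM_WITH_VMA;
	if (a->mapped)
		return MODEL_SKIP;	/* VMA exists but could not be locked */
	return MODEL_RECLAIM_NO_VMA;	/* binder VMA is gone for good */
}

/* Simpler variant: every lock failure is treated as a skip. */
static enum model_outcome shrink_decide_simple(const struct alloc_model *a)
{
	assert(a != NULL);

	return a->vma_lockable ? MODEL_RECLAIM_WITH_VMA : MODEL_SKIP;
}
```

The simpler variant gives up on reclaiming pages of an unmapped but still
open Binder fd, which is the case argued above to be rare in practice.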

Alice

* Re: [PATCH net-next 0/3] Introduce HSR/PRP HW offload support for PRU-ICSSM Ethernet driver
From: Simon Horman @ 2026-06-12 20:01 UTC (permalink / raw)
  To: Parvathi Pudi
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, danishanwar, rogerq,
	pmohan, afd, basharath, arnd, linux-kernel, netdev,
	linux-arm-kernel, pratheesh, j-rameshbabu, vigneshr, praneeth,
	srk, rogerq, m-malladi, krishna, mohan
In-Reply-To: <20260611123636.376577-1-parvathi@couthit.com>

On Thu, Jun 11, 2026 at 06:03:25PM +0530, Parvathi Pudi wrote:
> Hi,
> 
> This series introduces HSR and PRP protocol HW offload support for
> ICSSM-Prueth driver.  HW offload support for HSR/PRP is implemented using
> dedicated HSR/PRP firmware running on 2 PRU cores(PRU-ICSS) as a "DAN"
> available in AM57xx, AM437x and AM335x.

Hi Parvathi,

There is AI-generated review of this patch-set available on
https://sashiko.dev

I would appreciate it if you could look over that with a view
to addressing any issues that directly affect this patch-set.

* Re: [PATCH net-next v13 07/15] quic: add connection id management
From: Xin Long @ 2026-06-12 20:02 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, quic, davem, kuba, edumazet, horms, metze, mbuhl,
	tfanelli, hepengtao, dreibh, linux-cifs, smfrench, linkinjeon,
	tom, kernel-tls-handshake, chuck.lever, jlayton, steved, hare,
	aahringo, dhowells, matttbe, mail, xiyou.wangcong, alibuda,
	jbaron, illiliti, sd, marcelo.leitner, daniel, andrew.gospodarek,
	mef, paul
In-Reply-To: <20260611081928.137149-1-pabeni@redhat.com>

On Thu, Jun 11, 2026 at 4:19 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> From: AI Reviewer <ai@example.com>
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://netdev-ai.bots.linux.dev/sashiko/
> ---
> Now I have enough information to produce the report.
>
> commit e00266d0eea858a744fd827548eb8ebc6868f92c
> Author: Xin Long <lucien.xin@gmail.com>
>
> quic: add connection id management
>
> This commit introduces struct quic_conn_id_set for managing QUIC
> Connection IDs along with helpers to add, remove, and look up source
> and destination CIDs, including a global hash table for source CID
> lookup on the RX path.
>
> > diff --git a/net/quic/connid.c b/net/quic/connid.c
> > new file mode 100644
> > index 000000000000..b1a39feb827e
> > --- /dev/null
> > +++ b/net/quic/connid.c
> > @@ -0,0 +1,256 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/* QUIC kernel implementation
> > + * (C) Copyright Red Hat Corp. 2023
> > + *
> > + * This file is part of the QUIC kernel implementation
> > + *
> > + * Initialization/cleanup for QUIC protocol support.
> > + *
> > + * Written or modified by:
> > + *    Xin Long <lucien.xin@gmail.com>
> > + */
> > +
> > +#include <linux/quic.h>
> > +#include <net/sock.h>
> > +
> > +#include "common.h"
> > +#include "connid.h"
> > +
> > +/* Lookup a source connection ID (scid) in the global source connection ID hash
> > + * table.
> > + */
> > +struct quic_conn_id *quic_conn_id_lookup(struct net *net, u8 *scid, u32 len)
> > +{
> > +     struct quic_shash_head *head = quic_source_conn_id_head(net, scid, len);
> > +     struct quic_source_conn_id *s_conn_id;
> > +     struct quic_conn_id *conn_id = NULL;
> > +     struct hlist_nulls_node *node;
> > +
> > +     hlist_nulls_for_each_entry_rcu(s_conn_id, node, &head->head, node) {
> > +             if (net != sock_net(s_conn_id->sk))
> > +                     continue;
> > +             if (s_conn_id->common.id.len != len ||
> > +                 memcmp(scid, &s_conn_id->common.id.data, len))
> > +                     continue;
> > +             if (likely(refcount_inc_not_zero(&s_conn_id->sk->sk_refcnt)))
> > +                     conn_id = &s_conn_id->common.id;
> > +             break;
> > +     }
>
> [Critical]
> Can s_conn_id->sk be freed while this RCU walk is running?  The
> s_conn_id itself is freed via call_rcu() in quic_source_conn_id_free(),
> but the struct sock it points to does not appear to be RCU-protected.
>
> A grep across this series for SOCK_RCU_FREE and SLAB_TYPESAFE_BY_RCU
> under net/quic returns no matches, and quic_init_sock() does not call
> sock_set_flag(sk, SOCK_RCU_FREE), so sk_destruct() will free the sock
> synchronously rather than after a grace period.
>
> Consider this sequence:
>
>   CPU0 (RX path, rcu_read_lock held)        CPU1 (close path)
>   hlist_nulls_for_each_entry_rcu(..)
>     /* sees s_conn_id still linked */
>                                             sk_destruct()
>                                               quic_destroy_sock()
>                                                 quic_conn_id_set_free()
>                                                   /* unhash + call_rcu
>                                                      defers s_conn_id
>                                                      free, but not sk */
>                                               sk_prot_free()  /* sk gone */
>     sock_net(s_conn_id->sk)         /* UAF read of freed sock */
>     refcount_inc_not_zero(
>         &s_conn_id->sk->sk_refcnt)  /* UAF refcount op */
>
> Once quic_packet_get_sock() (in the later 'quic: add packet parser
> base' patch) wires this lookup into every short- and long-header
> packet, an incoming packet racing with close() on the matching socket
> would dereference freed sock memory.
>
> Would setting SOCK_RCU_FREE on the QUIC sockets, switching the proto
> slabs to SLAB_TYPESAFE_BY_RCU, taking sock_hold() when the sk pointer
> is stored in s_conn_id, or otherwise gating the sk free behind an RCU
> grace period address this?
>

This is not an issue, as sock_set_flag(sk, SOCK_RCU_FREE) will be called
in quic_hash(), which will be added in patchset 2.


> > +
> > +/* Check if a given stateless reset token exists in any connection ID in the
> > + * connection ID set.
> > + */
> > +bool quic_conn_id_token_exists(struct quic_conn_id_set *id_set, u8 *token)
> > +{
> > +     struct quic_common_conn_id *common;
> > +     struct quic_dest_conn_id *dcid;
> > +
> > +     dcid = (struct quic_dest_conn_id *)id_set->active;
> > +     if (memchr_inv(dcid->token, 0, QUIC_CONN_ID_TOKEN_LEN) &&
> > +         !memcmp(dcid->token, token, QUIC_CONN_ID_TOKEN_LEN))
> > +             return true; /* Fast path. */
> > +
> > +     list_for_each_entry(common, &id_set->head, list) {
> > +             dcid = (struct quic_dest_conn_id *)common;
> > +             if (common == id_set->active)
> > +                     continue;
> > +             if (memchr_inv(dcid->token, 0, QUIC_CONN_ID_TOKEN_LEN) &&
> > +                 !memcmp(dcid->token, token, QUIC_CONN_ID_TOKEN_LEN))
> > +                     return true;
> > +     }
> > +     return false;
> > +}
>
> [Medium]
> Is memcmp() the right comparison primitive for the stateless reset
> token here?  RFC 9000 section 10.3 treats the token as a secret that
> authorises tearing down a connection, and the commit message states
> this helper is used during stateless reset processing on incoming
> packets.
>
> lib/string.c's memcmp() returns at the first differing byte (and
> performs unaligned word-sized loads on architectures that allow it),
> so the time taken to return false leaks how many leading bytes of the
> supplied token match a stored token.  Would crypto_memneq() from
> <crypto/algapi.h>, which is constant-time, be more appropriate for
> comparing the token bytes?
>

OK, I will switch to crypto_memneq() here and in 4 similar places.
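
For reference, the idea behind a constant-time comparison can be sketched
in plain userspace C as below. This is an illustration only: ct_memneq()
is a hypothetical name and is not the kernel's crypto_memneq()
implementation; it just shows the "no early exit" property the review
asks for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the buffers differ and 0 if they are equal. Every byte is
 * always examined, so the running time does not depend on where the first
 * mismatch occurs.
 */
static int ct_memneq(const void *a, const void *b, size_t len)
{
	const volatile uint8_t *pa = a;
	const volatile uint8_t *pb = b;
	uint8_t diff = 0;
	size_t i;

	assert(a != NULL && b != NULL);

	/* OR together the XOR of each byte pair; no branch on the data. */
	for (i = 0; i < len; i++)
		diff |= (uint8_t)(pa[i] ^ pb[i]);

	return diff != 0;
}
```

Note the inverted sense compared with memcmp(): a token matches when the
helper returns 0, so a check written as !memcmp(...) becomes
!ct_memneq(...) in this model.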

> > +
> [ ... ]
> > +/* Add a connection ID with sequence number and associated private data to the
> > + * connection ID set.
> > + */
> > +int quic_conn_id_add(struct quic_conn_id_set *id_set,
> > +                  struct quic_conn_id *conn_id, u32 number, void *data)
> > +{
> [ ... ]
> > +     /* Locate insertion point to keep list ordered by number. */
> > +     list = &id_set->head;
> > +     list_for_each_entry(common, list, list) {
> > +             if (number == common->number)
> > +                     return 0; /* Ignore if it already exists on the list. */
> > +             if (number < common->number) {
> > +                     list = &common->list;
> > +                     break;
> > +             }
> > +     }
> [ ... ]
> > +     list_add_tail(&common->list, list);
> > +
> > +     if (number == quic_conn_id_last_number(id_set) + 1) {
> > +             if (!id_set->active)
> > +                     id_set->active = common;
> > +             id_set->count++;
> > +
> > +             /* Increment count for consecutive following IDs. */
> > +             list_for_each_entry_continue(common, &id_set->head, list) {
> > +                     if (common->number != ++number)
> > +                             break;
> > +                     id_set->count++;
> > +             }
> > +     }
> > +     return 0;
> > +}
>
> [Medium]
> Does this function correctly maintain id_set->count when a CID is
> inserted before all existing entries?  The loop above explicitly
> handles the "number < common->number" case, suggesting out-of-order
> insertion is intended to be supported.
>
> Consider starting from list=[5], count=1 (so first=5, last=5) and
> calling quic_conn_id_add(..., number=3, ...):
>
>   - The location loop sets list = &common(5)->list and breaks.
>   - list_add_tail() inserts common(3) before common(5),
>     yielding list=[3, 5].
>   - The post-insert check evaluates:
>         quic_conn_id_last_number(id_set)
>             = quic_conn_id_first_number(id_set) + count - 1
>             = 3 + 1 - 1
>             = 3
>     so the test "number == last + 1" becomes "3 == 4" and the count
>     update branch is skipped.
>
> The end state is list=[3, 5] with count still 1, so entry 5 is
> silently uncounted.
This is expected: count only counts contiguous connection IDs.
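
The counting rule can be illustrated with a small userspace model. This
is a sketch under stated assumptions, not the patch code: a sorted array
replaces the kernel list, and id_set_model, model_last_number() and
model_add() are hypothetical names. As in the trace quoted above, the
last number is taken to be first + count - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_IDS 8

struct id_set_model {
	uint32_t numbers[MODEL_MAX_IDS];	/* sorted sequence numbers */
	size_t entries;				/* entries physically stored */
	uint32_t count;				/* contiguous run from the first */
};

/* Assumes the set already holds at least one entry. */
static uint32_t model_last_number(const struct id_set_model *s)
{
	return s->numbers[0] + s->count - 1;
}

/* Insert in order, then grow count only if number extends the run. */
static void model_add(struct id_set_model *s, uint32_t number)
{
	size_t pos = 0, i;

	assert(s->entries > 0 && s->entries < MODEL_MAX_IDS);

	while (pos < s->entries && s->numbers[pos] < number)
		pos++;
	if (pos < s->entries && s->numbers[pos] == number)
		return;	/* already present */
	for (i = s->entries; i > pos; i--)
		s->numbers[i] = s->numbers[i - 1];
	s->numbers[pos] = number;
	s->entries++;

	if (number != model_last_number(s) + 1)
		return;	/* gap: stored, but not counted yet */
	s->count++;
	/* Count entries that directly follow the new number. */
	for (i = pos + 1; i < s->entries && s->numbers[i] == ++number; i++)
		s->count++;
}
```

Starting from [5] with count 1, adding 3 leaves count at 1 because 5 is
not contiguous with 3; adding 4 afterwards closes the gap and count
becomes 3.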

> A subsequent quic_conn_id_remove(id_set, 3)
> then trips the WARN_ON_ONCE(number >= last_number) because
> last_number reads as 3.
>

This will never happen; the callers prevent the last seqno from being
passed to quic_conn_id_remove().

Thanks.

> Should the count adjustment also handle the case where the inserted
> CID lowers first_number?
>
> > +
> > +/* Remove consecutive connection IDs from the set with sequence numbers less
> > + * than or equal to a number.
> > + */
> > +void quic_conn_id_remove(struct quic_conn_id_set *id_set, u32 number)
> [ ... ]
> --
> This is an AI-generated review.
>

* Re: [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
From: Cong Wang @ 2026-06-12 20:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Cong Wang, Jakub Kicinski, Network Development, bpf,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Hemanth Malla,
	zijianzhang
In-Reply-To: <CAADnVQLzEopmoR7vdTnno25gCqT99ja3jPaMGUnW+edg9P9Sew@mail.gmail.com>

On Fri, Jun 12, 2026 at 11:34:25AM -0700, Alexei Starovoitov wrote:
> On Fri, Jun 12, 2026 at 11:12 AM Cong Wang <cwang@multikernel.io> wrote:
> >
> > On Fri, Jun 12, 2026 at 09:01:43AM -0700, Alexei Starovoitov wrote:
> > > Just saying that the code is free nowadays, so whether it's 1k lines
> > > or 10 lines is irrelevant for the discussion.
> > >
> > > As far as the idea goes, I think, it would be interesting in pre-AI era,
> > > but today splice and friends are a prime target for bugs and more bugs.
> > > skmsg and tcp_bpf are reeling from unfixed bugs too,
> > > so my take is that we should not add any new features to skmsg
> > > and instead deprecate what is already there.
> >
> > I guess maybe the name misleads you, it has nothing related to splice()
> > syscall. Its ring buffer was developed on top of include/linux/circ_buf.h
> > which again has nothing related to splice()/vmsplice()/pipe().
> >
> > In case it is not obvious, this patchset does not add any new user-space
> > interface, only a kfunc which is visible to only sockmap eBPF programs
> > which already require CAP_BPF privilege.
> 
> Not the name, but the concept. Taking from one socket and feeding
> into another already caused a ton of issues for the networking stack.
> If you can convince Kuba we can entertain it.

If you could be specific and provide examples, I could give a better
answer and take better action.

Until then, all I can say is that Copy Fail leverages page *references*,
while bpf_sock_splice_pair() shares no pages: it is a private kernel
allocation, with no pipe_buffer or page-cache involvement at all. Probably
the main thing these two have in common is the name "splice".

In fact, it has 2 copies (not 0, not 1) by design, see details here:
https://multikernel.io/2026/06/11/bpf-sock-splice-pair-two-copies/

Or if you mean that skmsg or sockmap has a lot of bugs, that is true, but
it is mostly due to TLS (whose codebase is already a mess) and the
complexity of skmsg itself; none of that is related to
bpf_sock_splice_pair().

For your reference, this is the data sheet I collected with AI:

  ┌─────────────────────┬─────────┬──────────┐
  │ Code path the fixes │  ~Fix   │ Splice   │
  │       live in       │ commits │  ring    │
  │                     │         │ uses it? │
  ├─────────────────────┼─────────┼──────────┤
  │ sk_msg / verdict /  │         │          │
  │ strparser / skb     │     ~59 │    No    │
  │ redirect            │         │          │
  ├─────────────────────┼─────────┼──────────┤
  │ TLS / ULP layering  │       8 │    No    │
  ├─────────────────────┼─────────┼──────────┤
  │ psock / sock_map    │         │          │
  │ teardown (close,    │     ~10 │   Yes    │
  │ unhash, destroy,    │         │          │
  │ replace, free)      │         │          │
  └─────────────────────┴─────────┴──────────┘

Thanks for your comments!
Cong

* Re: [PATCH net-next v13 12/15] quic: add crypto packet encryption and decryption
From: Xin Long @ 2026-06-12 20:23 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, quic, davem, kuba, edumazet, horms, metze, mbuhl,
	tfanelli, hepengtao, dreibh, linux-cifs, smfrench, linkinjeon,
	tom, kernel-tls-handshake, chuck.lever, jlayton, steved, hare,
	aahringo, dhowells, matttbe, mail, xiyou.wangcong, alibuda,
	jbaron, illiliti, sd, marcelo.leitner, daniel, andrew.gospodarek,
	mef, paul
In-Reply-To: <20260611081948.137325-1-pabeni@redhat.com>

On Thu, Jun 11, 2026 at 4:20 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> From: AI Reviewer <ai@example.com>
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://netdev-ai.bots.linux.dev/sashiko/
> ---
> quic: add crypto packet encryption and decryption
>
> This patch adds AEAD-based packet encryption/decryption helpers, QUIC
> header protection, and Retry/address-validation token generation and
> verification primitives.
>
> > diff --git a/net/quic/crypto.c b/net/quic/crypto.c
> > index c68a0b570c72..77e633096c2a 100644
> > --- a/net/quic/crypto.c
> > +++ b/net/quic/crypto.c
>
> [ ... ]
>
> > +/* Retry Packet Integrity. */
> > +int quic_crypto_get_retry_tag(struct quic_crypto *crypto, struct sk_buff *skb,
> > +                           struct quic_conn_id *odcid, u32 version, u8 *tag)
> > +{
> > +     /* Reuse RX AEAD (phase 1) in Initial crypto. */
> > +     struct crypto_aead *tfm = crypto->rx_tfm[1];
> > +     u8 *pseudo_retry, *p, *iv, *key;
> > +     DECLARE_CRYPTO_WAIT(wait);
> > +     struct aead_request *req;
> > +     struct scatterlist *sg;
> > +     u32 plen;
> > +     int err;
> > +
> > +     if (skb->len < QUIC_TAG_LEN)
> > +             return -EINVAL;
>
> [Low]
> The guard rejects skb->len < QUIC_TAG_LEN, but a later comment in the
> same function says:
>
> > +     /* The caller must ensure skb->len > QUIC_TAG_LEN. */
> > +     plen = 1 + odcid->len + skb->len - QUIC_TAG_LEN;
>
> Should the check be <= QUIC_TAG_LEN, or should the comment say >= to
> match the strict-less-than test?
>
> In the boundary case where skb->len == QUIC_TAG_LEN, plen becomes
> 1 + odcid->len, and quic_put_data(p, skb->data, skb->len - QUIC_TAG_LEN)
> copies zero bytes, so the integrity tag would be computed over a
> pseudo-packet containing only the ODCID length and ODCID with no actual
> Retry header bytes.  Is that intended?
>
The callers will make sure skb->len > QUIC_TAG_LEN; the check
(skb->len < QUIC_TAG_LEN) was only added to make the last AI happy.

I will change it to (skb->len <= QUIC_TAG_LEN) to match the comment.
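
The effect of the stricter check on the boundary case can be sketched
as follows. This is a userspace model only: retry_pseudo_len() is a
hypothetical helper, and MODEL_TAG_LEN assumes QUIC_TAG_LEN is 16 bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TAG_LEN 16u	/* assumed value of QUIC_TAG_LEN */

/* Computes the pseudo-packet length: one ODCID length byte, the ODCID,
 * and the Retry packet without its trailing tag. With "<=", a packet that
 * consists of nothing but the tag is rejected instead of yielding an
 * empty Retry header.
 */
static int retry_pseudo_len(uint32_t skb_len, uint8_t odcid_len,
			    uint32_t *plen)
{
	assert(plen != NULL);

	if (skb_len <= MODEL_TAG_LEN)
		return -EINVAL;

	*plen = 1 + odcid_len + skb_len - MODEL_TAG_LEN;
	return 0;
}
```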


> [ ... ]
>
> > +/* Derives a key and IV using HKDF, configures the AEAD transform and performs
> > + * AEAD encryption/decryption for the provided token.
> > + */
> > +static int quic_crypto_token_protect(struct quic_crypto *crypto, u8 *token,
> > +                                  u32 len, u32 adlen, bool enc)
> > +{
> > +     u8 key[TLS_CIPHER_AES_GCM_128_KEY_SIZE], iv[QUIC_IV_LEN], *tiv;
> > +     /* Reuse TX AEAD (phase 1) in Initial crypto. */
> > +     struct crypto_aead *tfm = crypto->tx_tfm[1];
> > +     u32 extra = enc ? QUIC_TAG_LEN : 0;
> > +     struct quic_data srt = {}, k, i;
> > +     DECLARE_CRYPTO_WAIT(wait);
> > +     struct aead_request *req;
> > +     struct scatterlist *sg;
> > +     void *ctx = NULL;
> > +     int err;
> > +
> > +     quic_data(&srt, quic_random_data, QUIC_RANDOM_DATA_LEN);
> > +     quic_data(&k, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
> > +     quic_data(&i, iv, QUIC_IV_LEN);
> > +     err = quic_crypto_keys_derive(crypto->secret_tfm, &srt, &k, &i, NULL,
> > +                                   QUIC_VERSION_V1);
>
> [Critical]
> Can this lead to AES-GCM nonce reuse across all generated tokens?
>
> quic_random_data is a module-global buffer initialized once in
> quic_crypto_init() via get_random_bytes() and never refreshed:
>
>     static u8 quic_random_data[QUIC_RANDOM_DATA_LEN] __read_mostly;
>
> quic_crypto_keys_derive() runs HKDF-Expand-Label over that constant
> secret with constant labels "quic key" and "quic iv", so both the
> derived key and the derived IV are deterministic and identical on every
> invocation of quic_crypto_token_protect().
>
> > +     if (err)
> > +             goto out;
> > +     err = crypto_aead_setauthsize(tfm, QUIC_TAG_LEN);
> > +     if (err)
> > +             goto out;
> > +     err = crypto_aead_setkey(tfm, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
> > +     if (err)
> > +             goto out;
> > +     ctx = quic_crypto_aead_mem_alloc(tfm, 0, &tiv, &req, &sg, 1);
> > +     if (!ctx) {
> > +             err = -ENOMEM;
> > +             goto out;
> > +     }
> > +     memcpy(tiv, iv, QUIC_IV_LEN);
> > +
> > +     sg_init_one(sg, token, len);
> > +     aead_request_set_tfm(req, tfm);
> > +     aead_request_set_ad(req, adlen);
> > +     aead_request_set_crypt(req, sg, sg, len - adlen - extra, tiv);
>
> Every call ends up using the same (K, N) pair under AES-GCM, with no
> per-token random prefix and no packet-number XOR like
> quic_crypto_payload_protect() does via cb->number.
>
> NIST SP 800-38D forbids reuse of (K, N) under GCM; from two valid
> ciphertexts an attacker can recover the GHASH authentication subkey H
> and forge tokens for arbitrary client addresses, which would defeat
> the address validation in quic_crypto_verify_token() and the
> anti-amplification protections in RFC 9000 section 8.
>
> Could a unique per-token nonce be derived (for example from a random
> prefix stored in the token itself), or a nonce-misuse-resistant AEAD
> be used here instead?
>

I currently don't have a better solution for this. We can't keep it in the
socket, because quic_random_data is still needed to validate tokens even
after the socket closes.

Thanks.

> > +     aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> > +                               crypto_req_done, &wait);
> > +     err = enc ? crypto_aead_encrypt(req) : crypto_aead_decrypt(req);
> > +     if (err == -EINPROGRESS || err == -EBUSY)
> > +             err = crypto_wait_req(err, &wait);
> > +
> > +out:
> > +     memzero_explicit(key, sizeof(key));
> > +     memzero_explicit(iv, sizeof(iv));
> > +     kfree_sensitive(ctx);
> > +     return err;
> > +}
>
> [ ... ]
> --
> This is an AI-generated review.
>

* [syzbot] [mm?] KASAN: invalid-free in rcu_free_sheaf
From: syzbot @ 2026-06-12 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev, peterz, syzkaller-bugs, tglx

Hello,

syzbot found the following issue on:

HEAD commit:    0ce346bf2f68 Merge branch 'net-shaper-follow-ups-to-recent..
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=15a50986580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2c40e3284434c6f1
dashboard link: https://syzkaller.appspot.com/bug?extid=60332fd095f8bb2946ad
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/fcb2184c91a0/disk-0ce346bf.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/44a7baa2dfdd/vmlinux-0ce346bf.xz
kernel image: https://storage.googleapis.com/syzbot-assets/8ef85b3aefe3/bzImage-0ce346bf.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+60332fd095f8bb2946ad@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: double-free in rcu_free_sheaf+0x31/0x200 mm/slub.c:5850
Free of addr ffff888026c2d380 by task ksoftirqd/0/15

CPU: 0 UID: 0 PID: 15 Comm: ksoftirqd/0 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report_invalid_free+0xea/0x110 mm/kasan/report.c:557
 check_slab_allocation mm/kasan/common.c:-1 [inline]
 __kasan_slab_pre_free+0x104/0x120 mm/kasan/common.c:261
 kasan_slab_pre_free include/linux/kasan.h:199 [inline]
 slab_free_hook mm/slub.c:2634 [inline]
 __rcu_free_sheaf_prepare+0x10a/0x2a0 mm/slub.c:2940
 rcu_free_sheaf+0x31/0x200 mm/slub.c:5850
 rcu_do_batch kernel/rcu/tree.c:2617 [inline]
 rcu_core+0x7cd/0x1070 kernel/rcu/tree.c:2869
 handle_softirqs+0x22a/0x840 kernel/softirq.c:622
 run_ksoftirqd+0x36/0x60 kernel/softirq.c:1076
 smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

Allocated by task 5613:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
 kasan_kmalloc include/linux/kasan.h:263 [inline]
 __kmalloc_cache_noprof+0x31c/0x660 mm/slub.c:5420
 kmalloc_noprof include/linux/slab.h:950 [inline]
 kzalloc_noprof include/linux/slab.h:1188 [inline]
 hwsim_update_pib+0x88/0x450 drivers/net/ieee802154/mac802154_hwsim.c:101
 hwsim_set_promiscuous_mode+0x196/0x2e0 drivers/net/ieee802154/mac802154_hwsim.c:323
 drv_set_promiscuous_mode+0x159/0x2e0 net/mac802154/driver-ops.h:127
 drv_start net/mac802154/driver-ops.h:195 [inline]
 mac802154_slave_open net/mac802154/iface.c:196 [inline]
 mac802154_wpan_open+0x19b3/0x2a70 net/mac802154/iface.c:295
 __dev_open+0x44d/0x830 net/core/dev.c:1702
 __dev_change_flags+0x2fa/0x7e0 net/core/dev.c:9752
 netif_change_flags+0x88/0x1a0 net/core/dev.c:9815
 do_setlink+0xfa5/0x45a0 net/core/rtnetlink.c:3207
 rtnl_changelink net/core/rtnetlink.c:3841 [inline]
 __rtnl_newlink net/core/rtnetlink.c:4014 [inline]
 rtnl_newlink+0x15ad/0x1bb0 net/core/rtnetlink.c:4151
 rtnetlink_rcv_msg+0x7d5/0xbe0 net/core/rtnetlink.c:7068
 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2556
 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
 netlink_unicast+0x75c/0x8e0 net/netlink/af_netlink.c:1345
 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1900
 sock_sendmsg_nosec net/socket.c:787 [inline]
 __sock_sendmsg net/socket.c:802 [inline]
 __sys_sendto+0x672/0x710 net/socket.c:2265
 __do_sys_sendto net/socket.c:2272 [inline]
 __se_sys_sendto net/socket.c:2268 [inline]
 __x64_sys_sendto+0xde/0x100 net/socket.c:2268
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 15:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
 poison_slab_object mm/kasan/common.c:253 [inline]
 __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
 kasan_slab_free include/linux/kasan.h:235 [inline]
 slab_free_hook mm/slub.c:2689 [inline]
 __rcu_free_sheaf_prepare+0x12d/0x2a0 mm/slub.c:2940
 rcu_free_sheaf+0x31/0x200 mm/slub.c:5850
 rcu_do_batch kernel/rcu/tree.c:2617 [inline]
 rcu_core+0x7cd/0x1070 kernel/rcu/tree.c:2869
 handle_softirqs+0x22a/0x840 kernel/softirq.c:622
 run_ksoftirqd+0x36/0x60 kernel/softirq.c:1076
 smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

The buggy address belongs to the object at ffff888026c2d380
 which belongs to the cache kmalloc-64 of size 64
The buggy address is located 0 bytes inside of
 64-byte region [ffff888026c2d380, ffff888026c2d3c0)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888026c2d780 pfn:0x26c2d
flags: 0xfff00000000200(workingset|node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000200 ffff88813fe1a8c0 ffffea0000d74490 ffffea0001fcda50
raw: ffff888026c2d780 0000000800200010 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 137, tgid 137 (kworker/u8:5), ts 5542876120, free_ts 5542851482
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
 prep_new_page mm/page_alloc.c:1861 [inline]
 get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
 alloc_slab_page mm/slub.c:3278 [inline]
 allocate_slab+0x77/0x660 mm/slub.c:3467
 new_slab mm/slub.c:3525 [inline]
 refill_objects+0x339/0x3d0 mm/slub.c:7272
 refill_sheaf mm/slub.c:2816 [inline]
 __pcs_replace_empty_main+0x321/0x720 mm/slub.c:4652
 alloc_from_pcs mm/slub.c:4750 [inline]
 slab_alloc_node mm/slub.c:4884 [inline]
 __do_kmalloc_node mm/slub.c:5295 [inline]
 __kmalloc_node_noprof+0x577/0x7c0 mm/slub.c:5302
 kmalloc_node_noprof include/linux/slab.h:1081 [inline]
 __vmalloc_area_node mm/vmalloc.c:3857 [inline]
 __vmalloc_node_range_noprof+0x5ef/0x1750 mm/vmalloc.c:4064
 __vmalloc_node_noprof+0xc2/0x100 mm/vmalloc.c:4124
 alloc_thread_stack_node kernel/fork.c:357 [inline]
 dup_task_struct+0x298/0x840 kernel/fork.c:926
 copy_process+0x89b/0x4440 kernel/fork.c:2090
 kernel_clone+0x2d7/0x940 kernel/fork.c:2722
 user_mode_thread+0x110/0x180 kernel/fork.c:2798
 call_usermodehelper_exec_work+0x5c/0x230 kernel/umh.c:171
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
page last free pid 137 tgid 137 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1397 [inline]
 __free_frozen_pages+0xc1c/0xd30 mm/page_alloc.c:2938
 __kasan_populate_vmalloc_do mm/kasan/shadow.c:393 [inline]
 __kasan_populate_vmalloc+0x1b2/0x1d0 mm/kasan/shadow.c:424
 kasan_populate_vmalloc include/linux/kasan.h:580 [inline]
 alloc_vmap_area+0xd47/0x1480 mm/vmalloc.c:2123
 __get_vm_area_node+0x1f8/0x300 mm/vmalloc.c:3226
 __vmalloc_node_range_noprof+0x36a/0x1750 mm/vmalloc.c:4024
 __vmalloc_node_noprof+0xc2/0x100 mm/vmalloc.c:4124
 alloc_thread_stack_node kernel/fork.c:357 [inline]
 dup_task_struct+0x298/0x840 kernel/fork.c:926
 copy_process+0x89b/0x4440 kernel/fork.c:2090
 kernel_clone+0x2d7/0x940 kernel/fork.c:2722
 user_mode_thread+0x110/0x180 kernel/fork.c:2798
 call_usermodehelper_exec_work+0x5c/0x230 kernel/umh.c:171
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Memory state around the buggy address:
 ffff888026c2d280: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff888026c2d300: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888026c2d380: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                   ^
 ffff888026c2d400: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff888026c2d480: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

* Re: [PATCH net-next v13 15/15] quic: add packet parser base
From: Xin Long @ 2026-06-12 20:48 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, quic, davem, kuba, edumazet, horms, metze, mbuhl,
	tfanelli, hepengtao, dreibh, linux-cifs, smfrench, linkinjeon,
	tom, kernel-tls-handshake, chuck.lever, jlayton, steved, hare,
	aahringo, dhowells, matttbe, mail, xiyou.wangcong, alibuda,
	jbaron, illiliti, sd, marcelo.leitner, daniel, andrew.gospodarek,
	mef, paul
In-Reply-To: <20260611082008.138377-1-pabeni@redhat.com>

On Thu, Jun 11, 2026 at 4:20 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> From: AI Reviewer <ai@example.com>
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://netdev-ai.bots.linux.dev/sashiko/
> ---
> quic: add packet parser base
>
> This patch adds the QUIC packet RX parser, including ALPN extraction
> from the Client Initial to demux to the right listener, an ICMP-driven
> PMTU update path, and a per-net backlog/work for deferred parsing.
>
> > diff --git a/net/quic/packet.c b/net/quic/packet.c
> > index ba5204aceda6..430fc06d0012 100644
> > --- a/net/quic/packet.c
> > +++ b/net/quic/packet.c
>
> [ ... ]
>
> > +static int quic_packet_get_alpn(struct quic_data *alpn, u8 *p, u32 len)
> > +{
> > +     int err = -EINVAL, found = 0, exts = 0;
> > +     u64 length, type;
> > +
>
> [ ... ]
>
> > +     while (len >= 4) { /* Scan extensions for ALPN (TLS_EXT_alpn). */
> > +             if (!quic_get_int(&p, &len, &type, 2))
> > +                     break;
> > +             if (!quic_get_int(&p, &len, &length, 2))
> > +                     break;
> > +             if (len < (u32)length) /* Incomplete TLS extensions. */
> > +                     return 0;
>
> [Medium]
> Should this branch be treated as a parse failure rather than a success?
>

The rest of the TLS extensions could be in the CRYPTO frame of
the next Initial packet.

Returning an error here would cause the callers to drop the packet.

> The other length-bound checks in this same function return err (-EINVAL)
> when a declared length exceeds the remaining buffer, but this one returns
> 0 without populating alpn.

Split TLS CLIENT_HELLO messages are very rare.
The ones I have seen from some QUIC implementations were split at these
TLS extension boundaries. I will improve this if I see other split points.

> The callers initialise alpn as
>
>         struct quic_data alpns = {};
>
> and quic_listen_sock_lookup() then takes the no-ALPN branch:
>
>     if (!alpns->len) { /* No ALPNs or parse failed */
>         sk_nulls_for_each_rcu(tmp, node, &head->head) {
>             ...
>             if (... && (!alpns->data || !quic_alpn(tmp)->len)) {
>
> Because alpns->data is still NULL, the match collapses to true and any
> listener bound to the address/port will match, including ones that
> registered a specific ALPN list. Can a peer reach an ALPN-restricted
> listener by sending a Client Initial with one extension whose declared
> length is larger than what is left in the ClientHello?
>
> Would returning err here (or always assigning alpn->data before
> returning) match the comment in quic_listen_sock_lookup() that
> distinguishes "TLS parsing succeeded but no ALPN was found" from "parse
> failed"?
>
ALPN demux is a best-effort feature that tries to dispatch the first Initial
to the right process. If it goes to the "incorrect" process because the peer
doesn't provide proper ALPNs, the userspace handshake will reject it.

It is the peer's responsibility: if it wants to be handled by the
right process, it should send the right ALPNs.

> > +             if (type == TLS_EXT_alpn) { /* Found ALPN extension. */
> > +                     if (length > QUIC_ALPN_MAX_LEN)
> > +                             return err;
> > +                     len = length;
> > +                     found = 1;
> > +                     break;
> > +             }
> > +             /* Skip non-ALPN extensions. */
> > +             p += length;
> > +             len -= length;
> > +             if (exts++ >= TLS_MAX_EXTENSIONS)
> > +                     return err;
>
> [Low]
> Is the post-increment intended here? With TLS_MAX_EXTENSIONS == 128, the
> check (exts++ >= 128) first becomes true when exts has already reached
> 129, so 129 extensions are consumed before bail-out even though the
> constant name suggests a cap of 128. Would
>
>         if (++exts > TLS_MAX_EXTENSIONS)
>                 return err;
>
> (or moving the check to the loop top) better reflect the intended limit?
>

will change to ++exts.

> > +     }
>
> [ ... ]
>
> > +static int quic_packet_parse_alpn(struct sk_buff *skb, struct quic_data *alpn)
>
> [ ... ]
>
> > diff --git a/net/quic/socket.c b/net/quic/socket.c
> > index 8f348d7c678b..19b164baf55d 100644
> > --- a/net/quic/socket.c
> > +++ b/net/quic/socket.c
>
> [ ... ]
>
> > +struct sock *quic_sock_lookup(struct sk_buff *skb, union quic_addr *sa,
> > +                           union quic_addr *da, struct sock *usk,
> > +                           struct quic_conn_id *dcid)
> > +{
>
> [ ... ]
>
> > +     rcu_read_lock();
> > +begin:
> > +     sk_nulls_for_each_rcu(tmp, node, &head->head) {
> > +             if (net != sock_net(tmp))
> > +                     continue;
> > +             paths = quic_paths(tmp);
> > +             odcid = quic_path_orig_dcid(paths);
> > +             if (quic_cmp_sk_addr(tmp, quic_path_saddr(paths, 0), sa) &&
> > +                 quic_cmp_sk_addr(tmp, quic_path_daddr(paths, 0), da) &&
> > +                 quic_path_usock(paths, 0) == usk &&
> > +                 (!dcid || !quic_conn_id_cmp(odcid, dcid))) {
> > +                     sk = tmp;
> > +                     break;
> > +             }
> > +     }
>
> [High]
> Is the sk memory itself RCU-protected here? sk_nulls_for_each_rcu()
> dereferences fields of tmp (sock_net(tmp), quic_paths(tmp),
> quic_path_saddr(...), etc.) before any reference is taken. That is only
> safe when the sk allocation is freed via RCU.
>
> quic_init_sock() does not call sock_set_flag(sk, SOCK_RCU_FREE), and
> neither quic_prot nor quicv6_prot sets slab_flags to SLAB_TYPESAFE_BY_RCU,
> so sk_destruct() (and quic_sock_destruct()) runs synchronously on the
> last sock_put(). Can a parallel close on tmp free the sock between
> sk_nulls_for_each_rcu() exposing it and the field dereferences above?
>
> The same shape appears in quic_listen_sock_lookup() further down, where
> quic_alpn(tmp)->len is read inside the RCU walk before
> refcount_inc_not_zero().
>

SOCK_RCU_FREE will be set in the second patchset via quic_hash().

> > +     if (sk && unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > +             sk = NULL;
> > +     rcu_read_unlock();
> > +     return sk;
> > +}
>
> [ ... ]
>
> > +struct sock *quic_listen_sock_lookup(struct sk_buff *skb, union quic_addr *sa,
> > +                                  union quic_addr *da,
> > +                                  struct quic_data *alpns)
> > +{
>
> [ ... ]
>
> > +     rcu_read_lock();
> > +     if (!alpns->len) { /* No ALPNs or parse failed */
> > +             sk_nulls_for_each_rcu(tmp, node, &head->head) {
>
> [ ... ]
>
> > +                     a = quic_path_saddr(quic_paths(tmp), 0);
> > +                     if (net == sock_net(tmp) &&
> > +                         quic_cmp_sk_addr(tmp, a, sa) &&
> > +                         quic_path_usock(quic_paths(tmp), 0) == skb->sk &&
> > +                         (!alpns->data || !quic_alpn(tmp)->len)) {
>
> [ ... ]
>
> > @@ -48,6 +196,9 @@ static void quic_sock_destruct(struct sock *sk)
> >       for (i = 0; i < QUIC_CRYPTO_MAX; i++)
> >               quic_crypto_free(quic_crypto(sk, i));
> >
> > +     /* Deferred ALPN free for RCU readers in quic_listen_sock_lookup(). */
> > +     quic_data_free(quic_alpn(sk));
> > +
>
> Without SOCK_RCU_FREE or SLAB_TYPESAFE_BY_RCU on quic_prot/quicv6_prot,
> does moving the alpn free into the destruct callback actually defer it
> past an RCU grace period? quic_sock_destruct() is invoked from
> sk_destruct() on the final sock_put() and is not itself an RCU callback,
> so the comment's "Deferred ALPN free for RCU readers" assumption seems to
> rely on a guarantee that is not configured on this protocol. Should this
> patch also set sock_set_flag(sk, SOCK_RCU_FREE) in quic_init_sock(), or
> alternatively take refcount_inc_not_zero() in the lookup helpers before
> touching any fields of tmp?
>

SOCK_RCU_FREE will be set in the second patchset via quic_hash().

> >       quic_sk_destruct(sk);
> >  }
> >
>
> [ ... ]
> --
> This is an AI-generated review.
>


* [PATCH net-next 0/3] net: bcmgenet: collapse TX priority queues
From: Nicolai Buchwitz @ 2026-06-12 20:59 UTC
  To: Doug Berger, Florian Fainelli, bcm-kernel-feedback-list,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Justin Chen, Ovidiu Panait, netdev, linux-kernel,
	Nicolai Buchwitz

The strict-priority TX queues can starve under multi-queue load and
trip NETDEV_WATCHDOG. Justin's earlier series [1] tried to mitigate
the timeouts but kept the multi-queue design. Ovidiu Panait recently
proposed a WRR stop-gap [2]. This series drops the priority queues
entirely. Justin confirmed they are no longer required.

Patch 1 collapses v2-v4 hw_params to the same single-queue path v1
already uses. Patch 2 removes the now-dead priority register writes,
helper macros, and the dead "flow period for ring != 0" branch in
bcmgenet_init_tx_ring(); the DMA_ARBITER_{RR,WRR,SP} and
DMA_RING_BUF_PRIORITY_* HW defines are kept as register
documentation. Patch 3 switches the netdev allocation from
alloc_etherdev_mqs(., 5, 5) to alloc_etherdev(), since only one
TX/RX queue is ever used.

Tested on Raspberry Pi CM4 (BCM2711):
  - Ovidiu's reproducer (iperf3 -u -b0 -P16 -t60) no longer trips
    NETDEV_WATCHDOG.
  - UDP sustains 956 Mbit/s line rate over 60 s with 0 datagrams
    lost (0/4952890).
  - Single-stream TCP throughput unchanged at 943 Mbit/s.

[1] https://lore.kernel.org/netdev/20260406175756.134567-1-justin.chen@broadcom.com/
[2] https://lore.kernel.org/netdev/20260610085238.56300-1-ovidiu.panait.rb@renesas.com/

Nicolai Buchwitz (3):
  net: bcmgenet: collapse TX priority queues to a single queue
  net: bcmgenet: remove dead priority queue plumbing
  net: bcmgenet: allocate a single-queue netdev

 .../net/ethernet/broadcom/genet/bcmgenet.c    | 100 +++---------------
 .../net/ethernet/broadcom/genet/bcmgenet.h    |   2 -
 2 files changed, 16 insertions(+), 86 deletions(-)

-- 
2.53.0



* [PATCH net-next 1/3] net: bcmgenet: collapse TX priority queues to a single queue
From: Nicolai Buchwitz @ 2026-06-12 20:59 UTC
  To: Doug Berger, Florian Fainelli, bcm-kernel-feedback-list,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Justin Chen, Ovidiu Panait, netdev, linux-kernel,
	Nicolai Buchwitz
In-Reply-To: <20260612205915.3156127-1-nb@tipi-net.de>

The strict-priority TX queues can starve under multi-queue load and
trip NETDEV_WATCHDOG. Justin's earlier series [1] worked around the
symptom but kept the design.

The multi-queue design was originally used for STB use cases that are
no longer needed, as confirmed by Justin. v1 hw_params already
exercises a single-queue path. Point v2-v4 at the same configuration:
ring 0 takes the full BD pool, every per-ring loop collapses to one
iteration, and netif_set_real_num_tx_queues drops to 1 via the
existing tx_queues + 1 arithmetic.

Tested on Raspberry Pi CM4 (BCM2711). The baseline kernel trips
NETDEV_WATCHDOG within seconds under iperf3 UDP saturation
(-u -b0 -P16 -t60). After the change the same test completes
without a watchdog, and a single-stream 60 s UDP run sustains
956 Mbit/s with 0/4952890 datagrams lost. Single-stream TCP
throughput is unchanged at 943 Mbit/s.

[1] https://lore.kernel.org/netdev/20260406175756.134567-1-justin.chen@broadcom.com/

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
---
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index ca403581357d..c892734b4cd0 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -3751,8 +3751,8 @@ static const struct bcmgenet_hw_params bcmgenet_hw_params_v1 = {
 };
 
 static const struct bcmgenet_hw_params bcmgenet_hw_params_v2 = {
-	.tx_queues = 4,
-	.tx_bds_per_q = 32,
+	.tx_queues = 0,
+	.tx_bds_per_q = 0,
 	.rx_queues = 0,
 	.rx_bds_per_q = 0,
 	.bp_in_en_shift = 16,
@@ -3769,8 +3769,8 @@ static const struct bcmgenet_hw_params bcmgenet_hw_params_v2 = {
 };
 
 static const struct bcmgenet_hw_params bcmgenet_hw_params_v3 = {
-	.tx_queues = 4,
-	.tx_bds_per_q = 32,
+	.tx_queues = 0,
+	.tx_bds_per_q = 0,
 	.rx_queues = 0,
 	.rx_bds_per_q = 0,
 	.bp_in_en_shift = 17,
@@ -3787,8 +3787,8 @@ static const struct bcmgenet_hw_params bcmgenet_hw_params_v3 = {
 };
 
 static const struct bcmgenet_hw_params bcmgenet_hw_params_v4 = {
-	.tx_queues = 4,
-	.tx_bds_per_q = 32,
+	.tx_queues = 0,
+	.tx_bds_per_q = 0,
 	.rx_queues = 0,
 	.rx_bds_per_q = 0,
 	.bp_in_en_shift = 17,
-- 
2.53.0



* [PATCH net-next 2/3] net: bcmgenet: remove dead priority queue plumbing
From: Nicolai Buchwitz @ 2026-06-12 20:59 UTC
  To: Doug Berger, Florian Fainelli, bcm-kernel-feedback-list,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Justin Chen, Ovidiu Panait, netdev, linux-kernel,
	Nicolai Buchwitz
In-Reply-To: <20260612205915.3156127-1-nb@tipi-net.de>

With a single TX ring there is nothing left to prioritize. Drop the
unused register writes, enum entries, helper macros, and the dead
"flow period for ring != 0" branch in bcmgenet_init_tx_ring().

The DMA_ARBITER_{RR,WRR,SP} and DMA_RING_BUF_PRIORITY_* HW defines
are kept as register documentation.

No functional change.

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
---
 .../net/ethernet/broadcom/genet/bcmgenet.c    | 84 ++-----------------
 .../net/ethernet/broadcom/genet/bcmgenet.h    |  2 -
 2 files changed, 9 insertions(+), 77 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index c892734b4cd0..25f339eb304f 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -40,10 +40,6 @@
 
 #include "bcmgenet.h"
 
-/* Default highest priority queue for multi queue support */
-#define GENET_Q1_PRIORITY	0
-#define GENET_Q0_PRIORITY	1
-
 #define GENET_Q0_RX_BD_CNT	\
 	(TOTAL_DESC - priv->hw_params->rx_queues * priv->hw_params->rx_bds_per_q)
 #define GENET_Q0_TX_BD_CNT	\
@@ -187,10 +183,6 @@ enum dma_reg {
 	DMA_CTRL,
 	DMA_STATUS,
 	DMA_SCB_BURST_SIZE,
-	DMA_ARB_CTRL,
-	DMA_PRIORITY_0,
-	DMA_PRIORITY_1,
-	DMA_PRIORITY_2,
 	DMA_INDEX2RING_0,
 	DMA_INDEX2RING_1,
 	DMA_INDEX2RING_2,
@@ -223,10 +215,6 @@ static const u8 bcmgenet_dma_regs_v3plus[] = {
 	[DMA_CTRL]		= 0x04,
 	[DMA_STATUS]		= 0x08,
 	[DMA_SCB_BURST_SIZE]	= 0x0C,
-	[DMA_ARB_CTRL]		= 0x2C,
-	[DMA_PRIORITY_0]	= 0x30,
-	[DMA_PRIORITY_1]	= 0x34,
-	[DMA_PRIORITY_2]	= 0x38,
 	[DMA_RING0_TIMEOUT]	= 0x2C,
 	[DMA_RING1_TIMEOUT]	= 0x30,
 	[DMA_RING2_TIMEOUT]	= 0x34,
@@ -259,10 +247,6 @@ static const u8 bcmgenet_dma_regs_v2[] = {
 	[DMA_CTRL]		= 0x04,
 	[DMA_STATUS]		= 0x08,
 	[DMA_SCB_BURST_SIZE]	= 0x0C,
-	[DMA_ARB_CTRL]		= 0x30,
-	[DMA_PRIORITY_0]	= 0x34,
-	[DMA_PRIORITY_1]	= 0x38,
-	[DMA_PRIORITY_2]	= 0x3C,
 	[DMA_RING0_TIMEOUT]	= 0x2C,
 	[DMA_RING1_TIMEOUT]	= 0x30,
 	[DMA_RING2_TIMEOUT]	= 0x34,
@@ -286,10 +270,6 @@ static const u8 bcmgenet_dma_regs_v1[] = {
 	[DMA_CTRL]		= 0x00,
 	[DMA_STATUS]		= 0x04,
 	[DMA_SCB_BURST_SIZE]	= 0x0C,
-	[DMA_ARB_CTRL]		= 0x30,
-	[DMA_PRIORITY_0]	= 0x34,
-	[DMA_PRIORITY_1]	= 0x38,
-	[DMA_PRIORITY_2]	= 0x3C,
 	[DMA_RING0_TIMEOUT]	= 0x2C,
 	[DMA_RING1_TIMEOUT]	= 0x30,
 	[DMA_RING2_TIMEOUT]	= 0x34,
@@ -2126,13 +2106,6 @@ static netdev_tx_t bcmgenet_xmit(struct sk_buff *skb, struct net_device *dev)
 	int i;
 
 	index = skb_get_queue_mapping(skb);
-	/* Mapping strategy:
-	 * queue_mapping = 0, unclassified, packet xmited through ring 0
-	 * queue_mapping = 1, goes to ring 1. (highest priority queue)
-	 * queue_mapping = 2, goes to ring 2.
-	 * queue_mapping = 3, goes to ring 3.
-	 * queue_mapping = 4, goes to ring 4.
-	 */
 	ring = &priv->tx_rings[index];
 	txq = netdev_get_tx_queue(dev, index);
 
@@ -2712,7 +2685,6 @@ static void bcmgenet_init_tx_ring(struct bcmgenet_priv *priv,
 {
 	struct bcmgenet_tx_ring *ring = &priv->tx_rings[index];
 	u32 words_per_bd = WORDS_PER_BD(priv);
-	u32 flow_period_val = 0;
 
 	spin_lock_init(&ring->lock);
 	ring->priv = priv;
@@ -2727,16 +2699,11 @@ static void bcmgenet_init_tx_ring(struct bcmgenet_priv *priv,
 	ring->end_ptr = end_ptr - 1;
 	ring->prod_index = 0;
 
-	/* Set flow period for ring != 0 */
-	if (index)
-		flow_period_val = ENET_MAX_MTU_SIZE << 16;
-
 	bcmgenet_tdma_ring_writel(priv, index, 0, TDMA_PROD_INDEX);
 	bcmgenet_tdma_ring_writel(priv, index, 0, TDMA_CONS_INDEX);
 	bcmgenet_tdma_ring_writel(priv, index, 1, DMA_MBUF_DONE_THRESH);
-	/* Disable rate control for now */
-	bcmgenet_tdma_ring_writel(priv, index, flow_period_val,
-				  TDMA_FLOW_PERIOD);
+	/* Rate control disabled */
+	bcmgenet_tdma_ring_writel(priv, index, 0, TDMA_FLOW_PERIOD);
 	bcmgenet_tdma_ring_writel(priv, index,
 				  ((size << DMA_RING_SIZE_SHIFT) |
 				   RX_BUF_LENGTH), DMA_RING_BUF_SIZE);
@@ -2919,52 +2886,20 @@ static int bcmgenet_rdma_disable(struct bcmgenet_priv *priv)
 	return -ETIMEDOUT;
 }
 
-/* Initialize Tx queues
+/* Initialize the single Tx queue.
  *
- * Queues 1-4 are priority-based, each one has 32 descriptors,
- * with queue 1 being the highest priority queue.
- *
- * Queue 0 is the default Tx queue with
- * GENET_Q0_TX_BD_CNT = 256 - 4 * 32 = 128 descriptors.
- *
- * The transmit control block pool is then partitioned as follows:
- * - Tx queue 0 uses tx_cbs[0..127]
- * - Tx queue 1 uses tx_cbs[128..159]
- * - Tx queue 2 uses tx_cbs[160..191]
- * - Tx queue 3 uses tx_cbs[192..223]
- * - Tx queue 4 uses tx_cbs[224..255]
+ * Queue 0 owns the full TX descriptor pool (GENET_Q0_TX_BD_CNT BDs)
+ * and is the only ring enabled in DMA_RING_CFG / DMA_CTRL.
  */
 static void bcmgenet_init_tx_queues(struct net_device *dev)
 {
 	struct bcmgenet_priv *priv = netdev_priv(dev);
-	unsigned int start = 0, end = GENET_Q0_TX_BD_CNT;
-	u32 i, ring_mask, dma_priority[3] = {0, 0, 0};
-
-	/* Enable strict priority arbiter mode */
-	bcmgenet_tdma_writel(priv, DMA_ARBITER_SP, DMA_ARB_CTRL);
 
-	/* Initialize Tx priority queues */
-	for (i = 0; i <= priv->hw_params->tx_queues; i++) {
-		bcmgenet_init_tx_ring(priv, i, end - start, start, end);
-		start = end;
-		end += priv->hw_params->tx_bds_per_q;
-		dma_priority[DMA_PRIO_REG_INDEX(i)] |=
-			(i ? GENET_Q1_PRIORITY : GENET_Q0_PRIORITY)
-			<< DMA_PRIO_REG_SHIFT(i);
-	}
+	bcmgenet_init_tx_ring(priv, 0, GENET_Q0_TX_BD_CNT, 0,
+			      GENET_Q0_TX_BD_CNT);
 
-	/* Set Tx queue priorities */
-	bcmgenet_tdma_writel(priv, dma_priority[0], DMA_PRIORITY_0);
-	bcmgenet_tdma_writel(priv, dma_priority[1], DMA_PRIORITY_1);
-	bcmgenet_tdma_writel(priv, dma_priority[2], DMA_PRIORITY_2);
-
-	/* Configure Tx queues as descriptor rings */
-	ring_mask = (1 << (priv->hw_params->tx_queues + 1)) - 1;
-	bcmgenet_tdma_writel(priv, ring_mask, DMA_RING_CFG);
-
-	/* Enable Tx rings */
-	ring_mask <<= DMA_RING_BUF_EN_SHIFT;
-	bcmgenet_tdma_writel(priv, ring_mask, DMA_CTRL);
+	bcmgenet_tdma_writel(priv, BIT(0), DMA_RING_CFG);
+	bcmgenet_tdma_writel(priv, BIT(DMA_RING_BUF_EN_SHIFT), DMA_CTRL);
 }
 
 static void bcmgenet_enable_rx_napi(struct bcmgenet_priv *priv)
@@ -4123,7 +4058,6 @@ static int bcmgenet_probe(struct platform_device *pdev)
 	if (err)
 		goto err_clk_disable;
 
-	/* setup number of real queues + 1 */
 	netif_set_real_num_tx_queues(priv->dev, priv->hw_params->tx_queues + 1);
 	netif_set_real_num_rx_queues(priv->dev, priv->hw_params->rx_queues + 1);
 
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
index 22a958ba9902..ce449ea0b40b 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
@@ -431,8 +431,6 @@ struct bcmgenet_rx_stats64 {
 #define DMA_ARBITER_MODE_MASK		0x03
 #define DMA_RING_BUF_PRIORITY_MASK	0x1F
 #define DMA_RING_BUF_PRIORITY_SHIFT	5
-#define DMA_PRIO_REG_INDEX(q)		((q) / 6)
-#define DMA_PRIO_REG_SHIFT(q)		(((q) % 6) * DMA_RING_BUF_PRIORITY_SHIFT)
 #define DMA_RATE_ADJ_MASK		0xFF
 
 /* Tx/Rx Dma Descriptor common bits*/
-- 
2.53.0



* [PATCH net-next 3/3] net: bcmgenet: allocate a single-queue netdev
From: Nicolai Buchwitz @ 2026-06-12 20:59 UTC
  To: Doug Berger, Florian Fainelli, bcm-kernel-feedback-list,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Justin Chen, Ovidiu Panait, netdev, linux-kernel,
	Nicolai Buchwitz
In-Reply-To: <20260612205915.3156127-1-nb@tipi-net.de>

The driver only uses TX ring 0 and RX ring 0, so allocating a netdev
with GENET_MAX_MQ_CNT + 1 = 5 TX and 5 RX slots leaves four of each
unused. Switch to alloc_etherdev() which allocates exactly one queue
of each kind.

No functional change: netif_set_real_num_{tx,rx}_queues() already
clamps the visible queue count to 1.

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
---
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 25f339eb304f..001bd445b110 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -3915,9 +3915,7 @@ static int bcmgenet_probe(struct platform_device *pdev)
 	unsigned int i;
 	int err = -EIO;
 
-	/* Up to GENET_MAX_MQ_CNT + 1 TX queues and RX queues */
-	dev = alloc_etherdev_mqs(sizeof(*priv), GENET_MAX_MQ_CNT + 1,
-				 GENET_MAX_MQ_CNT + 1);
+	dev = alloc_etherdev(sizeof(*priv));
 	if (!dev) {
 		dev_err(&pdev->dev, "can't allocate net device\n");
 		return -ENOMEM;
-- 
2.53.0



* Re: [PATCH net-next v13 00/15] net: introduce QUIC infrastructure and core subcomponents
From: Xin Long @ 2026-06-12 21:06 UTC
  To: network dev, quic
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
	Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
	Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
	Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
	Steve Dickson, Hannes Reinecke, Alexander Aring, David Howells,
	Matthieu Baerts, John Ericson, Cong Wang, D . Wythe, Jason Baron,
	illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
	Daniel Stenberg, Andy Gospodarek, mef, paul
In-Reply-To: <cover.1780855296.git.lucien.xin@gmail.com>

On Sun, Jun 7, 2026 at 2:03 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> Introduction
> ============
>
> The QUIC protocol, defined in RFC 9000, is a secure, multiplexed transport
> built on top of UDP. It enables low-latency connection establishment,
> stream-based communication with flow control, and supports connection
> migration across network paths, while ensuring confidentiality, integrity,
> and availability.
>
> This implementation introduces QUIC support in Linux Kernel, offering
> several key advantages:
>
> - In-Kernel QUIC Support for Subsystems: Enables kernel subsystems
>   such as SMB and NFS to operate over QUIC with minimal changes. Once the
>   handshake is complete via the net/handshake APIs, data exchange proceeds
>   over standard in-kernel transport interfaces.
>
> - Standard Socket API Semantics: Implements core socket operations
>   (listen(), accept(), connect(), sendmsg(), recvmsg(), close(),
>   getsockopt(), setsockopt(), getsockname(), and getpeername()),
>   allowing user space to interact with QUIC sockets in a familiar,
>   POSIX-compliant way.
>
> - ALPN-Based Connection Dispatching: Supports in-kernel ALPN
>   (Application-Layer Protocol Negotiation) routing, allowing demultiplexing
>   of QUIC connections across different user-space processes based
>   on the ALPN identifiers.
>
> - Performance Enhancements: Handles all control messages in-kernel
>   to reduce syscall overhead, incorporates zero-copy mechanisms such as
>   sendfile() to minimize data movement, and is also structured to support
>   future crypto hardware offloads.
>
> This implementation offers fundamental support for the following RFCs:
>
> - RFC9000 - QUIC: A UDP-Based Multiplexed and Secure Transport
> - RFC9001 - Using TLS to Secure QUIC
> - RFC9002 - QUIC Loss Detection and Congestion Control
> - RFC9221 - An Unreliable Datagram Extension to QUIC
> - RFC9287 - Greasing the QUIC Bit
> - RFC9368 - Compatible Version Negotiation for QUIC
> - RFC9369 - QUIC Version 2
>
> The socket APIs for QUIC follow the RFC draft [1]:
>
> - The Sockets API Extensions for In-kernel QUIC Implementations
>
> Implementation
> ==============
>
> The central design is to implement QUIC within the kernel while delegating
> the handshake to userspace.
>
> Only the processing and creation of raw TLS Handshake Messages are handled
> in userspace, facilitated by a TLS library like GnuTLS. These messages are
> exchanged between kernel and userspace via sendmsg() and recvmsg(), with
> cryptographic details conveyed through control messages (cmsg).
>
> The entire QUIC protocol, aside from the TLS Handshake Messages processing
> and creation, is managed in the kernel. Rather than using an Upper Layer
> Protocol (ULP) layer, this implementation establishes a socket of type
> IPPROTO_QUIC (similar to IPPROTO_MPTCP), operating over UDP tunnels.
>
> For kernel consumers, they can initiate a handshake request from the kernel
> to userspace using the existing net/handshake netlink. The userspace
> component, such as tlshd service [2], then manages the processing
> of the QUIC handshake request.
>
> - Handshake Architecture:
>
>   ┌──────┐  ┌──────┐
>   │ APP1 │  │ APP2 │ ...
>   └──────┘  └──────┘
>   ┌──────────────────────────────────────────┐
>   │     {quic_client/server_handshake()}     │<─────────────┐
>   └──────────────────────────────────────────┘       ┌─────────────┐
>    {send/recvmsg()}      {set/getsockopt()}          │    tlshd    │
>    [CMSG handshake_info] [SOCKOPT_CRYPTO_SECRET]     └─────────────┘
>                          [SOCKOPT_TRANSPORT_PARAM_EXT]    │   ^
>                 │ ^                  │ ^                  │   │
>   Userspace     │ │                  │ │                  │   │
>   ──────────────│─│──────────────────│─│──────────────────│───│───────
>   Kernel        │ │                  │ │                  │   │
>                 v │                  v │                  v   │
>   ┌──────────────────┬───────────────────────┐       ┌─────────────┐
>   │ protocol, timer, │ socket (IPPROTO_QUIC) │<──┐   │ handshake   │
>   │                  ├───────────────────────┤   │   │netlink APIs │
>   │ common, family,  │ outqueue  |  inqueue  │   │   └─────────────┘
>   │                  ├───────────────────────┤   │      │       │
>   │ stream, connid,  │         frame         │   │   ┌─────┐ ┌─────┐
>   │                  ├───────────────────────┤   │   │     │ │     │
>   │ path, pnspace,   │         packet        │   │───│ SMB │ │ NFS │...
>   │                  ├───────────────────────┤   │   │     │ │     │
>   │ cong, crypto     │       UDP tunnels     │   │   └─────┘ └─────┘
>   └──────────────────┴───────────────────────┘   └──────┴───────┘
>
> - User Data Architecture:
>
>   ┌──────┐  ┌──────┐
>   │ APP1 │  │ APP2 │ ...
>   └──────┘  └──────┘
>    {send/recvmsg()}   {set/getsockopt()}              {recvmsg()}
>    [CMSG stream_info] [SOCKOPT_KEY_UPDATE]            [EVENT conn update]
>                       [SOCKOPT_CONNECTION_MIGRATION]  [EVENT stream update]
>                       [SOCKOPT_STREAM_OPEN/RESET/STOP]
>                 │ ^               │ ^                     ^
>   Userspace     │ │               │ │                     │
>   ──────────────│─│───────────────│─│─────────────────────│───────────
>   Kernel        │ │               │ │                     │
>                 v │               v │  ┌──────────────────┘
>   ┌──────────────────┬───────────────────────┐
>   │ protocol, timer, │ socket (IPPROTO_QUIC) │<──┐{kernel_send/recvmsg()}
>   │                  ├───────────────────────┤   │{kernel_set/getsockopt()}
>   │ common, family,  │ outqueue  |  inqueue  │   │{kernel_recvmsg()}
>   │                  ├───────────────────────┤   │
>   │ stream, connid,  │         frame         │   │   ┌─────┐ ┌─────┐
>   │                  ├───────────────────────┤   │   │     │ │     │
>   │ path, pnspace,   │         packet        │   │───│ SMB │ │ NFS │...
>   │                  ├───────────────────────┤   │   │     │ │     │
>   │ cong, crypto     │       UDP tunnels     │   │   └─────┘ └─────┘
>   └──────────────────┴───────────────────────┘   └──────┴───────┘
>
> Interface
> =========
>
> This implementation supports a mapping of QUIC into sockets APIs. Similar
> to TCP and SCTP, a typical Server and Client use the following system call
> sequence to communicate:
>
>     Client                             Server
>   ──────────────────────────────────────────────────────────────────────
>   sockfd = socket(IPPROTO_QUIC)      listenfd = socket(IPPROTO_QUIC)
>   bind(sockfd)                       bind(listenfd)
>                                      listen(listenfd)
>   connect(sockfd)
>   quic_client_handshake(sockfd)
>                                      sockfd = accept(listenfd)
>                                      quic_server_handshake(sockfd, cert)
>
>   sendmsg(sockfd)                    recvmsg(sockfd)
>   close(sockfd)                      close(sockfd)
>                                      close(listenfd)
>
> Please note that quic_client_handshake() and quic_server_handshake()
> functions are currently sourced from libquic [3]. These functions are
> responsible for receiving and processing the raw TLS handshake messages
> until the completion of the handshake process.
>
> For utilization by kernel consumers, it is essential to have tlshd
> service [2] installed and running in userspace. This service receives
> and manages kernel handshake requests for kernel sockets. In the kernel,
> the APIs closely resemble those used in userspace:
>
>     Client                             Server
>   ────────────────────────────────────────────────────────────────────────
>   __sock_create(IPPROTO_QUIC, &sock)  __sock_create(IPPROTO_QUIC, &sock)
>   kernel_bind(sock)                   kernel_bind(sock)
>                                       kernel_listen(sock)
>   kernel_connect(sock)
>   tls_client_hello_x509(args:{sock})
>                                       kernel_accept(sock, &newsock)
>                                       tls_server_hello_x509(args:{newsock})
>
>   kernel_sendmsg(sock)                kernel_recvmsg(newsock)
>   sock_release(sock)                  sock_release(newsock)
>                                       sock_release(sock)
>
> Please be aware that tls_client_hello_x509() and tls_server_hello_x509()
> are APIs from net/handshake/. They are used to dispatch the handshake
> request to the userspace tlshd service and subsequently block until the
> handshake process is completed.
>
> Use Cases
> =========
>
> - Samba
>
>   Stefan Metzmacher has integrated Linux QUIC into Samba for both client
>   and server roles [4].
>
> - tlshd
>
>   The tlshd daemon [2] facilitates Linux QUIC handshake requests from
>   kernel sockets. This is essential for enabling protocols like SMB
>   and NFS over QUIC.
>
> - curl
>
>   Linux QUIC is being integrated into curl [5] for HTTP/3. Example usage:
>
>   # curl --http3-only https://nghttp2.org:4433/
>   # curl --http3-only https://www.google.com/
>   # curl --http3-only https://facebook.com/
>   # curl --http3-only https://outlook.office.com/
>   # curl --http3-only https://cloudflare-quic.com/
>
> - httpd-portable
>
>   Moritz Buhl has deployed an HTTP/3 server over Linux QUIC [6] that is
>   accessible via Firefox and curl:
>
>   https://d.moritzbuhl.de/pub
>
> - NetPerfMeter
>
>   The latest NetPerfMeter release supports Linux QUIC and can be used to
>   run performance evaluations [10].
>
> Test Coverage
> =============
>
> The Coverage (gcov) of Functional and Interop Tests:
>
> https://d.moritzbuhl.de/lcov
>
> - Functional Tests
>
>   The libquic self-tests (make check) pass on all major architectures:
>   x86_64, i386, s390x, aarch64, ppc64le.
>
> - Interop tests
>
>   Interoperability was validated using the QUIC Interop Runner [7] against
>   all major userland QUIC stacks. Results are available at:
>
>   https://d.moritzbuhl.de/
>
> - Fuzzing via Syzkaller
>
>   Syzkaller has been running kernel fuzzing with QUIC for weeks using
>   tests/syzkaller/ in libquic [3].
>
> - Performance Testing
>
>   Performance was benchmarked using iperf [8] over a 100G NIC using
>   various MTUs and packet sizes:
>
>   - QUIC vs. kTLS:
>
>     UNIT        size:1024      size:4096      size:16384     size:65536
>     Gbits/sec   QUIC | kTLS    QUIC | kTLS    QUIC | kTLS    QUIC | kTLS
>     ────────────────────────────────────────────────────────────────────
>     mtu:1500    2.27 | 3.26    3.02 | 6.97    3.36 | 9.74    3.48 | 10.8
>     ────────────────────────────────────────────────────────────────────
>     mtu:9000    3.66 | 3.72    5.87 | 8.92    7.03 | 11.2    8.04 | 11.4
>
>   - QUIC(disable_1rtt_encryption) vs. TCP:
>
>     UNIT        size:1024      size:4096      size:16384     size:65536
>     Gbits/sec   QUIC | TCP     QUIC | TCP     QUIC | TCP     QUIC | TCP
>     ────────────────────────────────────────────────────────────────────
>     mtu:1500    3.09 | 4.59    4.46 | 14.2    5.07 | 21.3    5.18 | 23.9
>     ────────────────────────────────────────────────────────────────────
>     mtu:9000    4.60 | 4.65    8.41 | 14.0    11.3 | 28.9    13.5 | 39.2
>
>
>   The performance gap between QUIC and kTLS may be attributed to:
>
>   - The absence of Generic Segmentation Offload (GSO) for QUIC.
>   - An additional data copy on the transmission (TX) path.
>   - Extra encryption required for header protection in QUIC.
>   - A longer header length for the stream data in QUIC.
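
To put a number on the gap, the QUIC vs. kTLS figures quoted above can be
turned into throughput ratios. This is only an illustrative sketch: the
numbers are copied from the table, while the dictionary layout and the helper
name quic_to_ktls_ratio are made up for this example and are not part of
libquic or iperf.

```python
# Throughput in Gbits/sec, copied from the "QUIC vs. kTLS" table above.
# Keys are (mtu, message size); values are (QUIC, kTLS).
RESULTS = {
    (1500, 1024): (2.27, 3.26),
    (1500, 4096): (3.02, 6.97),
    (1500, 16384): (3.36, 9.74),
    (1500, 65536): (3.48, 10.8),
    (9000, 1024): (3.66, 3.72),
    (9000, 4096): (5.87, 8.92),
    (9000, 16384): (7.03, 11.2),
    (9000, 65536): (8.04, 11.4),
}


def quic_to_ktls_ratio(mtu, size):
    """Return QUIC throughput as a fraction of kTLS throughput (hypothetical helper)."""
    quic, ktls = RESULTS[(mtu, size)]
    return quic / ktls


ratios = {key: quic_to_ktls_ratio(*key) for key in RESULTS}

# Configuration where QUIC comes closest to kTLS, and where it trails most.
closest = max(ratios, key=ratios.get)
widest = min(ratios, key=ratios.get)

for (mtu, size), ratio in sorted(ratios.items()):
    print(f"mtu:{mtu:<5} size:{size:<6} QUIC/kTLS = {ratio:.2f}")
```

With these figures, QUIC comes closest to kTLS at mtu:9000 with 1024-byte
messages and trails most at mtu:1500 with 65536-byte messages.
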
>
> Patches
> =======
>
> Note: This implementation is organized into five parts and submitted across
> two patchsets for review. This patchset includes Parts 1–2, while Parts 3–5
> will be submitted in a subsequent patchset. For the complete series, see [9].
>
> 1. Infrastructure (2):
>
>   net: define IPPROTO_QUIC and SOL_QUIC constants
>   net: build socket infrastructure for QUIC protocol
>
> 2. Subcomponents (13):
>
>   quic: provide common utilities and data structures
>   quic: provide family ops for address and protocol
>   quic: provide quic.h header files for kernel and userspace
>   quic: add stream management
>   quic: add connection id management
>   quic: add path management
>   quic: add congestion control
>   quic: add packet number space
>   quic: add crypto key derivation and installation
>   quic: add crypto packet encryption and decryption
>   quic: add timer management
>   quic: add packet builder base
>   quic: add packet parser base
>
> 3. Data Processing (8):
>
>   quic: add frame encoder and decoder base
>   quic: implement outqueue transmission and flow control
>   quic: implement outqueue sack and retransmission
>   quic: implement inqueue receiving and flow control
>   quic: implement frame creation functions
>   quic: implement frame processing functions
>   quic: implement packet creation functions
>   quic: implement packet processing functions
>
> 4. Socket APIs (6):
>
>   quic: support bind/listen/connect/accept/close()
>   quic: support sendmsg() and recvmsg()
>   quic: support socket options related to interaction after handshake
>   quic: support socket options related to settings prior to handshake
>   quic: support socket options related to setup during handshake
>   quic: support socket ioctls and socket dump via procfs
>
> 5. Documentation and Selftests (3):
>
>   Documentation: describe QUIC protocol interface in quic.rst
>   quic: create sample test using handshake APIs for kernel consumers
>   selftests: net: add tests for QUIC protocol
>
> Notice: The QUIC module is currently labeled as "EXPERIMENTAL".
>
> All contributors are recognized in the respective patches with the tag of
> 'Signed-off-by:'. Special thanks to Moritz Buhl and Stefan Metzmacher whose
> practical use cases and insightful feedback have been instrumental in
> shaping the design and advancing the development.
>
> References
> ==========
>
> [1]  https://datatracker.ietf.org/doc/html/draft-lxin-quic-socket-apis
> [2]  https://github.com/oracle/ktls-utils
> [3]  https://github.com/lxin/quic
> [4]  https://gitlab.com/samba-team/samba/-/merge_requests/4019
> [5]  https://github.com/moritzbuhl/curl/tree/linux_curl
> [6]  https://github.com/moritzbuhl/httpd-portable
> [7]  https://github.com/quic-interop/quic-interop-runner
> [8]  https://github.com/lxin/iperf
> [9]  https://github.com/lxin/net-next/commits/quic/
> [10] https://www.nntb.no/~dreibh/netperfmeter/
>
> Changes in v2-v13: See individual patch changelogs for details.
>
> Xin Long (15):
>   net: define IPPROTO_QUIC and SOL_QUIC constants
>   net: build socket infrastructure for QUIC protocol
>   quic: provide common utilities and data structures
>   quic: provide family ops for address and protocol
>   quic: provide quic.h header files for kernel and userspace
>   quic: add stream management
>   quic: add connection id management
>   quic: add path management
>   quic: add congestion control
>   quic: add packet number space
>   quic: add crypto key derivation and installation
>   quic: add crypto packet encryption and decryption
>   quic: add timer management
>   quic: add packet builder base
>   quic: add packet parser base
>
>  Documentation/networking/ip-sysctl.rst |   39 +
>  MAINTAINERS                            |    9 +
>  include/linux/quic.h                   |   24 +
>  include/linux/socket.h                 |    1 +
>  include/uapi/linux/in.h                |    2 +
>  include/uapi/linux/quic.h              |  241 +++++
>  net/Kconfig                            |    1 +
>  net/Makefile                           |    1 +
>  net/quic/Kconfig                       |   35 +
>  net/quic/Makefile                      |    9 +
>  net/quic/common.c                      |  559 +++++++++++
>  net/quic/common.h                      |  220 +++++
>  net/quic/cong.c                        |  333 +++++++
>  net/quic/cong.h                        |  130 +++
>  net/quic/connid.c                      |  256 +++++
>  net/quic/connid.h                      |  182 ++++
>  net/quic/crypto.c                      | 1245 ++++++++++++++++++++++++
>  net/quic/crypto.h                      |   89 ++
>  net/quic/family.c                      |  452 +++++++++
>  net/quic/family.h                      |   44 +
>  net/quic/packet.c                      |  887 +++++++++++++++++
>  net/quic/packet.h                      |  120 +++
>  net/quic/path.c                        |  568 +++++++++++
>  net/quic/path.h                        |  190 ++++
>  net/quic/pnspace.c                     |  253 +++++
>  net/quic/pnspace.h                     |  201 ++++
>  net/quic/protocol.c                    |  418 ++++++++
>  net/quic/protocol.h                    |   63 ++
>  net/quic/socket.c                      |  477 +++++++++
>  net/quic/socket.h                      |  209 ++++
>  net/quic/stream.c                      |  416 ++++++++
>  net/quic/stream.h                      |  131 +++
>  net/quic/timer.c                       |  154 +++
>  net/quic/timer.h                       |   45 +
>  usr/include/Makefile                   |    1 +
>  35 files changed, 8005 insertions(+)
>  create mode 100644 include/linux/quic.h
>  create mode 100644 include/uapi/linux/quic.h
>  create mode 100644 net/quic/Kconfig
>  create mode 100644 net/quic/Makefile
>  create mode 100644 net/quic/common.c
>  create mode 100644 net/quic/common.h
>  create mode 100644 net/quic/cong.c
>  create mode 100644 net/quic/cong.h
>  create mode 100644 net/quic/connid.c
>  create mode 100644 net/quic/connid.h
>  create mode 100644 net/quic/crypto.c
>  create mode 100644 net/quic/crypto.h
>  create mode 100644 net/quic/family.c
>  create mode 100644 net/quic/family.h
>  create mode 100644 net/quic/packet.c
>  create mode 100644 net/quic/packet.h
>  create mode 100644 net/quic/path.c
>  create mode 100644 net/quic/path.h
>  create mode 100644 net/quic/pnspace.c
>  create mode 100644 net/quic/pnspace.h
>  create mode 100644 net/quic/protocol.c
>  create mode 100644 net/quic/protocol.h
>  create mode 100644 net/quic/socket.c
>  create mode 100644 net/quic/socket.h
>  create mode 100644 net/quic/stream.c
>  create mode 100644 net/quic/stream.h
>  create mode 100644 net/quic/timer.c
>  create mode 100644 net/quic/timer.h
>
> --
> 2.47.1
>
Sorry for the delay in checking the sashiko reports.

[sashiko-gemini] reported 1 important legitimate issue.
[sashiko-claude] reported 1 important and 4 minor legitimate issues.
(See the comments on the related patches.)

I will fix them in the next post.

^ permalink raw reply

* [PATCH net-next v2 0/2] netdev: expose page pool order via netlink
From: Dragos Tatulea @ 2026-06-12 21:17 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Donald Hunter, Andrew Lunn, Pavel Begunkov,
	Jens Axboe, Shuah Khan
  Cc: Dragos Tatulea, netdev, linux-kernel, io-uring, linux-kselftest

This small series exposes io_uring's high-order page configuration
via the page_pool netlink interface and updates the corresponding
selftest to check this value.
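
As a rough illustration of how a page order relates to a buffer length in
bytes, here is a minimal sketch. It assumes 4096-byte pages and that a valid
buffer length is a power-of-two multiple of the page size. The helper names
are hypothetical and are not taken from the kernel or the selftest.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages; other architectures differ


def order_to_buf_len(order, page_size=PAGE_SIZE):
    """Bytes covered by one buffer of the given page order (hypothetical helper)."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return page_size << order


def buf_len_to_order(buf_len, page_size=PAGE_SIZE):
    """Inverse mapping; rejects lengths that are not page_size * 2**n."""
    if buf_len < page_size or buf_len % page_size:
        raise ValueError("buffer length must be a multiple of the page size")
    pages = buf_len // page_size
    if pages & (pages - 1):
        raise ValueError("buffer length must span a power-of-two number of pages")
    return pages.bit_length() - 1


for order in range(4):
    print(f"order {order} -> {order_to_buf_len(order)} bytes")
```

Under these assumptions, an order-2 buffer corresponds to a length of 16384
bytes, so reporting the byte length carries the same information as the order
without requiring the reader to know the page size.
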

---
v2:
- Switched from exposing page_pool order to rx_buf_len via nl_fill of
  the io_uring memory provider.
- Updated selftest to check rx_buf_len.
- v1: https://lore.kernel.org/all/20260611161235.3807332-1-dtatulea@nvidia.com/
---
Dragos Tatulea (2):
  netdev: expose io_uring rx_page_order order via netlink
  io_uring/zcrx: selftests: verify rx_buf_len for large chunks

 Documentation/netlink/specs/netdev.yaml       |  9 ++++++-
 include/uapi/linux/netdev.h                   |  2 ++
 io_uring/zcrx.c                               |  8 ++++++
 tools/include/uapi/linux/netdev.h             |  2 ++
 .../selftests/drivers/net/hw/iou-zcrx.py      | 26 ++++++++++++++++++-
 5 files changed, 45 insertions(+), 2 deletions(-)

-- 
2.54.0


^ permalink raw reply

