Netdev List
* Re: [PATCH net v2] mac802154: llsec: add skb_cow_data() before in-place crypto
From: Stefan Schmidt @ 2026-06-19 20:47 UTC (permalink / raw)
  To: alex.aring, miquel.raynal, Doruk Tan Ozturk
  Cc: Stefan Schmidt, aleksander.lobakin, linux-wpan, netdev, security,
	stable
In-Reply-To: <20260526183726.56100-1-doruk@0sec.ai>

Hello Doruk Tan Ozturk.

On Tue, 26 May 2026 20:37:26 +0200, Doruk Tan Ozturk wrote:
> llsec_do_encrypt_unauth(), llsec_do_encrypt_auth(),
> llsec_do_decrypt_unauth(), and llsec_do_decrypt_auth() all perform
> in-place cryptographic transformations on skb data.  They build a
> scatterlist with sg_init_one() pointing into the skb's linear data area
> and then pass the same scatterlist as both src and dst to the crypto API
> (e.g. crypto_skcipher_encrypt/decrypt, crypto_aead_encrypt/decrypt).
> 
> [...]

Applied to wpan/wpan-next.git, thanks!

[1/1] mac802154: llsec: add skb_cow_data() before in-place crypto
      https://git.kernel.org/wpan/wpan-next/c/84a04eb5b210

regards,
Stefan Schmidt


* Re: [PATCH net] net: wwan: iosm: bound device offsets in the MUX downlink decoder
From: Loic Poulain @ 2026-06-19 20:48 UTC (permalink / raw)
  To: Maoyi Xie
  Cc: Sergey Ryazanov, Johannes Berg, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel,
	stable
In-Reply-To: <178185979029.4044562.9993615975949055530@maoyixie.com>

On Fri, Jun 19, 2026 at 11:03 AM Maoyi Xie <maoyixie.tju@gmail.com> wrote:
>
> mux_dl_adb_decode() walks a chain of aggregated datagram tables using
> offsets and lengths taken from the modem. first_table_index,
> next_table_index, table_length, datagram_index and datagram_length are
> all device supplied le values. Only first_table_index was checked, and
> only for being non zero. The decoder then formed adth = block +
> adth_index and read the table header and the datagram entries with no
> bound against the received skb. A modem that reports an index or a
> length past the downlink buffer makes the decoder read out of bounds.
>
> The buffer is IPC_MEM_MAX_DL_MUX_LITE_BUF_SIZE and skb->len is at most
> that, so skb->len is the real limit, but none of these in band offsets
> were checked against it.
>
> Validate every device offset and length against skb->len before use.
> The block header must fit. Each table header, on entry and after every
> next_table_index, must lie inside the skb. The datagram table must fit.
> Each datagram index and length must stay inside the skb. The header
> padding must not exceed the datagram length so the receive length does
> not wrap.
>
> This was reproduced under KASAN as a slab out of bounds read on a normal
> downlink receive once the iosm net device is up.
>
> Fixes: 1f52d7b62285 ("net: wwan: iosm: Enable M.2 7360 WWAN card support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
> ---
>  drivers/net/wwan/iosm/iosm_ipc_mux_codec.c | 23 ++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/wwan/iosm/iosm_ipc_mux_codec.c b/drivers/net/wwan/iosm/iosm_ipc_mux_codec.c
> index bff46f7ca59f..1c021bb0aa7a 100644
> --- a/drivers/net/wwan/iosm/iosm_ipc_mux_codec.c
> +++ b/drivers/net/wwan/iosm/iosm_ipc_mux_codec.c
> @@ -557,15 +557,21 @@ static int mux_dl_process_dg(struct iosm_mux *ipc_mux, struct mux_adbh *adbh,
>                                 < sizeof(struct mux_adbh))
>                         goto dg_error;
>
> -               /* Is the packet inside of the ADB */
> +               /* Is the packet inside of the ADB and the received skb ? */
>                 if (le32_to_cpu(dg->datagram_index) >=
> -                                       le32_to_cpu(adbh->block_length)) {
> +                                       le32_to_cpu(adbh->block_length) ||
> +                   le32_to_cpu(dg->datagram_index) >= skb->len ||
> +                   le16_to_cpu(dg->datagram_length) >
> +                           skb->len - le32_to_cpu(dg->datagram_index)) {

The logic is OK, but for readability I would suggest converting
dg->datagram_index and dg->datagram_length into intermediate
native-endian local variables (e.g. dg_index, dg_len). That makes the
if condition cleaner and avoids repeated conversions.
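
A minimal userspace sketch of that suggestion, not the driver code: it
assumes the two fields have already been converted once with
le32_to_cpu()/le16_to_cpu(), and the helper name dg_in_bounds() and its
parameters are hypothetical stand-ins for adbh->block_length and skb->len.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of the datagram bounds check. dg_index and dg_len stand
 * for the native-endian values of dg->datagram_index and
 * dg->datagram_length; block_len and skb_len stand for adbh->block_length
 * and skb->len. All names here are illustrative, not taken from the driver.
 */
static bool dg_in_bounds(uint32_t dg_index, uint16_t dg_len,
			 uint32_t block_len, uint32_t skb_len)
{
	/* The datagram must start inside both the ADB and the received skb. */
	if (dg_index >= block_len || dg_index >= skb_len)
		return false;

	/* dg_index < skb_len at this point, so the subtraction cannot wrap. */
	return dg_len <= skb_len - dg_index;
}
```

With the conversions done once up front, the condition reads as three
plain comparisons, and the order matters: the subtraction is only
evaluated after dg_index is known to be below skb_len.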


>                         goto dg_error;
>                 } else {
>                         packet_offset =
>                                 le32_to_cpu(dg->datagram_index) +
>                                 dl_head_pad_len;
>                         dg_len = le16_to_cpu(dg->datagram_length);
> +                       /* The header padding must not exceed the datagram. */
> +                       if (dl_head_pad_len >= dg_len)
> +                               goto dg_error;
>                         /* Pass the packet to the netif layer. */
>                         rc = ipc_mux_net_receive(ipc_mux, if_id, ipc_mux->wwan,
>                                                  packet_offset,
> @@ -595,6 +601,10 @@ static void mux_dl_adb_decode(struct iosm_mux *ipc_mux,
>         block = skb->data;
>         adbh = (struct mux_adbh *)block;
>
> +       /* The block header itself must fit in the received skb. */
> +       if (skb->len < sizeof(struct mux_adbh))
> +               goto adb_decode_err;
> +
>         /* Process the aggregated datagram tables. */
>         adth_index = le32_to_cpu(adbh->first_table_index);
>
> @@ -606,6 +616,11 @@ static void mux_dl_adb_decode(struct iosm_mux *ipc_mux,
>
>         /* Loop through mixed session tables. */
>         while (adth_index) {
> +               /* The table header must lie within the received skb. */
> +               if (adth_index < sizeof(struct mux_adbh) ||
> +                   adth_index > skb->len - sizeof(struct mux_adth))
> +                       goto adb_decode_err;
> +
>                 /* Get the reference to the table header. */
>                 adth = (struct mux_adth *)(block + adth_index);
>
> @@ -629,6 +644,10 @@ static void mux_dl_adb_decode(struct iosm_mux *ipc_mux,
>                 if (le16_to_cpu(adth->table_length) < sizeof(struct mux_adth))
>                         goto adb_decode_err;
>
> +               /* The whole datagram table must fit in the received skb. */
> +               if (le16_to_cpu(adth->table_length) > skb->len - adth_index)
> +                       goto adb_decode_err;
> +
>                 /* Calculate the number of datagrams. */
>                 nr_of_dg = (le16_to_cpu(adth->table_length) -
>                                         sizeof(struct mux_adth)) /
> --
> 2.34.1
>


* Re: [PATCH 1/2] bpf: preserve rx_queue_index across XDP redirects
From: bot+bpf-ci @ 2026-06-19 20:50 UTC (permalink / raw)
  To: siddharthcibi, ast
  Cc: siddharthcibi, daniel, davem, kuba, hawk, john.fastabend, sdf,
	edumazet, pabeni, horms, andrii, eddyz87, memxor, martin.lau,
	song, yonghong.song, jolsa, emil, netdev, bpf, linux-kernel, ast,
	andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai
In-Reply-To: <20260619195759.41254-2-siddharthcibi@icloud.com>

> Store rx_queue_index in struct xdp_frame during xdp_buff to
> xdp_frame conversion and restore it when rebuilding xdp_rxq_info
> for cpumap and devmap execution paths.
> This preserves ingress RX queue information for XDP programs
> executed after redirect, allowing access to the original
> rx_queue_index instead of losing queue context.
> Also propagate rx_queue_index for zero-copy XDP frame conversion.
> Signed-off-by: Siddharth_Cibi <siddharthcibi@icloud.com>

This looks like a fix for the cpumap path, where the synthetic
xdp_rxq_info was previously built without a queue_index, leaving the
placeholder:

kernel/bpf/cpumap.c:cpu_map_bpf_prog_run_xdp() {
	rxq.dev = xdpf->dev_rx;
	rxq.mem.type = xdpf->mem_type;
	/* TODO: report queue_index to xdp_rxq_info */
	...
}

That synthetic xdp_rxq_info, with the missing queue_index and the TODO,
came in when cpu_map_bpf_prog_run_xdp() was added.  Should this carry:

  Fixes: 9216477449f3 ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap")

The devmap path (dev_map_bpf_prog_run) shares the same missing
queue_index, which was added in fbee97feed9b ("bpf: Add support to attach
bpf program to a devmap entry").


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27846609854


* Re: [PATCH 2/2] selftests/bpf: validate rx_queue_index in xdp_metadata
From: bot+bpf-ci @ 2026-06-19 20:50 UTC (permalink / raw)
  To: siddharthcibi, ast
  Cc: siddharthcibi, daniel, davem, kuba, hawk, john.fastabend, sdf,
	andrii, eddyz87, memxor, martin.lau, song, yonghong.song, jolsa,
	emil, shuah, netdev, bpf, linux-kselftest, linux-kernel, ast,
	andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai
In-Reply-To: <20260619195759.41254-3-siddharthcibi@icloud.com>

> Extend xdp_metadata selftest coverage to validate that
> ctx->rx_queue_index is preserved and observable after XDP redirect
> execution.
>
> Capture rx_queue_index in metadata and assert that it matches the
> expected queue during packet verification.
>
> Signed-off-by: Siddharth_Cibi <siddharthcibi@icloud.com>

The changelog says the test validates that ctx->rx_queue_index is
"preserved and observable after XDP redirect execution".

Looking at the program in progs/xdp_metadata.c, rx() reads
ctx->rx_queue_index and then redirects into an XSKMAP:

	meta->rx_queue_index = ctx->rx_queue_index;
	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);

The AF_XDP delivery path copies the frame and metadata straight into
the umem and does not re-run an XDP program after the redirect, so the
value stored in meta->rx_queue_index is the one read in the original
ingress context rather than a value restored on a later re-execution.

Does the assert in verify_xsk_metadata() actually exercise a
redirect-restore path, or does it pass purely from the value captured
in the ingress program?

This isn't a bug, but could the changelog describe which redirect path
the test exercises, since the AF_XDP (XSKMAP) path does not re-run an
XDP program after the redirect?

> diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> index 5c31054ad4a4..f8cabbbe7bb7 100644
> --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> @@ -309,7 +309,8 @@ static int verify_xsk_metadata(struct xsk *xsk, bool sent_from_af_xdp)
>
>  	if (!ASSERT_NEQ(meta->rx_hash, 0, "rx_hash"))
>  		return -1;
> -
> +	if (!ASSERT_EQ(meta->rx_queue_index, QUEUE_ID, "rx_queue_index"))
> +        	return -1;

This isn't a bug, but the new return is indented with eight spaces
followed by a tab, while the other return -1; statements in
verify_xsk_metadata() use two tabs. Could it use tabs to match the
surrounding lines?

>  	if (!sent_from_af_xdp) {
>  		if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
>  			return -1;
> diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> index 09bb8a038d52..62ae83860d7f 100644
> --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> @@ -98,7 +98,7 @@ int rx(struct xdp_md *ctx)
>  	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
>  	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_proto,
>  				     &meta->rx_vlan_tci);
> -
> +	meta->rx_queue_index = ctx->rx_queue_index;
>  	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
>  }


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27846609854


* Re: [PATCH] ieee802154: fix kernel-infoleak in dgram_recvmsg()
From: Stefan Schmidt @ 2026-06-19 20:53 UTC (permalink / raw)
  To: syzkaller-bugs, Alexander Aring, David S. Miller, Eric Dumazet,
	Jakub Kicinski, linux-wpan, Miquel Raynal, netdev, Paolo Abeni,
	syzbot
  Cc: Stefan Schmidt, horms, linux-kernel, syzbot
In-Reply-To: <62795fd9-fc0c-48eb-bb82-05ffc5a57104@mail.kernel.org>

Hello syzbot.

On Wed, 27 May 2026 20:18:18 +0000, syzbot wrote:
> KMSAN reported a kernel-infoleak in move_addr_to_user():
> 
> BUG: KMSAN: kernel-infoleak in instrument_copy_to_user
> include/linux/instrumented.h:131 [inline]
> BUG: KMSAN: kernel-infoleak in _inline_copy_to_user
> include/linux/uaccess.h:205 [inline]
> BUG: KMSAN: kernel-infoleak in _copy_to_user+0xcc/0x120
> lib/usercopy.c:26
>  instrument_copy_to_user include/linux/instrumented.h:131 [inline]
>  _inline_copy_to_user include/linux/uaccess.h:205 [inline]
>  _copy_to_user+0xcc/0x120 lib/usercopy.c:26
>  copy_to_user include/linux/uaccess.h:236 [inline]
>  move_addr_to_user+0x2e7/0x440 net/socket.c:302
>  ____sys_recvmsg+0x232/0x610 net/socket.c:2925
>  ...
>  Uninit was stored to memory at:
>  ieee802154_addr_to_sa include/net/ieee802154_netdev.h:369 [inline]
>  dgram_recvmsg+0xa09/0xbe0 net/ieee802154/socket.c:739
> 
> [...]

Applied to wpan/wpan-next.git, thanks!

[1/1] ieee802154: fix kernel-infoleak in dgram_recvmsg()
      https://git.kernel.org/wpan/wpan-next/c/4db86f8ab11b

regards,
Stefan Schmidt


* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
From: David Woodhouse @ 2026-06-19 20:57 UTC (permalink / raw)
  To: Thomas Gleixner, John Stultz, Stephen Boyd, Miroslav Lichvar,
	Richard Cochran, linux-kernel, netdev
  Cc: Rodolfo Giometti, Alexander Gordeev
In-Reply-To: <87v7beb7s3.ffs@fw13>

On Fri, 2026-06-19 at 22:21 +0200, Thomas Gleixner wrote:
> On Fri, Jun 19 2026 at 16:34, David Woodhouse wrote:
> > On Fri, 2026-06-19 at 15:34 +0200, Thomas Gleixner wrote:
> > > 
> > > This formatting makes my brain hurt. Can you please split that out into
> > > a separate function?
> > 
> > Yep. There's also a potential error there — an *additional* discrepancy
> > comes from the enforced monotonicity that timekeeping_cycles_to_ns()
> > applies (the case where it just returns tkr->xtime_nsec >> tkr_shift).
> > 
> > I couldn't work out if I cared about the clocksource-is-non-monotonic
> > case, and even if I did, what I should do about it.
> 
> I think the right thing is just to ignore it.

Yeah, that was basically my conclusion; I had just meant to *mention*
it when posting the RFC.

> The problem is very narrow and mostly related to the historically badly
> synchronized TSC between sockets. The TSC_ADJUST fixup is obviously
> error prone as it adjusts only to the point where the error is no
> longer observable. But in the update transition phase it can result in
> time going backwards because the readout on the other CPU is slightly
> behind tk::tkr_mono::cycles_last. That happens only once in a while and
> we talk about a very low single digit number of TSC cycles.
> 
> > I also wasn't sure if this should be a new CLOCK_REALTIME_NONMONOTONIC
> > or something like that, such that e.g. PTP clients could *ask* for it.
> 
> Hell no!

That was not about the above clocksource nonsense; that was the
question of what the caller (in my example case, the vmclock PTP
snapshot) should *do* with the reported error value.

If I just unconditionally "correct" the CLOCK_REALTIME values then
that's arguably an ABI change. We're silently reporting something
*different* to what we did before.

Maybe that's OK... as I said, in the PPS case we can justify it and
just call it a bug fix?

Or maybe we want a way for callers (not of ktime_get_snapshot_id()
itself, but *their* callers) to *ask* for the "corrected" value
instead. I happened to call that CLOCK_REALTIME_NONMONOTONIC as a straw
man, just because monotonicity is *one* of the reasons why we present
the xtime values that we do, not always the raw "corrected" values.


* Re: [PATCH] mac802154: Prevent overwrite return code in mac802154_perform_association()
From: Stefan Schmidt @ 2026-06-19 20:58 UTC (permalink / raw)
  To: alex.aring, miquel.raynal, Robertus Diawan Chris
  Cc: Stefan Schmidt, davem, edumazet, kuba, pabeni, horms, linux-wpan,
	netdev, linux-kernel, linux-kernel-mentees, skhan, me
In-Reply-To: <20260602054133.470293-1-robertusdchris@gmail.com>

Hello Robertus Diawan Chris.

On Tue, 02 Jun 2026 12:41:33 +0700, Robertus Diawan Chris wrote:
> When assoc_status is not equal to IEEE802154_ASSOCIATION_SUCCESSFUL, the
> return value is set to either "-ERANGE" or "-EPERM", but this return
> value is overwritten with 0 after exiting the conditional scope.
> So, jump to the clear_assoc label to preserve the return value when
> assoc_status is not equal to IEEE802154_ASSOCIATION_SUCCESSFUL.
> 
> This is reported by Coverity Scan as "Unused value".
> 
> [...]

Applied to wpan/wpan-next.git, thanks!

[1/1] mac802154: Prevent overwrite return code in mac802154_perform_association()
      https://git.kernel.org/wpan/wpan-next/c/649147cb3f8b

regards,
Stefan Schmidt


* Re: [PATCH net 0/2] ieee802154: admin-gate legacy LLSEC dumps + un-deaden ADD/DEL
From: Stefan Schmidt @ 2026-06-19 21:06 UTC (permalink / raw)
  To: Alexander Aring, Miquel Raynal, Michael Bommarito
  Cc: Stefan Schmidt, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Phoebe Buckheister, linux-wpan, netdev,
	linux-kernel
In-Reply-To: <20260520141640.1149513-1-michael.bommarito@gmail.com>

Hello Michael Bommarito.

On Wed, 20 May 2026 10:16:38 -0400, Michael Bommarito wrote:
> The legacy IEEE802154_NL family (net/ieee802154/netlink.c) builds its
> ops table from two macros in net/ieee802154/ieee802154.h. IEEE802154_OP()
> sets .flags = GENL_ADMIN_PERM; IEEE802154_DUMP() sets no flags. Among
> the IEEE802154_DUMP() consumers are four LLSEC dump ops (LIST_KEY,
> LIST_DEV, LIST_DEVKEY, LIST_SECLEVEL), and the LLSEC_LIST_KEY dump
> handler at net/ieee802154/nl-mac.c emits the raw 16-byte AES-128
> keytable bytes (IEEE802154_ATTR_LLSEC_KEY_BYTES, .len = 16, copied
> verbatim from struct ieee802154_llsec_key.key) into the reply skb.
> The modern nl802154 family admin-gates the equivalent reads
> (NL802154_CMD_GET_SEC_KEY at net/ieee802154/nl802154.c:2978 with
> .flags = GENL_ADMIN_PERM) so the legacy interface is the open side.
> 
> [...]

Applied to wpan/wpan-next.git, thanks!

[1/2] ieee802154: admin-gate legacy LLSEC dump operations
      https://git.kernel.org/wpan/wpan-next/c/9c1e0b6d4947
[2/2] ieee802154: allow legacy LLSEC ADD/DEL ops to pass strict validation
      https://git.kernel.org/wpan/wpan-next/c/a6bfdfcc6711

regards,
Stefan Schmidt


* [PATCH v1 net] ipv4: fib: Don't ignore error route in local/main tables.
From: Kuniyuki Iwashima @ 2026-06-19 21:27 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

When CONFIG_IP_MULTIPLE_TABLES is enabled but no rule is added,
fib_lookup() performs route lookup directly on two tables.

Since the first lookup does not properly bail out, the result
of an error route in the merged local/main table could be
overwritten by another route in the default table:

  # unshare -n
  # ip link set lo up
  # ip route add 192.168.0.0/24 dev lo table 253
  # ip route add unreachable 192.168.0.0/24
  # ip route get 192.168.0.1
  192.168.0.1 dev lo table default uid 0
      cache <local>

Once a random rule is added, the error route is respected:

  # ip rule add table 0
  # ip rule del table 0
  # ip route get 192.168.0.1
  RTNETLINK answers: No route to host

Let's fix the inconsistent behaviour.

Fixes: f4530fa574df ("ipv4: Avoid overhead when no custom FIB rules are installed.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/ip_fib.h | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index a71a98505650..c63a3c4967ae 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -374,7 +374,7 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp,
 			     struct fib_result *res, unsigned int flags)
 {
 	struct fib_table *tb;
-	int err = -ENETUNREACH;
+	int err = -EAGAIN;
 
 	flags |= FIB_LOOKUP_NOREF;
 	if (net->ipv4.fib_has_custom_rules)
@@ -388,17 +388,16 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp,
 	if (tb)
 		err = fib_table_lookup(tb, flp, res, flags);
 
-	if (!err)
+	if (err != -EAGAIN)
 		goto out;
 
 	tb = rcu_dereference_rtnl(net->ipv4.fib_default);
 	if (tb)
 		err = fib_table_lookup(tb, flp, res, flags);
 
-out:
 	if (err == -EAGAIN)
 		err = -ENETUNREACH;
-
+out:
 	rcu_read_unlock();
 
 	return err;
-- 
2.55.0.rc0.786.g65d90a0328-goog
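
The control-flow change in the diff above can be sketched in userspace
as follows. This is a simplified model under stated assumptions, not the
kernel code: struct table_model and table_lookup() are hypothetical
stand-ins for struct fib_table and fib_table_lookup(), and the
unreachable route is assumed to surface as -EHOSTUNREACH, which matches
the "No route to host" message shown earlier.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a FIB table: it returns a canned result. */
struct table_model {
	int result;	/* 0 = route found, -EAGAIN = no match, other = error route */
};

static int table_lookup(const struct table_model *tb)
{
	return tb->result;
}

/* Pre-patch flow: only a successful (0) first lookup skips the default table. */
static int fib_lookup_old(const struct table_model *main_tb,
			  const struct table_model *default_tb)
{
	int err = -ENETUNREACH;

	if (main_tb)
		err = table_lookup(main_tb);
	if (!err)
		goto out;
	if (default_tb)
		err = table_lookup(default_tb);
out:
	if (err == -EAGAIN)
		err = -ENETUNREACH;
	return err;
}

/* Patched flow: any result other than "no match" (-EAGAIN) is final. */
static int fib_lookup_new(const struct table_model *main_tb,
			  const struct table_model *default_tb)
{
	int err = -EAGAIN;

	if (main_tb)
		err = table_lookup(main_tb);
	if (err != -EAGAIN)
		goto out;
	if (default_tb)
		err = table_lookup(default_tb);
	if (err == -EAGAIN)
		err = -ENETUNREACH;
out:
	return err;
}
```

In this model, an error route in the first table combined with a
matching route in the default table gives 0 from fib_lookup_old() and
-EHOSTUNREACH from fib_lookup_new(), which mirrors the reproducer.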



* Re: [PATCH net v2] eth: bnxt: improve the timing of stats
From: Michael Chan @ 2026-06-19 21:28 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
	pavan.chebbi
In-Reply-To: <20260619191538.104165-1-kuba@kernel.org>

On Fri, Jun 19, 2026 at 12:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Kernel selftests wait 1.25x of the promised stats refresh time
> (as read from ethtool -c). bnxt reports 1sec by default, but
> the stats update process has two steps. First device DMAs the
> new values, then the service task performs update in full-width
> SW counters. So the worst case delay is actually 2x.
>
> Note that the behavior is different for ring stats and port stats.
> Port stats are fetched synchronously by the service worker, so
> there's no risk of doubling up the delay there.
>
> The problem of stale stats impacts not only tests but real workloads
> which monitor egress bandwidth of a NIC. The inaccuracy causes double
> counting in the next cycle and spurious overload alarms.
>
> Try to read from the DMA buffer more aggressively, to mitigate
> timing issues between DMA and service task. The SW update should
> be cheap.
>
> Fixes: 51f307856b60 ("bnxt_en: Allow statistics DMA to be configurable using ethtool -C.")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Thanks.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>


* Re: [PATCH net] net: sit: require CAP_NET_ADMIN in the device netns for changelink
From: Kuniyuki Iwashima @ 2026-06-19 21:29 UTC (permalink / raw)
  To: Maoyi Xie
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Xiao Liang, Nicolas Dichtel, Kees Cook, netdev,
	linux-kernel, stable
In-Reply-To: <20260618070817.3378283-1-maoyixie.tju@gmail.com>

On Thu, Jun 18, 2026 at 12:08 AM Maoyi Xie <maoyixie.tju@gmail.com> wrote:
>
> ipip6_changelink() operates on at most two netns, dev_net(dev) and the
> tunnel link netns t->net. They differ once the device is created in or
> moved to a netns other than the one the request runs in. The rtnl
> changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a
> caller privileged there but not in t->net can rewrite a tunnel that
> lives in t->net.
>
> Gate ipip6_changelink() on rtnl_dev_link_net_capable() at its top,
> before any attribute is parsed. sit was the one tunnel type not covered
> by the recent series that added this check to the other changelink()
> handlers.
>
> Fixes: 5e6700b3bf98 ("sit: add support of x-netns")
> Link: https://lore.kernel.org/netdev/20260612085941.3158249-1-maoyixie.tju@gmail.com/
> Cc: stable@vger.kernel.org
> Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>


* Re: [PATCH net v2 01/10] rxrpc: input: reject ACKALL outside transmit phase
From: Jeffrey E Altman @ 2026-06-19 21:32 UTC (permalink / raw)
  To: David Howells, netdev
  Cc: Marc Dionne, Jakub Kicinski, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, linux-afs, linux-kernel, Wyatt Feng,
	stable, Yuan Tan, Yifan Wu, Juefei Pu, Zhengchuan Liang, Xin Liu,
	Ren Wei
In-Reply-To: <20260618134802.2477777-2-dhowells@redhat.com>

On 6/18/2026 9:47 AM, David Howells wrote:
> From: Wyatt Feng <bronzed_45_vested@icloud.com>
>
> rxrpc_input_ackall() accepts ACKALL packets without checking whether
> the call is in a state that can legitimately have outstanding transmit
> buffers.  A forged ACKALL can therefore reach a new service call in
> RXRPC_CALL_SERVER_RECV_REQUEST before any reply packets have been
> queued.
>
> In that state call->tx_top is zero and call->tx_queue is NULL, so
> rxrpc_rotate_tx_window() dereferences a NULL txqueue and triggers a
> null-pointer dereference.
>
> Fix rxrpc_input_ackall() to mirror the transmit-state gating already
> used for normal ACK processing, and ignore ACKALL when there is no
> outstanding transmit window to rotate.
>
> Fixes: b341a0263b1b ("rxrpc: Implement progressive transmission queue struct")
> Cc: stable@vger.kernel.org
> Reported-by: Yuan Tan <yuantan098@gmail.com>
> Reported-by: Yifan Wu <yifanwucs@gmail.com>
> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
> Reported-by: Xin Liu <bird@lzu.edu.cn>
> Assisted-by: Codex:GPT-5.4
> Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Marc Dionne <marc.dionne@auristor.com>
> cc: linux-afs@lists.infradead.org
> ---
>   net/rxrpc/input.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
> index ce761466b02d..37881dffa898 100644
> --- a/net/rxrpc/input.c
> +++ b/net/rxrpc/input.c
> @@ -1214,8 +1214,22 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb)
>   static void rxrpc_input_ackall(struct rxrpc_call *call, struct sk_buff *skb)
>   {
>   	struct rxrpc_ack_summary summary = { 0 };
> +	rxrpc_seq_t top = READ_ONCE(call->tx_top);
> +
> +	switch (__rxrpc_call_state(call)) {
> +	case RXRPC_CALL_CLIENT_SEND_REQUEST:
> +	case RXRPC_CALL_CLIENT_AWAIT_REPLY:
> +	case RXRPC_CALL_SERVER_SEND_REPLY:
> +	case RXRPC_CALL_SERVER_AWAIT_ACK:
> +		break;
> +	default:
> +		return;
> +	}
> +
> +	if (call->tx_bottom == top)
> +		return;
>   
> -	if (rxrpc_rotate_tx_window(call, call->tx_top, &summary))
> +	if (rxrpc_rotate_tx_window(call, top, &summary))
>   		rxrpc_end_tx_phase(call, false, rxrpc_eproto_unexpected_ackall);
>   }
>   

Wyatt,

Thank you for identifying the NULL pointer dereference, but I do not
believe the patch is correct from an RxRPC protocol perspective.

The rxrpc protocol is not formally standardized.  Linux rxrpc is a
clean-room implementation of the Transarc/IBM RxRPC protocol used by AFS 3.0.

I've been spelunking through old source code trees dating back to 
mid-1988.  The original usage of the ACKALL packet was a form of delayed 
acknowledgement only to be sent after all of the DATA packets inclusive 
of the LAST_PACKET had been received.  Your expectation of how the 
packet type is intended to be used is consistent with that behavior.

However, in Nov 1988 the DATA acknowledgement logic was altered in a 
backward incompatible manner.   Instead of immediately sending ACK 
packets in response to every DATA packet except when the final DATA 
packet inclusive of LAST_PACKET was received, the ACK packet usage was 
extended to permit delayed transmissions. From Nov 1988 onward ACK 
packets were scheduled to be sent with a 200ms delay unless the received 
DATA packet was a duplicate, out-of-sequence, out-of-window, etc OR 
unless the received DATA packet had the RX_REQUEST_ACK flag set.   The 
delayed ACKs replaced the ACKALL usage in the general case.

But it appears there was a bug introduced which resulted in the sending 
of arbitrary ACKALL packets at any point in the call lifetime.   This 
bug was not identified until Nov 2001 [OpenAFS 
db2ddfaf1b322710e1bd4edce6d7519157c3c9eb] at which point the sending of 
ACKALL packets was further restricted.  One of the reasons why the 
sending of ACKALL packets at arbitrary times was not identified as a 
problem for more than a decade is that ACKALL packets received when 
there were no transmitted packets waiting for acknowledgement had no 
impact on the call state.   If there were transmitted packets waiting 
for acknowledgement and they were successfully delivered, then the call 
continued successfully.

OpenAFS 1.6 pre-releases attempted to resume use of ACKALL packets as a 
performance enhancement only to revert the change because of 
compatibility problems.

I think the best change at this point would be to accept the ACKALL packets
without generating an error regardless of the call state. If there are 
transmitted DATA packets waiting for acknowledgement, acknowledge them.  
If there are DATA packets which have yet to be sent, leave them alone.  
Only complete the call in response to an ACKALL if the ACKALL is 
received by the acceptor (incoming call) and all DATA packets inclusive 
of LAST_PACKET have been transmitted at least once.
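
As a rough illustration of the handling proposed above, here is a
userspace sketch. It is an assumption-laden model, not rxrpc code:
struct call_model, its fields, and handle_ackall() are hypothetical
names, and sequence numbers are simplified to plain counters.

```c
#include <stdbool.h>
#include <stdint.h>

enum ackall_action {
	ACKALL_IGNORED,		/* nothing was waiting for acknowledgement */
	ACKALL_ACKED,		/* transmitted DATA packets were acknowledged */
	ACKALL_COMPLETE_CALL	/* acceptor side: transmit phase is finished */
};

/* Hypothetical model of the call state that matters for ACKALL. */
struct call_model {
	bool is_acceptor;	/* incoming (service) call */
	bool last_sent;		/* LAST_PACKET transmitted at least once */
	uint32_t tx_bottom;	/* highest sequence number acknowledged */
	uint32_t tx_top;	/* highest sequence number transmitted */
};

/* Accept ACKALL in any state; never raise a protocol error for it. */
static enum ackall_action handle_ackall(struct call_model *call)
{
	bool had_unacked = call->tx_bottom != call->tx_top;

	/* Acknowledge what has been transmitted; unsent data is untouched. */
	call->tx_bottom = call->tx_top;

	/* Only the acceptor completes, and only once LAST_PACKET went out. */
	if (call->is_acceptor && call->last_sent)
		return ACKALL_COMPLETE_CALL;

	return had_unacked ? ACKALL_ACKED : ACKALL_IGNORED;
}
```

An ACKALL that arrives before anything has been transmitted falls
through to ACKALL_IGNORED, so the NULL transmit queue case never reaches
a window rotation in this model.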

Sincerely,

Jeffrey Altman


* Re: [PATCH bpf] bpf, sockmap: fix lock inversion between stab->lock and sk_callback_lock
From: Alexei Starovoitov @ 2026-06-19 21:55 UTC (permalink / raw)
  To: John Fastabend, Sechang Lim
  Cc: Jiayuan Chen, Jakub Sitnicki, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Simon Horman, netdev, bpf,
	linux-kernel
In-Reply-To: <ajLR9CRn6O27Ound@john-p8>

On Wed Jun 17, 2026 at 9:59 AM PDT, John Fastabend wrote:
>
> The bot also thinks it found another locking issue. I'm not sure
> supporting 'tc' is really needed here. sockmap is much easier
> to reason about from the socket layer. What about just blocking sockmap
> manipulations from these prog types?
>
> My current thinking on sockmap at the moment is that it has sprawled
> across so many layers that the locking is overly tricky to reason about.
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d9bdc3b32c05..5e08d3e03453 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8567,11 +8567,7 @@ static bool may_update_sockmap(struct bpf_verifier_env *env, int func_id)
>                          return true;
>                  break;
>          case BPF_PROG_TYPE_SOCKET_FILTER:
> -       case BPF_PROG_TYPE_SCHED_CLS:
> -       case BPF_PROG_TYPE_SCHED_ACT:
> -       case BPF_PROG_TYPE_XDP:
>          case BPF_PROG_TYPE_SK_REUSEPORT:
> -       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>          case BPF_PROG_TYPE_SK_LOOKUP:
>                  return true;

+1. Let's disable.


* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Bryam Vargas @ 2026-06-19 22:17 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <ajS4BgnyzRsa7HVm@linux.alibaba.com>

On Fri, 19 Jun 2026 11:31:18 +0800, Dust Li wrote:
> I think we can decide after we see the real issue.

Here it is, as a truth table over the real smc_curs_diff. cons is fixed (the app
isn't reading), bytes_to_rcv is the running sum of per-CDC smc_curs_diff(prod_old,
prod_new), len = 65504:

  scenario                     b2r       count>=len  diff>len  occ>len  OOB no-clamp  OOB clamp
  honest steady / full / wrap  <= len    no          no        no       no            no
  attack single big diff       131007    no          yes       yes      yes           no
  attack count=len-1 wrapflip  327519    no          yes       yes      yes           no
  attack wrap++ count=0        327520    no          no        no       yes           no

Every attack row has count < len, so an input count check accepts it. The last
row is the one that matters: a peer that just increments prod.wrap with count=0
adds len to bytes_to_rcv every CDC, unbounded, and no cursor-level check sees it.
The per-CDC diff is exactly len, and smc_curs_diff(cons, prod) stays at len
because it can't see the wrap accumulation. The only thing that bounds it is
clamping bytes_to_rcv at the consumer. So #2 isn't subsumed by validating cursors
at the input -- the cursor view can't see the accumulator.

> should we also abort the connection like what we did in patch #1 ?

Yes for net-next. Two caveats: First, the detection
has to be on bytes_to_rcv itself, not on a cursor recompute -- the wrap++ row
walks past every cursor check, so an occupancy gate at the input wouldn't catch
it. Second, the abort supplements the clamp, it doesn't replace it: the clamp is
synchronous, the abort via queue_work isn't. The producer add runs in the tasklet
under bh_lock_sock, the consumer sub runs in smc_recvmsg under lock_sock which
drops the spinlock, so they race; between queue_work and abort_work running
smc_conn_kill, smc_recvmsg can read the inflated bytes_to_rcv and copy past the
RMB. The clamp at the consumer is what closes that window.
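
For illustration only, the consumer-side clamp amounts to something like the
userspace sketch below. The helper name rmb_readable() and the plain long
accumulator are assumptions made for this example; the kernel counter is an
atomic_t and the real change belongs in smc_rx_recvmsg(), so this is not the
v4 patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not kernel code: bound the running accumulator
 * to the ring size before it is ever used as a copy length. */
static size_t rmb_readable(long bytes_to_rcv, size_t rmb_len)
{
        assert(rmb_len > 0);    /* an empty ring is a caller bug in this sketch */

        if (bytes_to_rcv <= 0)
                return 0;
        if ((size_t)bytes_to_rcv > rmb_len)
                return rmb_len; /* inflated accumulator: clamp, do not trust */
        return (size_t)bytes_to_rcv;
}
```

With len = 65504, the wrap++ row's 327520 comes back as 65504, while an honest
value such as 1000 passes through unchanged.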

So v4: -stable keeps the consumer-side clamp on #2, and the same shape on #3 for
sndbuf_space and peer_rmbe_space -- no control-flow change. net-next keeps the
clamp and, when bytes_to_rcv goes over len (which an honest peer never does),
queues the abort the way patch #1 does. Patch #1 keeps its count-based abort for
the urgent index.

Bryam

The table above is this program (gcc -O2 -Wall -Wextra -fwrapv; self-checks, exit 0):

  #include <stdio.h>
  #include <stdint.h>
  typedef uint16_t u16; typedef uint32_t u32;
  union hc { struct { u16 reserved; u16 wrap; u32 count; }; };

  /* verbatim net/smc/smc_cdc.h:149-158 */
  static int smc_curs_diff(unsigned int size, const union hc *old, const union hc *new)
  {
          if (old->wrap != new->wrap) {
                  int v = (int)((size - old->count) + new->count);
                  return v > 0 ? v : 0;
          }
          { int v = (int)(new->count - old->count); return v > 0 ? v : 0; }
  }

  #define LEN 65504
  struct cur { u16 w; u32 c; };

  /* prod[]/cons[]: cursor positions after each CDC. honest=app drains so
   * occupancy stays <= len; attack=cons stuck. */
  static int run(const char *name, int honest, int n,
                 const struct cur *prod, const struct cur *cons)
  {
          union hc po = {0}, co = {0};
          long b2r = 0; int i, cnt_rej = 0, raw_rej = 0, occ_rej = 0, fail = 0;
          for (i = 0; i < n; i++) {
                  union hc p = { .wrap = prod[i].w, .count = prod[i].c };
                  union hc c = { .wrap = cons[i].w, .count = cons[i].c };
                  int dp = smc_curs_diff(LEN, &po, &p);
                  if (prod[i].c >= (u32)LEN) cnt_rej = 1;
                  if (dp > LEN) raw_rej = 1;
                  if (smc_curs_diff(LEN, &c, &p) > LEN) occ_rej = 1;
                  b2r += dp; b2r -= smc_curs_diff(LEN, &co, &c);
                  po = p; co = c;
          }
          int oob_noclamp = b2r > LEN;
          int oob_clamp   = (b2r > LEN ? LEN : b2r) > LEN;   /* always 0 */
          printf("  %-30s b2r=%-8ld cnt_rej=%d raw_rej=%d occ_rej=%d oob_noclamp=%d oob_clamp=%d\n",
                 name, b2r, cnt_rej, raw_rej, occ_rej, oob_noclamp, oob_clamp);
          if (honest) fail = (cnt_rej || raw_rej || occ_rej || oob_noclamp);
          else        fail = (oob_clamp || !oob_noclamp);
          return fail;
  }

  int main(void)
  {
          struct cur ps[][5] = {
                  {{0,5000}}, {{1,0}}, {{0,30000},{0,60000},{1,10000}},
                  {{1,LEN-1}},
                  {{1,LEN-1},{0,LEN-1},{1,LEN-1},{0,LEN-1}},
                  {{1,0},{2,0},{3,0},{4,0},{5,0}},
          };
          struct cur cs[][5] = {
                  {{0,4000}}, {{0,0}}, {{0,0},{0,30000},{0,50000}},
                  {{0,0}},
                  {{0,0},{0,0},{0,0},{0,0}},
                  {{0,0},{0,0},{0,0},{0,0},{0,0}},
          };
          const char *nm[] = { "honest: steady", "honest: full ring",
                  "honest: wrapping", "attack: single big diff",
                  "attack: count=len-1 wrapflip", "attack: wrap++ count=0" };
          int hon[] = { 1,1,1,0,0,0 };
          int nc[]  = { 1,1,3,1,4,5 };
          int i, fails = 0;
          for (i = 0; i < 6; i++)
                  fails += run(nm[i], hon[i], nc[i], ps[i], cs[i]);
          printf("RESULT: %s\n", fails ? "FAIL" : "PASS");
          return fails ? 1 : 0;
  }

(In-kernel KASAN confirming the over-read at count=65503 is available on request;
a small out-of-tree module driving the same smc_curs_diff over a real
rmb_desc->len allocation -- bytes_to_rcv 131007 -> 327519, slab-out-of-bounds in
the recv copy, clean with the clamp.)


^ permalink raw reply

* Re: [BUG] net: tcp: SO_LINGER with l_linger=0 leaks memory when closing sockets with pending send data
From: Ahmed, Aaron @ 2026-06-19 22:58 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: stable@vger.kernel.org, netdev@vger.kernel.org,
	ncardwell@google.com, edumazet@google.com
In-Reply-To: <CAAVpQUBtKBzq36Wz9p3MaHR=G10-NFBtQXgGW3S3QV5THW2iCg@mail.gmail.com>



Hi Kuniyuki,

Sorry to keep asking, but were you able to take a look at the updated reproducer? I can still reproduce the issue with the latest 6.18 LTS.

Thanks,
Aaron



^ permalink raw reply

* Re: [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Ilya Maximets @ 2026-06-19 22:58 UTC (permalink / raw)
  To: Gustavo A. R. Silva, netdev
  Cc: i.maximets, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kees Cook, Gustavo A. R. Silva,
	Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
	linux-kernel, linux-hardening, llvm, Johan Thomsen
In-Reply-To: <13b922ce-8450-48fd-adf7-5377989fb6e4@embeddedor.com>

On 6/18/26 6:02 AM, Gustavo A. R. Silva wrote:
> 
> 
> On 6/17/26 16:59, Gustavo A. R. Silva wrote:
>>
>>
>> On 6/17/26 16:01, Ilya Maximets wrote:
>>> On 6/17/26 10:08 PM, Gustavo A. R. Silva wrote:
>>>> Hi,
>>>>
>>>> On 6/16/26 04:03, Ilya Maximets wrote:
>>>>> kmalloc_flex() in metadata_dst_alloc() sets __counted_by for the
>>>>> structure to the options_len, which is then initialized to zero.
>>>>> Later, we're initializing the structure by copying the tunnel info
>>>>> together with the options, and this triggers a warning for a potential
>>>>> memcpy overflow, since the compiler estimates that the options can't
>>>>> fit into the structure, even though the memory for them is actually
>>>>> allocated.
>>>>>
>>>>>    memcpy: detected buffer overflow: 104 byte write of buffer size 96
>>>>>    WARNING: CPU: X PID: Y at lib/string_helpers.c:1036 __fortify_report
>>>>>     skb_tunnel_info_unclone+0x179/0x190
>>>>>     geneve_xmit+0x7fe/0xe00
>>>>
>>>> This warning has nothing to do with counted_by. See below for more
>>>> comments.
>>>>
>>>>>
>>>>> The issue is triggered when built with clang and source fortification.
>>>>>
>>>>> Fix that by doing the copy in two stages: first - the main data with
>>>>> the options_len, then the options.  This way the correct length should
>>>>> be known at the time of the copy.
>>>>>
>>>>> It would be better if the options_len never changed after allocation,
>>>>> but the allocation code is a little separate from the initialization
>>>>> and it would be awkward and potentially dangerous to return a struct
>>>>> with options_len set to a non-zero value from the metadata_dst_alloc().
>>>>>
>>>>> Another option would be to use ip_tunnel_info_opts_set(), but it is
>>>>> doing too many unnecessary operations for the use case here.
>>>>>
>>>>> Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
>>>>> Reported-by: Johan Thomsen <write@ownrisk.dk>
>>>>> Closes: https://lore.kernel.org/netdev/CAKv6aAM8_EWgXScnKmKYm_4SwGDVBK++dzfP+Y6msUXbp99QUw@mail.gmail.com/
>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>>> ---
>>>>>
>>>>> Johan, if you can test this one in your setup as well, that would
>>>>> be great.  Thanks.
>>>>>
>>>>>    include/net/dst_metadata.h | 7 +++++--
>>>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>>>>> index 1fc2fb03ce3f..f45d1e3163f0 100644
>>>>> --- a/include/net/dst_metadata.h
>>>>> +++ b/include/net/dst_metadata.h
>>>>> @@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>>>>>        if (!new_md)
>>>>>            return ERR_PTR(-ENOMEM);
>>>>> -    memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>>>>> -           sizeof(struct ip_tunnel_info) + md_size);
>>>>
>>>> What's going on here is that, internally, fortified memcpy() retrieves
>>>> the destination size via __builtin_dynamic_object_size() in mode 1.
>>>>
>>>> That is:
>>>>
>>>> __builtin_dynamic_object_size(&new_md->u.tun_info, 1)
>>>>
>>>> For the above case, Clang returns sizeof(new_md->u.tun_info) == 96.
>>>>
>>>> So the warning is reporting that 104 bytes don't fit in an object of
>>>> size 96 bytes, regardless of any counted_by annotation or allocation.
>>>
>>> Hmm.  Does __builtin_dynamic_object_size(&new_md->u.tun_info, 1) return
>>> 104 when the options_len is 8?  If so, isn't that because it is counted
>>> by that field?  Asking because the fortification doesn't complain if we
>>> keep the full 104-byte copy as-is, but set the options_len beforehand,
>>> as tested by Johan.
>>
>> I see. If that is the case, then, internally, fortified memcpy() ends up
>> using mode 0 instead of mode 1. Something like this:
>>
>> __builtin_dynamic_object_size(&new_md->u.tun_info, 0)
>>
>> The above will effectively consider the allocation and counted_by because
>> it will interpret new_md->u.tun_info as an open-ended object due to the
>> flexible-array member (in struct ip_tunnel_info) whose size is determined
>> by counted_by.
> 
> Indeed. The execution stops here:
> 
> fortify_memcpy_chk():
>         /*
>          * Always stop accesses beyond the struct that contains the
>          * field, when the buffer's remaining size is known.
>          * (The SIZE_MAX test is to optimize away checks where the buffer
>          * lengths are unknown.)
>          */
>         if (p_size != SIZE_MAX && p_size < size)
>                 fortify_panic(func, FORTIFY_WRITE, p_size, size, true);
> 
> with p_size = __builtin_dynamic_object_size(&new_md->u.tun_info, 0)
> 
> The code never reaches the part where p_size_field (__bdos(&new_md->u.tun_info, 1))
> is checked at runtime because there is no need for that.
> 
> So yep, this patch is okay as-is.

Ack.  Thanks for looking into this!
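
For readers following along, the check Gustavo quotes reduces to a one-line
predicate. The sketch below is a userspace model, not the kernel's
fortify_memcpy_chk(); whole_object_overflow() and thread_cases() are names
invented here, and the value 104 for the destination size once options_len is
8 is an assumption taken from the question earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Models only the whole-object test; p_size stands in for
 * __builtin_dynamic_object_size(dst, 0) and size for the copy length. */
static bool whole_object_overflow(size_t p_size, size_t size)
{
        return p_size != SIZE_MAX && p_size < size;
}

/* The cases discussed in the thread, written as self-checks. */
void thread_cases(void)
{
        /* options_len still 0: object looks 96 bytes, copy is 104. */
        assert(whole_object_overflow(96, 104));
        /* options_len set first (assumed to yield 104): copy fits. */
        assert(!whole_object_overflow(104, 104));
        /* Size unknown to the compiler: the check is skipped. */
        assert(!whole_object_overflow(SIZE_MAX, 104));
}
```

The first case corresponds to the reported "104 byte write of buffer size 96"
splat; the second is why setting options_len before the copy silences it.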

Best regards, Ilya Maximets.

^ permalink raw reply

* Re: [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Ilya Maximets @ 2026-06-19 22:59 UTC (permalink / raw)
  To: Johan Thomsen
  Cc: i.maximets, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kees Cook, Gustavo A. R. Silva,
	Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
	linux-kernel, linux-hardening, llvm
In-Reply-To: <CAKv6aAMTqSo0qSng2Kv4=i6BSvJuUy9KSfeFRYx5JsYuo9=kqQ@mail.gmail.com>

On 6/18/26 1:43 PM, Johan Thomsen wrote:
>> Johan, if you can test this one in your setup as well, that would
>> be great.  Thanks.
>>
>>  include/net/dst_metadata.h | 7 +++++--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>> index 1fc2fb03ce3f..f45d1e3163f0 100644
>> --- a/include/net/dst_metadata.h
>> +++ b/include/net/dst_metadata.h
>> @@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>>         if (!new_md)
>>                 return ERR_PTR(-ENOMEM);
>>
>> -       memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>> -              sizeof(struct ip_tunnel_info) + md_size);
>> +       /* Copy in two stages to keep the __counted_by happy. */
>> +       new_md->u.tun_info = md_dst->u.tun_info;
>> +       memcpy(ip_tunnel_info_opts(&new_md->u.tun_info),
>> +              ip_tunnel_info_opts(&md_dst->u.tun_info), md_size);
>> +
>>  #ifdef CONFIG_DST_CACHE
>>         /* Unclone the dst cache if there is one */
>>         if (new_md->u.tun_info.dst_cache.cache) {
> 
> Hi Ilya,
> 
> Sure. Just stressed it for 24 hours and - I cannot trigger the bug
> with this patch applied.

Thanks, Johan!
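
The two-stage copy can be tried outside the kernel with a toy structure.
struct tun_info_model and unclone_model() below are hypothetical stand-ins for
struct ip_tunnel_info and tun_dst_unclone(); the sketch only shows the
ordering (fixed part first, then the options), not the fortify behaviour.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model: a fixed header whose options_len describes the
 * flexible array that follows it. */
struct tun_info_model {
        unsigned char options_len;
        unsigned char options[];
};

static struct tun_info_model *unclone_model(const struct tun_info_model *src)
{
        struct tun_info_model *dst;

        assert(src != NULL);
        dst = calloc(1, sizeof(*dst) + src->options_len);
        if (!dst)
                return NULL;

        /* Stage 1: struct assignment copies only the fixed part, which
         * carries options_len; the flexible array is not included. */
        *dst = *src;
        /* Stage 2: the destination now advertises the right length, so
         * the options can be copied into the trailing storage. */
        memcpy(dst->options, src->options, src->options_len);
        return dst;
}
```

The caller owns the returned buffer and releases it with free().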

Best regards, Ilya Maximets.

^ permalink raw reply

* [PATCH bpf] bpf: tcp: Fix use-after-free in bpf_iter_tcp_established_batch()
From: Jose Fernandez (Anthropic) @ 2026-06-20  0:32 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Andrii Nakryiko,
	Yonghong Song, Martin KaFai Lau
  Cc: netdev, linux-kernel, bpf, Ben Cressey,
	Jose Fernandez (Anthropic)

reqsk_queue_hash_req() publishes a TCP_NEW_SYN_RECV request_sock onto
the ehash chain (via inet_ehash_insert(), which drops the bucket lock on
return) and only afterwards refcount_set()s rsk_refcnt to 3.

Lockless readers such as __inet_lookup_established() account for this by
using refcount_inc_not_zero(), but bpf_iter_tcp_established_batch() uses
plain sock_hold() while holding the bucket lock, on the assumption that
the lock guarantees sk_refcnt > 0. That assumption does not hold for
request_sock:

  CPU 0                                CPU 1
  -----                                -----
  tcp_conn_request()
   reqsk_queue_hash_req()
    inet_ehash_insert(req)
     spin_lock(bucket)
     __sk_nulls_add_node_rcu(req)      // rsk_refcnt == 0
     spin_unlock(bucket)
                                       bpf_iter_tcp_established_batch()
                                        spin_lock(bucket)
                                        sock_hold(req)   <-- addition on 0
                                        spin_unlock(bucket)
    refcount_set(&req->rsk_refcnt, 3)  // clobbers saturated value

which surfaces as:

  refcount_t: addition on 0; use-after-free.
  WARNING: lib/refcount.c:25 at refcount_warn_saturate+0x48/0x90, CPU#1
  Call Trace:
   bpf_iter_tcp_established_batch+0x14e/0x170
   bpf_iter_tcp_batch+0x53/0x200
   bpf_iter_tcp_seq_next+0x27/0x70
   bpf_seq_read+0x107/0x410
   vfs_read+0xb9/0x380

refcount_warn_saturate() then saturates the count, the publishing CPU's
refcount_set() clobbers it, and the socket is left one reference short.
When the last legitimate owner drops its reference the reqsk is freed
while still reachable, leading to use-after-free panics in e.g.
inet_csk_accept() or inet_csk_listen_stop().

This reproduces in seconds with tcp_syncookies=0, a handful of threads
doing connect()/close() to a local listener while others read an
iter/tcp link in a tight loop.

Use refcount_inc_not_zero() and skip the socket on failure, the same way
every other ehash walker does. The listening hash is unaffected as
listeners are always inserted into lhash2 with sk_refcnt >= 1, so
bpf_iter_tcp_listening_batch() is left as-is.
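
As a rough userspace illustration (not kernel code), the "increment unless
zero" idea can be modelled with C11 atomics. ref_inc_not_zero() and
batch_live() are hypothetical names for this sketch; the kernel's
refcount_inc_not_zero() additionally handles saturation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Take a reference only if the count is already non-zero. */
static bool ref_inc_not_zero(atomic_uint *ref)
{
        unsigned int old;

        assert(ref != NULL);
        old = atomic_load(ref);
        do {
                if (old == 0)
                        return false;   /* not (or no longer) live */
                /* On failure, old is refreshed with the current value. */
        } while (!atomic_compare_exchange_weak(ref, &old, old + 1));

        return true;
}

/* Mirrors the shape of the patched loop: entries whose count is still
 * 0 are skipped instead of being pinned with a plain increment. */
static unsigned int batch_live(atomic_uint *refs, unsigned int n)
{
        unsigned int i, taken = 0;

        for (i = 0; i < n; i++) {
                if (!ref_inc_not_zero(&refs[i]))
                        continue;
                taken++;
        }
        return taken;
}
```

An entry published with a count of 0 is left untouched, so a later store of
the initial count (3 in the reqsk case) does not overwrite a reference the
walker believes it holds.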

If every matching socket in a bucket is mid-init, end_sk can stay at 0;
advance to the next bucket in that case rather than terminating the
whole iteration on a stale batch[0].

Fixes: 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock")
Reviewed-by: Ben Cressey <ben@cressey.dev>
Assisted-by: Claude:unspecified
Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>
---
 net/ipv4/tcp_ipv4.c | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fdc81150ff6c..92342dcc6892 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3074,25 +3074,25 @@ static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq,
 {
 	struct bpf_tcp_iter_state *iter = seq->private;
 	struct hlist_nulls_node *node;
-	unsigned int expected = 1;
-	struct sock *sk;
+	unsigned int expected = 0;
+	struct sock *sk = *start_sk;
 
-	sock_hold(*start_sk);
-	iter->batch[iter->end_sk++].sk = *start_sk;
-
-	sk = sk_nulls_next(*start_sk);
 	*start_sk = NULL;
 	sk_nulls_for_each_from(sk, node) {
-		if (seq_sk_match(seq, sk)) {
-			if (iter->end_sk < iter->max_sk) {
-				sock_hold(sk);
-				iter->batch[iter->end_sk++].sk = sk;
-			} else if (!*start_sk) {
-				/* Remember where we left off. */
-				*start_sk = sk;
-			}
-			expected++;
+		if (!seq_sk_match(seq, sk))
+			continue;
+		if (iter->end_sk < iter->max_sk) {
+			/* reqsk_queue_hash_req() inserts with sk_refcnt == 0
+			 * and refcount_set()s it after the bucket lock drops.
+			 */
+			if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
+				continue;
+			iter->batch[iter->end_sk++].sk = sk;
+		} else if (!*start_sk) {
+			/* Remember where we left off. */
+			*start_sk = sk;
 		}
+		expected++;
 	}
 
 	return expected;
@@ -3129,6 +3129,7 @@ static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
 	struct sock *sk;
 	int err;
 
+again:
 	sk = bpf_iter_tcp_resume(seq);
 	if (!sk)
 		return NULL; /* Done */
@@ -3167,6 +3168,10 @@ static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
 	WARN_ON_ONCE(iter->end_sk != expected);
 done:
 	bpf_iter_tcp_unlock_bucket(seq);
+	if (unlikely(!iter->end_sk)) {
+		++iter->state.bucket;
+		goto again;
+	}
 	return iter->batch[0].sk;
 }
 

---
base-commit: 4549871118cf616eecdd2d939f78e3b9e1dddc48
change-id: 20260619-bpf-iter-tcp-refcnt-107d52b238da

Best regards,
--  
Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>


^ permalink raw reply related

* [PATCH net v2] net: phy: realtek: Clear MDIO_AN_10GBT_CTRL_ADV10G bit
From: Jan Klos @ 2026-06-20  1:19 UTC (permalink / raw)
  To: Heiner Kallweit, Andrew Lunn, Russell King, netdev
  Cc: Jan Klos, Maxime Chevallier, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Golle, Vladimir Oltean,
	Aleksander Jan Bajkowski, Markus Stockhausen, Jan Hoffmann,
	Issam Hamdi, Chukun Pan, Russell King (Oracle), ChunHao Lin,
	linux-kernel

On an RTL8127A connected to a link partner that advertises 10000baseT,
the speed cannot be changed to anything other than 10000baseT because
10GbE is always advertised regardless of any setting. Fix this by
adding the MDIO_AN_10GBT_CTRL_ADV10G bit to the mask passed to
phy_modify_mmd_changed() in rtl822x_config_aneg(), so that the bit is
cleared when 10GbE is not requested.
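
The effect of the mask can be seen in a small userspace model of a masked
read-modify-write. modify_masked() is a hypothetical stand-in for the register
update performed by phy_modify_mmd_changed(), and the bit values are
illustrative placeholders rather than authoritative definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values for the MDIO_AN_10GBT_CTRL_ADV* flags. */
#define ADV2_5G 0x0080u
#define ADV5G   0x0100u
#define ADV10G  0x1000u

/* Only bits inside mask are replaced by set; all others keep their
 * previous value. */
static uint16_t modify_masked(uint16_t old, uint16_t mask, uint16_t set)
{
        return (uint16_t)((old & ~mask) | set);
}

/* Register currently advertises 2.5G, 5G and 10G; the user asks for
 * 2.5G only. */
void advert_example(void)
{
        uint16_t old = ADV2_5G | ADV5G | ADV10G;
        uint16_t adv = ADV2_5G;

        /* Mask without ADV10G: the stale 10G bit survives. */
        assert(modify_masked(old, ADV2_5G | ADV5G, adv) == (ADV2_5G | ADV10G));
        /* Mask with ADV10G: the 10G bit is cleared as requested. */
        assert(modify_masked(old, ADV2_5G | ADV5G | ADV10G, adv) == ADV2_5G);
}
```

In this model the only difference between the two calls is the mask, which is
the same one-line change the patch makes.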

Fixes: 83d962316128 ("net: phy: realtek: add RTL8127-internal PHY")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Jan Klos <honza.klos@gmail.com>
---
v2: Patch formalities (rebase, tree name, tags, ccs)

 drivers/net/phy/realtek/realtek_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/realtek/realtek_main.c b/drivers/net/phy/realtek/realtek_main.c
index 27268811f564..b65d0f5fa1a0 100644
--- a/drivers/net/phy/realtek/realtek_main.c
+++ b/drivers/net/phy/realtek/realtek_main.c
@@ -1802,7 +1802,8 @@ static int rtl822x_config_aneg(struct phy_device *phydev)
 		ret = phy_modify_mmd_changed(phydev, MDIO_MMD_VEND2,
 					     RTL_MDIO_AN_10GBT_CTRL,
 					     MDIO_AN_10GBT_CTRL_ADV2_5G |
-					     MDIO_AN_10GBT_CTRL_ADV5G, adv);
+					     MDIO_AN_10GBT_CTRL_ADV5G |
+					     MDIO_AN_10GBT_CTRL_ADV10G, adv);
 		if (ret < 0)
 			return ret;
 	}
-- 
2.54.0


^ permalink raw reply related

* [PATCH bpf-next v5 0/3] bpf, sockmap: reject a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-20  2:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Jakub Sitnicki, Eduard Zingerman
  Cc: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Kumar Kartikeya Dwivedi, Simon Horman,
	Shuah Khan, Jiayuan Chen, Bobby Eshleman, netdev, bpf,
	linux-kselftest, linux-kernel

A BPF_PROG_TYPE_SK_SKB stream parser runs on strparser's message head,
which can chain skbs through frag_list. A parser that resizes the skb
frees the frag_list segments that strparser still tracks through
skb_nextp, leading to a use-after-free.

A stream parser is only meant to measure the next message, not to modify
the packet, so reject a packet-modifying parser at attach time.

v5:
 - target bpf-next instead of bpf
 - add Reviewed-by tag (Jiayuan Chen)

v4:
 - https://lore.kernel.org/all/20260619062959.3277612-1-rhkrqnwk98@gmail.com/

v3:
 - https://lore.kernel.org/all/20260618102718.2331468-1-rhkrqnwk98@gmail.com/

v2:
 - https://lore.kernel.org/all/20260612123553.2724240-1-rhkrqnwk98@gmail.com/

v1:
 - https://lore.kernel.org/all/20260609112316.3685738-1-rhkrqnwk98@gmail.com/

Sechang Lim (3):
  selftests/bpf: don't modify the skb in the strparser parser prog
  bpf, sockmap: reject a packet-modifying SK_SKB stream parser
  selftests/bpf: test rejection of a packet-modifying SK_SKB stream
    parser

 net/core/sock_map.c                           | 20 ++++++++++++
 .../selftests/bpf/prog_tests/sockmap_strp.c   | 31 +++++++++++++++++++
 .../selftests/bpf/progs/sockmap_parse_prog.c  | 22 -------------
 .../selftests/bpf/progs/test_sockmap_strp.c   |  7 +++++
 4 files changed, 58 insertions(+), 22 deletions(-)

-- 
2.43.0


^ permalink raw reply

* [PATCH bpf-next v5 1/3] selftests/bpf: don't modify the skb in the strparser parser prog
From: Sechang Lim @ 2026-06-20  2:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Jakub Sitnicki, Eduard Zingerman
  Cc: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Kumar Kartikeya Dwivedi, Simon Horman,
	Shuah Khan, Jiayuan Chen, Bobby Eshleman, netdev, bpf,
	linux-kselftest, linux-kernel
In-Reply-To: <20260620024423.4141004-1-rhkrqnwk98@gmail.com>

sockmap_parse_prog.c is attached as an SK_SKB stream parser and modifies
the skb: it calls bpf_skb_pull_data() and writes a byte into the packet.
A stream parser runs on strparser's message head and must not modify it.
A resize frees the frag_list segments strparser still tracks, leading to
a use-after-free.

Make the parser read-only. It only needs to return the message length,
which keeps it attaching once packet-modifying parsers are rejected.

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 .../selftests/bpf/progs/sockmap_parse_prog.c  | 22 -------------------
 1 file changed, 22 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c b/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
index c9abfe3a11af..56e9aebf05f2 100644
--- a/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
+++ b/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
@@ -5,28 +5,6 @@
 SEC("sk_skb1")
 int bpf_prog1(struct __sk_buff *skb)
 {
-	void *data_end = (void *)(long) skb->data_end;
-	void *data = (void *)(long) skb->data;
-	__u8 *d = data;
-	int err;
-
-	if (data + 10 > data_end) {
-		err = bpf_skb_pull_data(skb, 10);
-		if (err)
-			return SK_DROP;
-
-		data_end = (void *)(long)skb->data_end;
-		data = (void *)(long)skb->data;
-		if (data + 10 > data_end)
-			return SK_DROP;
-	}
-
-	/* This write/read is a bit pointless but tests the verifier and
-	 * strparser handler for read/write pkt data and access into sk
-	 * fields.
-	 */
-	d = data;
-	d[7] = 1;
 	return skb->len;
 }
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH bpf-next v5 2/3] bpf, sockmap: reject a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-20  2:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Jakub Sitnicki, Eduard Zingerman
  Cc: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Kumar Kartikeya Dwivedi, Simon Horman,
	Shuah Khan, Jiayuan Chen, Bobby Eshleman, netdev, bpf,
	linux-kselftest, linux-kernel
In-Reply-To: <20260620024423.4141004-1-rhkrqnwk98@gmail.com>

sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
to find the length of the next message. strparser assembles a message out
of several received skbs by chaining them onto the head's frag_list and
recording where to append the next one in strp->skb_nextp:

	*strp->skb_nextp = skb;
	strp->skb_nextp = &skb->next;

and then calls the parser on the head:

	len = (*strp->cb.parse_msg)(strp, head);

The parser is only meant to inspect the skb, but the program may call
bpf_skb_change_tail() -- or the sibling bpf_skb_pull_data(),
bpf_skb_change_head(), bpf_skb_adjust_room(), all allowed for SK_SKB.
Once the head carries a frag_list these go

	... -> skb_ensure_writable -> pskb_may_pull -> __pskb_pull_tail

and __pskb_pull_tail() frees the frag_list skbs that strparser still
tracks through skb_nextp:

	while ((list = skb_shinfo(skb)->frag_list) != insp) {
		skb_shinfo(skb)->frag_list = list->next;
		consume_skb(list);
	}

strp->skb_nextp now points into a freed sk_buff. The next segment of
the same message arrives in __strp_recv(), which links it with
*strp->skb_nextp = skb, an 8-byte write into the freed skb. The free
and the write happen in different __strp_recv() calls, so the message
has to span at least three segments before it triggers.

  BUG: KASAN: slab-use-after-free in __strp_recv+0x447/0xda0
  Write of size 8 at addr ffff88810db86140 by task repro/349

  Call Trace:
   <IRQ>
   __strp_recv+0x447/0xda0
   __tcp_read_sock+0x13d/0x590
   tcp_bpf_strp_read_sock+0x195/0x320
   strp_data_ready+0x267/0x340
   sk_psock_strp_data_ready+0x1ce/0x350
   tcp_data_queue+0x1364/0x2fd0
   tcp_rcv_established+0xe07/0x1640
   [...]

  Allocated by task 349:
   skb_clone+0x17b/0x210
   __strp_recv+0x2c3/0xda0
   __tcp_read_sock+0x13d/0x590
   [...]

  Freed by task 349:
   kmem_cache_free+0x150/0x570
   __pskb_pull_tail+0x57b/0xc20
   skb_ensure_writable+0x236/0x260
   __bpf_skb_change_tail+0x1d4/0x590
   sk_skb_change_tail+0x2a/0x40
   bpf_prog_1b285dcd6c41373e+0x27/0x30
   bpf_prog_run_pin_on_cpu+0xf3/0x260
   sk_psock_strp_parse+0x118/0x1e0
   __strp_recv+0x4f6/0xda0
   [...]

The same resize also leaves the head's length inconsistent with its
frags, so a later __pskb_pull_tail() can instead hit the
BUG_ON(skb_copy_bits(...)) in net/core/skbuff.c.

A stream parser is only meant to measure the next message, not to modify
the packet. Reject a parser whose program can change packet data
(prog->aux->changes_pkt_data) at attach time. The check is shared by
sock_map_prog_update() and sock_map_link_update_prog(), which between them
cover prog attach, link create and link update. Verdict programs are
unaffected and may still modify the skb.

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 net/core/sock_map.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 99e3789492a0..c60ba6d292f9 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1515,6 +1515,17 @@ static int sock_map_prog_link_lookup(struct bpf_map *map, struct bpf_prog ***ppr
 	return 0;
 }
 
+static int sock_map_prog_attach_check(enum bpf_attach_type attach_type,
+				      struct bpf_prog *prog)
+{
+	/* A stream parser must not modify the skb, only measure it. */
+	if (prog && attach_type == BPF_SK_SKB_STREAM_PARSER &&
+	    prog->aux->changes_pkt_data)
+		return -EINVAL;
+
+	return 0;
+}
+
 /* Handle the following four cases:
  * prog_attach: prog != NULL, old == NULL, link == NULL
  * prog_detach: prog == NULL, old != NULL, link == NULL
@@ -1533,6 +1544,10 @@ static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
 	if (ret)
 		return ret;
 
+	ret = sock_map_prog_attach_check(which, prog);
+	if (ret)
+		return ret;
+
 	/* for prog_attach/prog_detach/link_attach, return error if a bpf_link
 	 * exists for that prog.
 	 */
@@ -1776,6 +1791,11 @@ static int sock_map_link_update_prog(struct bpf_link *link,
 		ret = -EINVAL;
 		goto out;
 	}
+
+	ret = sock_map_prog_attach_check(link->attach_type, prog);
+	if (ret)
+		goto out;
+
 	if (!sockmap_link->map) {
 		ret = -ENOLINK;
 		goto out;
-- 
2.43.0


^ permalink raw reply related

* [PATCH bpf-next v5 3/3] selftests/bpf: test rejection of a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-20  2:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Jakub Sitnicki, Eduard Zingerman
  Cc: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S . Miller, Jakub Kicinski, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Kumar Kartikeya Dwivedi, Simon Horman,
	Shuah Khan, Jiayuan Chen, Bobby Eshleman, netdev, bpf,
	linux-kselftest, linux-kernel
In-Reply-To: <20260620024423.4141004-1-rhkrqnwk98@gmail.com>

Verify that attaching an SK_SKB stream parser that can modify the packet
is rejected, while a read-only parser still attaches.

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 .../selftests/bpf/prog_tests/sockmap_strp.c   | 31 +++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_strp.c   |  7 +++++
 2 files changed, 38 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
index 621b3b71888e..1d7231728eaf 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
@@ -431,6 +431,35 @@ static void test_sockmap_strp_verdict(int family, int sotype)
 	test_sockmap_strp__destroy(strp);
 }
 
+static void test_sockmap_strp_parser_reject(void)
+{
+	struct test_sockmap_strp *strp = NULL;
+	int parser_mod, parser_ro, link;
+	int err, map;
+
+	strp = test_sockmap_strp__open_and_load();
+	if (!ASSERT_OK_PTR(strp, "test_sockmap_strp__open_and_load"))
+		return;
+
+	map = bpf_map__fd(strp->maps.sock_map);
+	parser_mod = bpf_program__fd(strp->progs.prog_skb_parser_resize);
+	parser_ro = bpf_program__fd(strp->progs.prog_skb_parser);
+
+	err = bpf_prog_attach(parser_mod, map, BPF_SK_SKB_STREAM_PARSER, 0);
+	ASSERT_ERR(err, "bpf_prog_attach parser_mod");
+
+	link = bpf_link_create(parser_ro, map, BPF_SK_SKB_STREAM_PARSER, NULL);
+	if (!ASSERT_GE(link, 0, "bpf_link_create parser_ro"))
+		goto out;
+
+	err = bpf_link_update(link, parser_mod, NULL);
+	ASSERT_ERR(err, "bpf_link_update parser_mod");
+out:
+	if (link >= 0)
+		close(link);
+	test_sockmap_strp__destroy(strp);
+}
+
 void test_sockmap_strp(void)
 {
 	if (test__start_subtest("sockmap strp tcp pass"))
@@ -451,4 +480,6 @@ void test_sockmap_strp(void)
 		test_sockmap_strp_multiple_pkt(AF_INET, SOCK_STREAM);
 	if (test__start_subtest("sockmap strp tcp dispatch"))
 		test_sockmap_strp_dispatch_pkt(AF_INET, SOCK_STREAM);
+	if (test__start_subtest("sockmap strp parser reject pkt mod"))
+		test_sockmap_strp_parser_reject();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_strp.c b/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
index dde3d5bec515..fe88fa6d40bc 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
@@ -50,4 +50,11 @@ int prog_skb_parser_partial(struct __sk_buff *skb)
 	return 10;
 }
 
+SEC("sk_skb/stream_parser")
+int prog_skb_parser_resize(struct __sk_buff *skb)
+{
+	bpf_skb_change_tail(skb, skb->len, 0);
+	return skb->len;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.43.0



* Re: [PATCH 1/2] fs: Add bpf_sock_read_xattr() kfunc to read socket xattrs
From: Alexei Starovoitov @ 2026-06-20  3:20 UTC (permalink / raw)
  To: Christian Brauner, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann
  Cc: Alexander Viro, Jan Kara, Simon Horman, Kuniyuki Iwashima,
	Willem de Bruijn, linux-fsdevel, netdev, bpf, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa
In-Reply-To: <20260617-work-bpf-sock-xattr-v1-1-a1276f7c9da3@kernel.org>

On Wed Jun 17, 2026 at 4:18 AM PDT, Christian Brauner wrote:
> In c8db08110cbe ("Merge tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs")
> we added support for extended attributes for sockets. This comes in two
> flavors: sockfs and non-sockfs/filesystem sockets. Filesystem sockets
> are actual filesystem objects so reading xattrs must use dedicated fs
> helpers such as bpf_get_dentry_xattr() and bpf_get_file_xattr(). Those
> are inherently sleeping operations. Sockfs sockets on the other hand
> don't need to use sleeping operations as the underlying data structure
> is lockless. In addition, retrieval of sockfs extended attributes often
> happens from LSM hooks that only provide struct socket and it's
> completely nonsensical to grab a reference to a file, then force a
> sleeping operation to retrieve the xattr and drop the reference. We know
> that the sockfs file cannot go away while the LSM hook runs.
>
> This series adds a bpf_sock_read_xattr() kfunc that, given a struct
> socket, reads a user.* extended attribute from the socket's sockfs inode
> into a bpf_dynptr. Together with fsetxattr() from userspace this lets a
> process label a socket with a user.* xattr and have a BPF LSM program
> retrieve that label locklessly. The kfunc mirrors the existing
> bpf_cgroup_read_xattr(), including the restriction to the user.*
> namespace.
>
> systemd uses user.* xattrs on sockets to implement socket rate limiting
> and to tag sockets for other purposes [1] such as implementing a varlink
> registry. There is currently no efficient way for a BPF program to read
> those labels back. The new helper allows a listening socket marked with
> an extended attribute to be read back during bind/connect and then act
> on the connect()ing socket. Extended attributes make it possible to
> allow an unprivileged user manager such as systemd --user to mark
> sockets from userspace and then rediscover them or implement policies.
>
> The kfunc is registered KF_RCU and only for BPF LSM programs. A struct
> socket is only guaranteed to live in sockfs when an LSM socket hook hands
> it out, which is what keeps SOCK_INODE() valid. Sockets that embed struct
> socket outside sockfs (tun, tap) are only reachable from tracing programs
> and are excluded by the registration. (Btw, for consistency it would
> be nice to force allocation of struct socket from sockfs instead of
> simply embedding it in e.g., struct tun_file which makes the SOCKFS_I()
> pattern a hazard - at least outside of sockfs functions.)
>
> The read never sleeps and takes no lock. For sockfs the value lives in
> the inode's in-memory xattr store and simple_xattr_get() resolves it
> with an RCU-protected rhashtable lookup, taking neither the inode lock
> nor any xattr lock. The kfunc is therefore usable from both sleepable
> and non-sleepable LSM hooks.
>
> Link: https://github.com/systemd/systemd/pull/40559 [1]
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>  fs/bpf_fs_kfuncs.c  | 37 +++++++++++++++++++++++++++++++++++++
>  include/linux/net.h |  1 +
>  net/socket.c        | 25 +++++++++++++++++++++++++
>  3 files changed, 63 insertions(+)
>
> diff --git a/fs/bpf_fs_kfuncs.c b/fs/bpf_fs_kfuncs.c
> index 11841c3d4260..85fc9519d1ff 100644
> --- a/fs/bpf_fs_kfuncs.c
> +++ b/fs/bpf_fs_kfuncs.c
> @@ -11,6 +11,7 @@
>  #include <linux/file.h>
>  #include <linux/kernfs.h>
>  #include <linux/mm.h>
> +#include <linux/net.h>
>  #include <linux/xattr.h>
>  
>  __bpf_kfunc_start_defs();
> @@ -359,6 +360,39 @@ __bpf_kfunc int bpf_cgroup_read_xattr(struct cgroup *cgroup, const char *name__s
>  }
>  #endif /* CONFIG_CGROUPS */
>  
> +#ifdef CONFIG_NET
> +/**
> + * bpf_sock_read_xattr - read xattr of a socket's inode in sockfs
> + * @sock: socket to get xattr from
> + * @name__str: name of the xattr
> + * @value_p: output buffer of the xattr value
> + *
> + * Get xattr *name__str* of *sock* and store the output in *value_p*.
> + *
> + * For security reasons, only *name__str* with prefix "user." is allowed.
> + *
> + * Return: length of the xattr value on success, a negative value on error.
> + */
> +__bpf_kfunc int bpf_sock_read_xattr(struct socket *sock, const char *name__str,
> +				    struct bpf_dynptr *value_p)
> +{
> +	struct bpf_dynptr_kern *value_ptr = (struct bpf_dynptr_kern *)value_p;
> +	u32 value_len;
> +	void *value;
> +
> +	/* Only allow reading "user.*" xattrs */
> +	if (strncmp(name__str, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN))
> +		return -EPERM;
> +
> +	value_len = __bpf_dynptr_size(value_ptr);
> +	value = __bpf_dynptr_data_rw(value_ptr, value_len);
> +	if (!value)
> +		return -EINVAL;
> +
> +	return sock_read_xattr(sock, name__str, value, value_len);
> +}
> +#endif /* CONFIG_NET */
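
For reference, the "user." prefix gate in the quoted hunk can be exercised in userspace. This is only a minimal sketch: check_user_xattr_name() and the MODEL_* macros are hypothetical names for illustration, with the macro values assumed to match XATTR_USER_PREFIX and XATTR_USER_PREFIX_LEN from the uapi xattr header.

```c
#include <errno.h>
#include <string.h>

/* Assumed to mirror XATTR_USER_PREFIX{,_LEN}; hypothetical names. */
#define MODEL_XATTR_USER_PREFIX     "user."
#define MODEL_XATTR_USER_PREFIX_LEN (sizeof(MODEL_XATTR_USER_PREFIX) - 1)

/*
 * Hypothetical helper: same comparison as the kfunc's first check.
 * Returns 0 when the name starts with "user.", -EPERM otherwise.
 */
static int check_user_xattr_name(const char *name)
{
	if (strncmp(name, MODEL_XATTR_USER_PREFIX, MODEL_XATTR_USER_PREFIX_LEN))
		return -EPERM;
	return 0;
}
```

A name such as "user.label" passes, while "trusted.label" and a bare "user" (no dot) are rejected. The sketch covers only the name check, not the dynptr handling or the lookup itself.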

lgtm.
How do you want to route it? Through the vfs tree for the next merge window?
If so:
Acked-by: Alexei Starovoitov <ast@kernel.org>


* [PATCH bpf v2] bpf, sockmap: disallow update and delete from tc, xdp and flow_dissector
From: Sechang Lim @ 2026-06-20  3:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	David S . Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Stanislav Fomichev, Lorenz Bauer, Jiayuan Chen, bpf, linux-kernel,
	netdev

sock_map_update_common() and __sock_map_delete() hold stab->lock and call
sock_map_unref() -> sock_map_del_link(), which takes sk_callback_lock for
write. That gives the order stab->lock -> sk_callback_lock.

The reverse order comes from the SK_SKB stream parser.
sk_psock_strp_data_ready() holds sk_callback_lock for read and, after the
verdict, tcp_bpf_strp_read_sock() acks the consumed data inline via
__tcp_cleanup_rbuf(). The ACK goes out on egress, where a sched_cls
program deletes from the sockmap and takes stab->lock:

  WARNING: possible circular locking dependency detected
  7.1.0-rc6 Not tainted
  ------------------------------------------------------
  syz.9.8824 is trying to acquire lock:
  (&stab->lock){+.-.}-{3:3}, at: __sock_map_delete net/core/sock_map.c:421
  but task is already holding lock:
  (clock-AF_INET){++.-}-{3:3}, at: sk_psock_strp_data_ready net/core/skmsg.c:1173

  -> #1 (clock-AF_INET){++.-}-{3:3}:
         _raw_write_lock_bh
         sock_map_del_link net/core/sock_map.c:167
         sock_map_unref net/core/sock_map.c:184
         sock_map_update_common net/core/sock_map.c:509
         sock_map_update_elem_sys net/core/sock_map.c:588
         map_update_elem kernel/bpf/syscall.c:1805

  -> #0 (&stab->lock){+.-.}-{3:3}:
         _raw_spin_lock_bh
         __sock_map_delete net/core/sock_map.c:421
         sock_map_delete_elem net/core/sock_map.c:452
         bpf_prog_06044d24140080b6
         tcx_run net/core/dev.c:4451
         sch_handle_egress net/core/dev.c:4541
         __dev_queue_xmit net/core/dev.c:4808
         ...
         tcp_bpf_strp_read_sock net/ipv4/tcp_bpf.c:701
         strp_data_ready net/strparser/strparser.c:402
         sk_psock_strp_data_ready net/core/skmsg.c:1174
         tcp_data_queue net/ipv4/tcp_input.c:5661

  Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    rlock(clock-AF_INET);
                                 lock(&stab->lock);
                                 lock(clock-AF_INET);
    lock(&stab->lock);

   *** DEADLOCK ***
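
The inversion in the report can be modelled with a tiny pairwise lock-order tracker. This is a rough userspace sketch, not how lockdep is implemented: enum lock_class, order_seen and record_lock_order() are hypothetical names, and only direct A->B / B->A pairs are tracked (no transitive chains, no read/write distinction).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock classes for this sketch; names follow the report. */
enum lock_class { STAB_LOCK, SK_CALLBACK_LOCK, NR_LOCK_CLASSES };

/* order_seen[a][b] is true once some path took b while holding a. */
static bool order_seen[NR_LOCK_CLASSES][NR_LOCK_CLASSES];

/*
 * Record "inner acquired while outer is held". Returns true if the
 * reverse order was recorded earlier, i.e. the two paths can deadlock.
 */
static bool record_lock_order(enum lock_class outer, enum lock_class inner)
{
	bool inverted = order_seen[inner][outer];

	order_seen[outer][inner] = true;
	return inverted;
}
```

Recording the update path first (STAB_LOCK then SK_CALLBACK_LOCK) returns false; recording the parser path afterwards (SK_CALLBACK_LOCK then STAB_LOCK) returns true, which is the cycle shown in the report.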

A tc, xdp or flow_dissector program has no reason to update or delete a
sockmap, and redirect does not go through here. Drop these program types
from may_update_sockmap() so the verifier rejects such updates and
deletes. This also closes the matching sockhash inversion.

Fixes: 0126240f448d ("bpf: sockmap: Allow update from BPF")
Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
v2:
 - reject sockmap update/delete from tc, xdp and flow_dissector (John
   Fastabend)
 - fix the changelog (Jiayuan Chen)

v1:
 - https://lore.kernel.org/all/20260616091153.2966617-1-rhkrqnwk98@gmail.com/
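
For reviewers, the effect of the hunk below on the generic case group can be sketched as a standalone model. This is illustrative only: enum prog_type_model and may_update_sockmap_model() are hypothetical names, and the model covers just the program types visible in the hunk, not the rest of may_update_sockmap().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the BPF_PROG_TYPE_* values in the hunk. */
enum prog_type_model {
	MODEL_SOCKET_FILTER,
	MODEL_SCHED_CLS,
	MODEL_SCHED_ACT,
	MODEL_XDP,
	MODEL_SK_REUSEPORT,
	MODEL_FLOW_DISSECTOR,
	MODEL_SK_LOOKUP,
};

/* Models only the case group touched by the patch, after the patch. */
static bool may_update_sockmap_model(enum prog_type_model type)
{
	switch (type) {
	case MODEL_SOCKET_FILTER:
	case MODEL_SK_REUSEPORT:
	case MODEL_SK_LOOKUP:
		return true;
	default:
		/* tc (cls/act), xdp and flow_dissector now land here. */
		return false;
	}
}
```

With this model, the tc, xdp and flow_dissector types return false, while socket_filter, sk_reuseport and sk_lookup still return true.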

 kernel/bpf/verifier.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7fb88e1cd7c4..94d225521b5a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8766,11 +8766,7 @@ static bool may_update_sockmap(struct bpf_verifier_env *env, int func_id)
 			return true;
 		break;
 	case BPF_PROG_TYPE_SOCKET_FILTER:
-	case BPF_PROG_TYPE_SCHED_CLS:
-	case BPF_PROG_TYPE_SCHED_ACT:
-	case BPF_PROG_TYPE_XDP:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
-	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 	case BPF_PROG_TYPE_SK_LOOKUP:
 		return true;
 	default:
-- 
2.43.0

