Netdev List


* Re: [PATCH] net/sched: dualpi2: fix GSO backlog accounting
From: Xingquan Liu @ 2026-06-18 12:43 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, Jiri Pirko, Victor Nogueira, stable,
	Chia-Yu Chang (Nokia)
In-Reply-To: <CAM0EoMmXrZ5pUAkuVScgQjPFm3-dSC03mygDm3sAaFO=TQgvDw@mail.gmail.com>

On Wed, Jun 17, 2026 at 10:23:42AM -0400, Jamal Hadi Salim wrote:
> Do you know how to create a tdc test that will recreate this? If not
> either Victor or myself can help you create one.

Okay, I will try to create a tdc test.

--
Xingquan Liu

^ permalink raw reply

* [PATCH 5.10] net: 9p: fix refcount leak in p9_read_work() error handling
From: Alexander Martyniuk @ 2026-06-18 15:19 UTC (permalink / raw)
  To: stable, Greg Kroah-Hartman
  Cc: Alexander Martyniuk, Eric Van Hensbergen, Latchesar Ionkov,
	Dominique Martinet, David S. Miller, Jakub Kicinski,
	Tomas Bortoli, v9fs-developer, netdev, linux-kernel,
	Eric Van Hensbergen, Christian Schoenebeck, v9fs, lvc-project,
	Hangyu Hua

From: Hangyu Hua <hbh25y@gmail.com>

commit 4ac7573e1f9333073fa8d303acc941c9b7ab7f61 upstream.

p9_req_put() needs to be called when m->rreq->rc.sdata is NULL to avoid a
temporary refcount leak.

Link: https://lkml.kernel.org/r/20220712104438.30800-1-hbh25y@gmail.com
Fixes: 728356dedeff ("9p: Add refcount to p9_req_t")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
[Dominique: commit wording adjustments, p9_req_put argument fixes for rebase]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
[Alexander: this branch doesn't contain 8b11ff098af4 ("9p: Add client parameter
 to p9_req_put()"), therefore the parameter is removed from the added line]
Signed-off-by: Alexander Martyniuk <alexevgmart@gmail.com>
---
Backport fix for CVE-2022-50114
 net/9p/trans_fd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 40d458c438df..bd6a54e6f427 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -346,6 +346,7 @@ static void p9_read_work(struct work_struct *work)
 			p9_debug(P9_DEBUG_ERROR,
 				 "No recv fcall for tag %d (req %p), disconnecting!\n",
 				 m->rc.tag, m->rreq);
+			p9_req_put(m->rreq);
 			m->rreq = NULL;
 			err = -EIO;
 			goto error;
-- 
2.47.3

^ permalink raw reply related
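
The leak pattern fixed above can be illustrated outside the kernel. The
following is a minimal userspace sketch, not the 9p code: toy_req,
toy_req_get(), toy_req_put() and toy_read_work() are hypothetical names, a
plain int stands in for refcount_t, and -5 stands in for -EIO. It only shows
that a reference taken by a lookup has to be dropped on the early error path
as well as on the normal path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted request; not the real p9_req_t. */
struct toy_req {
	int refcount;	/* plain int here; the kernel uses refcount_t */
	void *sdata;	/* receive buffer, may be NULL */
};

static struct toy_req *toy_req_get(struct toy_req *req)
{
	req->refcount++;
	return req;
}

static void toy_req_put(struct toy_req *req)
{
	assert(req->refcount > 0);
	req->refcount--;
}

/*
 * Models a read worker: the lookup takes a reference, so every exit
 * path has to drop it, including the early error path.
 * Returns 0 on success and -5 (standing in for -EIO) on error.
 */
static int toy_read_work(struct toy_req *table_entry)
{
	struct toy_req *rreq = toy_req_get(table_entry);

	if (!rreq->sdata) {
		toy_req_put(rreq);	/* the kind of put the patch adds */
		return -5;
	}

	/* ... consume the reply, then drop the lookup reference ... */
	toy_req_put(rreq);
	return 0;
}
```

With the put in place, calling toy_read_work() on a request whose sdata is
NULL leaves the refcount where it started. Without it, each failed read
would leave one extra reference behind.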

* Re: [PATCH net 0/5] rxrpc: Miscellaneous fixes
From: David Howells @ 2026-06-18 12:01 UTC (permalink / raw)
  To: netdev
  Cc: dhowells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel
In-Reply-To: <20260616155749.2125907-1-dhowells@redhat.com>

I'm going to send a v2 of this patchset, so please don't apply.

David


^ permalink raw reply

* Re: [PATCH bpf v3 1/2] bpf, sockmap: fix use-after-free when the stream parser resizes the skb
From: Jiayuan Chen @ 2026-06-18 11:56 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, bpf, linux-kernel, Jakub Kicinski, Sechang Lim
In-Reply-To: <20260618102718.2331468-2-rhkrqnwk98@gmail.com>


On 6/18/26 6:27 PM, Sechang Lim wrote:
> sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
> to find the length of the next message. strparser assembles a message out
> of several received skbs by chaining them onto the head's frag_list and
> recording where to append the next one in strp->skb_nextp:
>
> 	*strp->skb_nextp = skb;
> 	strp->skb_nextp = &skb->next;
>
> and then calls the parser on the head:
>
> 	len = (*strp->cb.parse_msg)(strp, head);

[...]

> unaffected and may still modify the skb.
>
> Fixes: 8a31db561566 ("bpf: add access to sock fields and pkt data from sk_skb programs")

Is the Fixes tag correct?

Anyway, I don't think this patch is a fix; it's more of a hardening. So 
no Fixes tag needed, IMO.


> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
> ---
>   net/core/sock_map.c | 20 ++++++++++++++++++++
>   1 file changed, 20 insertions(+)
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 99e3789492a0..c60ba6d292f9 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -1515,6 +1515,17 @@ static int sock_map_prog_link_lookup(struct bpf_map *map, struct bpf_prog ***ppr
>   	return 0;
>   }
>   
> +static int sock_map_prog_attach_check(enum bpf_attach_type attach_type,
> +				      struct bpf_prog *prog)
> +{
> +	/* A stream parser must not modify the skb, only measure it. */
> +	if (prog && attach_type == BPF_SK_SKB_STREAM_PARSER &&
> +	    prog->aux->changes_pkt_data)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
>   /* Handle the following four cases:
>    * prog_attach: prog != NULL, old == NULL, link == NULL
>    * prog_detach: prog == NULL, old != NULL, link == NULL
> @@ -1533,6 +1544,10 @@ static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
>   	if (ret)
>   		return ret;
>   
> +	ret = sock_map_prog_attach_check(which, prog);
> +	if (ret)
> +		return ret;
> +
>   	/* for prog_attach/prog_detach/link_attach, return error if a bpf_link
>   	 * exists for that prog.
>   	 */
> @@ -1776,6 +1791,11 @@ static int sock_map_link_update_prog(struct bpf_link *link,
>   		ret = -EINVAL;
>   		goto out;
>   	}
> +
> +	ret = sock_map_prog_attach_check(link->attach_type, prog);
> +	if (ret)
> +		goto out;
> +
>   	if (!sockmap_link->map) {
>   		ret = -ENOLINK;
>   		goto out;


CI failed:
https://github.com/kernel-patches/bpf/actions/runs/27754218839/job/82113319982
    Failed stream parser bpf prog attach

Hi John
I noticed that bpf_skb_pull_data was added to the skmsg test:
https://github.com/torvalds/linux/commit/82a8616889d506cb690cfc0afb2ccadda120461d

Can we drop bpf_skb_pull_data from the parser prog (sockmap_parse_prog.c)?
And are there any scenarios where we need to modify skb len when using
strparser?



^ permalink raw reply

* Re: [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Johan Thomsen @ 2026-06-18 11:43 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kees Cook, Gustavo A. R. Silva,
	Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
	linux-kernel, linux-hardening, llvm
In-Reply-To: <20260616100332.1308294-1-i.maximets@ovn.org>

> Johan, if you can test this one in your setup as well, that would
> be great.  Thanks.
>
>  include/net/dst_metadata.h | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
> index 1fc2fb03ce3f..f45d1e3163f0 100644
> --- a/include/net/dst_metadata.h
> +++ b/include/net/dst_metadata.h
> @@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>         if (!new_md)
>                 return ERR_PTR(-ENOMEM);
>
> -       memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
> -              sizeof(struct ip_tunnel_info) + md_size);
> +       /* Copy in two stages to keep the __counted_by happy. */
> +       new_md->u.tun_info = md_dst->u.tun_info;
> +       memcpy(ip_tunnel_info_opts(&new_md->u.tun_info),
> +              ip_tunnel_info_opts(&md_dst->u.tun_info), md_size);
> +
>  #ifdef CONFIG_DST_CACHE
>         /* Unclone the dst cache if there is one */
>         if (new_md->u.tun_info.dst_cache.cache) {

Hi Ilya,

Sure. I stressed it for 24 hours and could not trigger the bug
with this patch applied.

BR
Johan

^ permalink raw reply
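
For readers unfamiliar with the two-stage copy in the patch quoted above,
here is a minimal userspace sketch of the same idea. It is an illustration
under assumptions, not the kernel code: toy_info, toy_info_alloc() and
toy_info_clone() are hypothetical names, and there is no __counted_by
annotation or fortified memcpy() involved. The point is only that the
fixed-size header and the trailing variable-length options are copied
separately, so neither copy spans more than the object it names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a header followed by variable-length options. */
struct toy_info {
	unsigned int key;
	unsigned char options_len;
	unsigned char options[];	/* flexible array member */
};

static struct toy_info *toy_info_alloc(unsigned char options_len)
{
	struct toy_info *info = calloc(1, sizeof(*info) + options_len);

	if (info)
		info->options_len = options_len;
	return info;
}

/*
 * Two-stage clone: struct assignment copies the fixed-size header only,
 * then memcpy() copies the options through the array itself.
 */
static struct toy_info *toy_info_clone(const struct toy_info *src)
{
	struct toy_info *dst;

	assert(src != NULL);
	dst = toy_info_alloc(src->options_len);
	if (!dst)
		return NULL;

	*dst = *src;						/* stage 1: header */
	memcpy(dst->options, src->options, src->options_len);	/* stage 2: options */
	return dst;
}
```

A single memcpy() of sizeof(header) + options_len starting at the header
copies the same bytes, but a bounds-checking memcpy() that knows the size of
the destination field may flag it as an overflow, which is the false
positive the patch title refers to.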

* Re: [PATCH net-next 0/2] appletalk: move the protocol out of tree
From: Andrew Lunn @ 2026-06-18 11:23 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Finn Thain, Carsten Strotmann, Jakub Kicinski,
	John Paul Adrian Glaubitz, davem, netdev, edumazet, pabeni,
	andrew+netdev, horms, chleroy, npiggin, mpe, maddy, linux-mips,
	linux-m68k, linuxppc-dev
In-Reply-To: <CAMuHMdU0em2r-SixT_+EpWJnm4f0g8mReYKBXOw42=HGb_T8WQ@mail.gmail.com>

On Thu, Jun 18, 2026 at 10:13:08AM +0200, Geert Uytterhoeven wrote:
> Hi Andrew,
> 
> On Thu, 18 Jun 2026 at 10:01, Andrew Lunn <andrew@lunn.ch> wrote:
> > If the appletalk community can take the workload off the top level
> > maintainers, respond to all patches within 2 to 3 days, give
> > Reviewed-by, or make change requests, it can probably stay in the
> > Mainline kernel. Otherwise it will move out of tree.
> 
> "2 or 3 days" is rather short.  If we would have to move all code
> maintained by people who cannot respond to all patches within 2 to
> 3 days out of the mainline kernel, you'd end up with a networking
> subsystem without supporting OS ;-)

I do agree that every subsystem is different, but that is the speed
netdev goes, often faster. There are around 150 patches a day
submitted, and in order to not drown in those patches, they need to be
processed fast.

It is however known for a sub-subsystem to move out of netdev to a
mailing list and a git tree of its own, and just send git pull
requests to netdev. It can then move at its own speed.

	Andrew


^ permalink raw reply

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Sebastian Andrzej Siewior @ 2026-06-18 11:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jakub Kicinski, Petr Mladek, John Ogness, Sergey Senozhatsky,
	Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260616170257.GH49951@noisy.programming.kicks-ass.net>

On 2026-06-16 19:02:57 [+0200], Peter Zijlstra wrote:
> On Tue, Jun 16, 2026 at 12:35:29PM +0200, Sebastian Andrzej Siewior wrote:
> 
> > So this is not an issue since commit 7eab73b18630e ("netconsole: convert
> > to NBCON console infrastructure"), because from then on writes are
> > deferred to the nbcon thread. So this is purely about -stable in this case.
> 
> Hmm, I thought netconsole had some reserved skbs and could do writes
> 'atomic' like? That said, it was 2.6 era the last time I looked at
> netconsole.

Let's look at 8250 for a second in this scenario.
serial8250_console_write() -> uart_port_lock_irqsave(). The uart lock is
a spinlock_t. lockdep does not complain because printk annotates it: with
RT, NBCON is mandatory and this path is not used.
serial8250_console_write() -> serial8250_modem_status() does a
wake_up_interruptible(). Even if not here, it is used under the port
lock so eventually lockdep will see it and complain about rq lock vs
port lock ordering.

> > Now. The scheduler usually does printk_deferred() because of the rq lock
> > so it does not deadlock for various reasons. It is kind of a pity that
> > the various WARN macros don't do that.
> 
> People have tried, last time was here:
> 
>   https://lkml.kernel.org/r/20260611074344.GG48970@noisy.programming.kicks-ass.net
> 
> and I hate deferred with a passion. It means you'll never see the
> message when you wreck the machine.

Oh, I do hate them, too. Maybe not as much because I spread my hate
evenly across the code. I did *miss* output on RT because the box
crashed before sending output so hate is here.

> > We could add printk_deferred_enter/exit() to all the rq_lock() variants.
> > I think PeterZ loves this the most. And Greg will appreciate it too
> > while backporting because of all the context changes.
> 
> No, not going to happen, ever, sorry. Instead printk should delete
> console sem and have printk() itself be atomic safe.

That was not meant seriously, only as a possibility.

> As stated, printk deferred is an abomination and needs to die a horrible
> painful death.
> 
> As described here:
> 
>   https://lkml.kernel.org/r/20260611191922.GK187714@noisy.programming.kicks-ass.net
> 
> "So printk should:
> 
>  - stick msg in buffer (lockless)
>  - print to atomic consoles (lockless)
>  - use irq_work to wake console kthreads (lockless)
>  - each kthread then tries to flush buffer to its own non-atomic console
>    in non-atomic context."

So we do this with nbcon afaik and this is the plan forward. The 8250 is
stuck behind broken flow control that John works tirelessly on fixing
before the 8250 can move over to the nbcon land. At some point it might
be possible to force-thread legacy consoles as we do it on RT or remove
them due to no users.

However until then and for stable I do suggest the following:

diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 09e8eccee8ed9..9cba16474cb6e 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -115,6 +115,17 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 })
 #endif
 
+#define WARN_ON_DEFERRED(condition) ({						\
+	int __ret_warn_on = !!(condition);				\
+	if (unlikely(__ret_warn_on)) {					\
+		printk_deferred_enter();				\
+		__WARN_FLAGS(#condition,				\
+			     BUGFLAG_TAINT(TAINT_WARN));		\
+		printk_deferred_exit();					\
+	}								\
+	unlikely(__ret_warn_on);					\
+})
+
 #ifndef WARN_ON_ONCE
 #define WARN_ON_ONCE(condition) ({					\
 	int __ret_warn_on = !!(condition);				\
@@ -125,6 +136,18 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 	unlikely(__ret_warn_on);					\
 })
 #endif
+
+#define WARN_ON_ONCE_DEFERRED(condition) ({				\
+	int __ret_warn_on = !!(condition);				\
+	if (unlikely(__ret_warn_on)) {					\
+		printk_deferred_enter();				\
+		__WARN_FLAGS(#condition,				\
+			     BUGFLAG_ONCE |				\
+			     BUGFLAG_TAINT(TAINT_WARN));		\
+		printk_deferred_exit();				\
+	}								\
+	unlikely(__ret_warn_on);					\
+})
 #endif /* __WARN_FLAGS */
 
 #if defined(__WARN_FLAGS) && !defined(__WARN_printf)
@@ -159,6 +182,18 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 })
 #endif
 
+#ifndef WARN_ON_DEFERRED
+#define WARN_ON_DEFERRED(condition) ({					\
+	int __ret_warn_on = !!(condition);				\
+	if (unlikely(__ret_warn_on)) {					\
+		printk_deferred_enter();				\
+		__WARN();						\
+		printk_deferred_exit();					\
+	}								\
+	unlikely(__ret_warn_on);					\
+})
+#endif
+
 #ifndef WARN
 #define WARN(condition, format...) ({					\
 	int __ret_warn_on = !!(condition);				\
@@ -180,6 +215,11 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 	DO_ONCE_LITE_IF(condition, WARN_ON, 1)
 #endif
 
+#ifndef WARN_ON_ONCE_DEFERRED
+#define WARN_ON_ONCE_DEFERRED(condition)				\
+	DO_ONCE_LITE_IF(condition, WARN_ON_DEFERRED, 1)
+#endif
+
 #ifndef WARN_ONCE
 #define WARN_ONCE(condition, format...)				\
 	DO_ONCE_LITE_IF(condition, WARN, 1, format)
@@ -215,7 +255,9 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 })
 #endif
 
+#define WARN_ON_DEFERRED(condition) WARN_ON(condition)
 #define WARN_ON_ONCE(condition) WARN_ON(condition)
+#define WARN_ON_ONCE_DEFERRED(condition) WARN_ON(condition)
 #define WARN_ONCE(condition, format...) WARN(condition, format)
 #define WARN_TAINT(condition, taint, format...) WARN(condition, format)
 #define WARN_TAINT_ONCE(condition, taint, format...) WARN(condition, format)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ebec186f9823..439379e6a83de 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5814,7 +5814,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* in !on_rq case, update occurred at dequeue */
 		update_load_avg(cfs_rq, prev, 0);
 	}
-	WARN_ON_ONCE(cfs_rq->curr != prev);
+	WARN_ON_ONCE_DEFERRED(cfs_rq->curr != prev);
 	cfs_rq->curr = NULL;
 }
 

This, plus the other occurrences in sched under rq lock.

If I replace the above WARN_ON_ONCE with
  WARN_ON_ONCE(system_state >= SYSTEM_RUNNING);

then my box fails to boot, which means the warning seems harmful as of
today. The disgusting _DEFERRED workaround gets the box to boot until
we are in nbcon land.

Sebastian

^ permalink raw reply related

* Re: [RESEND PATCH v1] net: dsa: motorcomm: add yt92xx dsa driver
From: Andrew Lunn @ 2026-06-18 11:10 UTC (permalink / raw)
  To: Kyle Switch
  Cc: David Yang, olteanv, davem, edumazet, kuba, pabeni, horms, netdev,
	linux-kernel, ming.xu, xiaolin.xu, jianmin.wang, de.ge
In-Reply-To: <39b79f5b-3e13-4620-83ba-b2ef991acca9@motor-comm.com>

> >>>>  #define ETH_P_QINQ1    0x9100          /* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>>>  #define ETH_P_QINQ2    0x9200          /* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>>>  #define ETH_P_QINQ3    0x9300          /* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>>> -#define ETH_P_YT921X   0x9988          /* Motorcomm YT921x DSA [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>>> +#define ETH_P_YT92XX   0x9988          /* Motorcomm YT92xx DSA [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>>>  #define ETH_P_EDSA     0xDADA          /* Ethertype DSA [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>>>  #define ETH_P_DSA_8021Q        0xDADB          /* Fake VLAN Header for DSA [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>>>  #define ETH_P_DSA_A5PSW        0xE001          /* A5PSW Tag Value [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>>
> >>> UAPI stands for User-space API. Do not change it unless there is a
> >>> very very good reason.
> >>>
> >>
> >> Ans: The default TPID for both yt921x and yt922x is 0x9988. I have modified this to
> >> allow for simultaneous use in both yt922x and yt921x scenarios.
> > 
> > As pointed out, this is UAPI. Any changes to this file need a good
> > explanation of how they do not change the user API. Does this break
> > backwards compatibility with user space applications? Maybe tcpdump or
> > wireshark has a dissector which expects ETH_P_YT921X and you have just
> > broken it?
> > 
> 
> Ans: Now I have a better understanding of what UAPI represents.
> If a new DSA driver is added in a subsequent patch, I will consider adding a new entry instead of modifying the existing one.

Or just use ETH_P_YT921X.

> yt921x_vlan_add(struct yt921x_priv *priv, int port, u16 vid, bool untagged)
> {
>  u64 mask64;
>  u64 ctrl64;
> 
>  mask64 = YT921X_VLAN_CTRL_PORTn(port) |
>    YT921X_VLAN_CTRL_PORTS(priv->cpu_ports_mask);
>  ctrl64 = mask64;
> 
>  mask64 |= YT921X_VLAN_CTRL_UNTAG_PORTn(port);
>  if (untagged)
>   ctrl64 |= YT921X_VLAN_CTRL_UNTAG_PORTn(port);
> 
>  return yt921x_reg64_update_bits(priv, YT921X_VLANn_CTRL(vid),
>      mask64, ctrl64);
> }
> 
> after patch:
> 
> yt921x_vlan_add(struct yt921x_priv *priv, int port, u16 vid, bool untagged)
> {
>  struct yt_port_mask member;
>  struct yt_port_mask untag;
> 
>  member.portsbits[0] = BIT(port) | priv->cpu_ports_mask;
>  if (untagged)
>   untag.portbits[0] = BIT(port);
> 
>   return yt_vlan_port_set(priv->unit, vid, member, untag);  // Here we use encapsulated interfaces to complete the hardware configuration. 
> 							     // We can ignore the differences between different motorcomm series, which will be reflected in drivers/net/dsa/motorcomm/switch/yt_vlan.c
> }

Look at other DSA drivers, e.g. mv88e6xxx, ocelot. They have
structures like ocelot_ops and mv88e6xxx_ops which abstract the
differences between different families.

	Andrew

^ permalink raw reply
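
The ops-structure pattern Andrew points to can be sketched in a few lines.
This is a hedged userspace illustration only: toy_switch, toy_switch_ops and
the two "families" are made-up names, and the register offsets are
arbitrary. It is not taken from mv88e6xxx, ocelot or the Motorcomm driver.
It shows common code calling through a per-family table of function
pointers instead of a separate abstraction layer.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_switch;

/* Per-family operations, in the spirit of an *_ops structure. */
struct toy_switch_ops {
	int (*vlan_add)(struct toy_switch *sw, int port, unsigned int vid,
			bool untagged);
};

struct toy_switch {
	const struct toy_switch_ops *ops;
	unsigned int last_reg;	/* records which register layout was used */
};

/* Two made-up families that program VLAN membership differently. */
static int toy_family_a_vlan_add(struct toy_switch *sw, int port,
				 unsigned int vid, bool untagged)
{
	(void)port;
	(void)untagged;
	sw->last_reg = 0x1000 + vid;
	return 0;
}

static int toy_family_b_vlan_add(struct toy_switch *sw, int port,
				 unsigned int vid, bool untagged)
{
	(void)port;
	(void)untagged;
	sw->last_reg = 0x2000 + 2 * vid;
	return 0;
}

static const struct toy_switch_ops toy_family_a_ops = {
	.vlan_add = toy_family_a_vlan_add,
};

static const struct toy_switch_ops toy_family_b_ops = {
	.vlan_add = toy_family_b_vlan_add,
};

/* Common driver code calls through the table and stays family-agnostic. */
static int toy_vlan_add(struct toy_switch *sw, int port, unsigned int vid,
			bool untagged)
{
	assert(sw->ops && sw->ops->vlan_add);
	return sw->ops->vlan_add(sw, port, vid, untagged);
}
```

The family is chosen once, typically at probe time, by pointing ops at the
matching table. The rest of the driver never needs to know which chip it is
talking to.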

* [RFT 1/1] usb: class: cdc-wdm: switch to kfifo for buffering
From: Oliver Neukum @ 2026-06-18 11:04 UTC (permalink / raw)
  To: linux-usb, netdev; +Cc: Oliver Neukum
In-Reply-To: <20260618110612.439021-1-oneukum@suse.com>

The kfifo code is more efficient and takes care
of memory ordering without locking.

Signed-off-by: Oliver Neukum <oneukum@suse.com>
---
 drivers/usb/class/cdc-wdm.c | 62 ++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 32 deletions(-)

diff --git a/drivers/usb/class/cdc-wdm.c b/drivers/usb/class/cdc-wdm.c
index 7556c0dac908..83fc253f8c09 100644
--- a/drivers/usb/class/cdc-wdm.c
+++ b/drivers/usb/class/cdc-wdm.c
@@ -27,6 +27,7 @@
 #include <linux/wwan.h>
 #include <asm/byteorder.h>
 #include <linux/unaligned.h>
+#include <linux/kfifo.h>
 #include <linux/usb/cdc-wdm.h>
 
 #define DRIVER_AUTHOR "Oliver Neukum"
@@ -77,7 +78,8 @@ struct wdm_device {
 	u8			*inbuf; /* buffer for response */
 	u8			*outbuf; /* buffer for command */
 	u8			*sbuf; /* buffer for status */
-	u8			*ubuf; /* buffer for copy to user space */
+
+	struct kfifo		ubuf; /* payload */
 
 	struct urb		*command;
 	struct urb		*response;
@@ -92,7 +94,6 @@ struct wdm_device {
 	u16			wMaxCommand;
 	u16			wMaxPacketSize;
 	__le16			inum;
-	int			length;
 	int			read;
 	int			count;
 	dma_addr_t		shandle;
@@ -170,6 +171,7 @@ static void wdm_in_callback(struct urb *urb)
 	struct wdm_device *desc = urb->context;
 	int status = urb->status;
 	int length = urb->actual_length;
+	int processed;
 
 	spin_lock_irqsave(&desc->iuspin, flags);
 	clear_bit(WDM_RESPONDING, &desc->flags);
@@ -218,17 +220,13 @@ static void wdm_in_callback(struct urb *urb)
 		goto skip_zlp;
 	}
 
-	if (length + desc->length > desc->wMaxCommand) {
-		/* The buffer would overflow */
-		set_bit(WDM_OVERFLOW, &desc->flags);
-	} else {
-		/* we may already be in overflow */
-		if (!test_bit(WDM_OVERFLOW, &desc->flags)) {
-			memmove(desc->ubuf + desc->length, desc->inbuf, length);
-			smp_wmb(); /* against wdm_read() */
-			WRITE_ONCE(desc->length, desc->length + length);
-		}
+	processed = kfifo_in(&desc->ubuf, desc->inbuf, length);
+	if (processed < length) {
+		set_bit(WDM_OVERFLOW, &desc->flags);
+		/* WDM_OVERFLOW must not be set after WDM_READ */
+		smp_wmb(); /* against wdm_read() */
 	}
+
 skip_error:
 
 	if (desc->rerr) {
@@ -372,8 +370,8 @@ static void cleanup(struct wdm_device *desc)
 	kfree(desc->inbuf);
 	kfree(desc->orq);
 	kfree(desc->irq);
-	kfree(desc->ubuf);
 	free_urbs(desc);
+	kfifo_free(&desc->ubuf);
 	kfree(desc);
 }
 
@@ -524,8 +522,7 @@ static int service_outstanding_interrupt(struct wdm_device *desc)
 static ssize_t wdm_read
 (struct file *file, char __user *buffer, size_t count, loff_t *ppos)
 {
-	int rv, cntr;
-	int i = 0;
+	int rv, cntr, done;
 	struct wdm_device *desc = file->private_data;
 
 
@@ -533,8 +530,7 @@ static ssize_t wdm_read
 	if (rv < 0)
 		return -ERESTARTSYS;
 
-	cntr = READ_ONCE(desc->length);
-	smp_rmb(); /* against wdm_in_callback() */
+	cntr = kfifo_len(&desc->ubuf);
 	if (cntr == 0) {
 		desc->read = 0;
 retry:
@@ -547,7 +543,6 @@ static ssize_t wdm_read
 			rv = -ENOBUFS;
 			goto err;
 		}
-		i++;
 		if (file->f_flags & O_NONBLOCK) {
 			if (!test_bit(WDM_READ, &desc->flags)) {
 				rv = -EAGAIN;
@@ -568,6 +563,13 @@ static ssize_t wdm_read
 			rv = -EIO;
 			goto err;
 		}
+		smp_rmb(); /* against wdm_in_callback() */
+		if (test_bit(WDM_OVERFLOW, &desc->flags)) {
+			clear_bit(WDM_OVERFLOW, &desc->flags);
+			rv = -ENOBUFS;
+			goto err;
+		}
+
 		usb_mark_last_busy(interface_to_usbdev(desc->intf));
 		if (rv < 0) {
 			rv = -ERESTARTSYS;
@@ -591,31 +593,27 @@ static ssize_t wdm_read
 			goto retry;
 		}
 
-		cntr = desc->length;
+		cntr = kfifo_len(&desc->ubuf);
 		spin_unlock_irq(&desc->iuspin);
 	}
 
 	if (cntr > count)
 		cntr = count;
-	rv = copy_to_user(buffer, desc->ubuf, cntr);
-	if (rv > 0) {
+	rv = kfifo_to_user(&desc->ubuf, buffer, cntr, &done);
+	if (rv < 0) {
 		rv = -EFAULT;
 		goto err;
 	}
 
 	spin_lock_irq(&desc->iuspin);
 
-	for (i = 0; i < desc->length - cntr; i++)
-		desc->ubuf[i] = desc->ubuf[i + cntr];
-
-	desc->length -= cntr;
 	/* in case we had outstanding data */
-	if (!desc->length) {
+	if (kfifo_is_empty(&desc->ubuf)) {
 		clear_bit(WDM_READ, &desc->flags);
 		service_outstanding_interrupt(desc);
 	}
 	spin_unlock_irq(&desc->iuspin);
-	rv = cntr;
+	rv = done;
 
 err:
 	mutex_unlock(&desc->rlock);
@@ -1013,7 +1011,7 @@ static void service_interrupt_work(struct work_struct *work)
 
 	spin_lock_irq(&desc->iuspin);
 	service_outstanding_interrupt(desc);
-	if (!desc->resp_count && (desc->length || desc->rerr)) {
+	if (!desc->resp_count && (!kfifo_is_empty(&desc->ubuf) || desc->rerr)) {
 		set_bit(WDM_READ, &desc->flags);
 		wake_up(&desc->wait);
 	}
@@ -1071,10 +1069,6 @@ static int wdm_create(struct usb_interface *intf, struct usb_endpoint_descriptor
 	if (!desc->command)
 		goto err;
 
-	desc->ubuf = kmalloc(desc->wMaxCommand, GFP_KERNEL);
-	if (!desc->ubuf)
-		goto err;
-
 	desc->sbuf = kmalloc(desc->wMaxPacketSize, GFP_KERNEL);
 	if (!desc->sbuf)
 		goto err;
@@ -1083,6 +1077,10 @@ static int wdm_create(struct usb_interface *intf, struct usb_endpoint_descriptor
 	if (!desc->inbuf)
 		goto err;
 
+	rv = kfifo_alloc(&desc->ubuf, roundup_pow_of_two(desc->wMaxCommand), GFP_KERNEL);
+	if (rv < 0)
+		goto err;
+
 	usb_fill_int_urb(
 		desc->validity,
 		interface_to_usbdev(intf),
-- 
2.54.0


^ permalink raw reply related
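
As background for the overflow check in the patch above, which compares the
return value of kfifo_in() with the requested length, here is a minimal
single-threaded ring buffer sketch. It is not the kernel kfifo
implementation: toy_fifo and its helpers are hypothetical names, the size is
fixed at compile time, and no memory barriers are modelled. It shows two
properties the patch relies on: the size is a power of two so indices can be
masked, and an insert stores only what fits and reports how much that was.

```c
#include <assert.h>

#define TOY_FIFO_SIZE 8u	/* must be a power of two */

struct toy_fifo {
	unsigned char data[TOY_FIFO_SIZE];
	unsigned int in;	/* total bytes ever written */
	unsigned int out;	/* total bytes ever read */
};

static unsigned int toy_fifo_len(const struct toy_fifo *f)
{
	return f->in - f->out;
}

/* Copies as much as fits and returns the number of bytes stored. */
static unsigned int toy_fifo_in(struct toy_fifo *f, const unsigned char *buf,
				unsigned int len)
{
	unsigned int space = TOY_FIFO_SIZE - toy_fifo_len(f);
	unsigned int i;

	assert((TOY_FIFO_SIZE & (TOY_FIFO_SIZE - 1)) == 0);
	if (len > space)
		len = space;
	for (i = 0; i < len; i++)
		f->data[(f->in + i) & (TOY_FIFO_SIZE - 1)] = buf[i];
	f->in += len;
	return len;
}

/* Removes up to len bytes and returns the number of bytes copied out. */
static unsigned int toy_fifo_out(struct toy_fifo *f, unsigned char *buf,
				 unsigned int len)
{
	unsigned int avail = toy_fifo_len(f);
	unsigned int i;

	if (len > avail)
		len = avail;
	for (i = 0; i < len; i++)
		buf[i] = f->data[(f->out + i) & (TOY_FIFO_SIZE - 1)];
	f->out += len;
	return len;
}
```

A caller that gets back fewer bytes than it offered knows the tail was
dropped and can raise an overflow flag, which mirrors the
"processed < length" test in the patch.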

* (no subject)
From: Oliver Neukum @ 2026-06-18 11:04 UTC (permalink / raw)
  To: linux-usb, netdev

Hi, unfortunately my old phone broke, so I am out of options for
testing patches for WDM. I need testers.

This patch is a major modernization of the driver in that it
switches it to using a kfifo.

	Regards
		Oliver

^ permalink raw reply

* [PATCH net v2] net: dsa: sja1105: round up PTP perout pin duration
From: Aleksandrova Alyona @ 2026-06-18 11:05 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Richard Cochran, linux-kernel,
	netdev, lvc-project

pin_duration is converted from the user-provided period to SJA1105
clock ticks and is later passed as the cycle_time argument to
future_base_time().

Very small period values may become zero after the conversion,
which can lead to a division by zero in future_base_time().

Round zero pin_duration up to 1 tick so that the smallest unsupported
periods use the minimum non-zero hardware duration instead of passing
zero to future_base_time().

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 747e5eb31d59 ("net: dsa: sja1105: configure the PTP_CLK pin as EXT_TS or PER_OUT")
Signed-off-by: Aleksandrova Alyona <aga@itb.spb.ru>
---
v2:
- Round up zero pin_duration to 1 instead of rejecting it, as suggested
  by Andrew Lunn.

 drivers/net/dsa/sja1105/sja1105_ptp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/sja1105/sja1105_ptp.c b/drivers/net/dsa/sja1105/sja1105_ptp.c
index a7d41e781398..afb11690c217 100644
--- a/drivers/net/dsa/sja1105/sja1105_ptp.c
+++ b/drivers/net/dsa/sja1105/sja1105_ptp.c
@@ -755,7 +755,7 @@ static int sja1105_per_out_enable(struct sja1105_private *priv,
 		 * 2 edges on PTP_CLK. So check for truncation which happens
 		 * at periods larger than around 68.7 seconds.
 		 */
-		pin_duration = ns_to_sja1105_ticks(pin_duration / 2);
+		pin_duration = max_t(u64, ns_to_sja1105_ticks(pin_duration / 2), 1);
 		if (pin_duration > U32_MAX) {
 			rc = -ERANGE;
 			goto out;
-- 
2.26.2


^ permalink raw reply related
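
The arithmetic behind the fix above can be checked with a small userspace
sketch. The names are hypothetical, and the 8 ns tick length is an
assumption made for illustration; check the driver header for the real
conversion. toy_cycles_elapsed() merely stands in for a helper that divides
by cycle_time, as the commit message says future_base_time() does.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TICK_NS 8u	/* assumed tick length, for illustration only */

static uint64_t toy_ns_to_ticks(uint64_t ns)
{
	return ns / TOY_TICK_NS;	/* integer division truncates */
}

/* Half-period in ticks, clamped so it can safely be used as a divisor. */
static uint64_t toy_pin_duration(uint64_t period_ns)
{
	uint64_t ticks = toy_ns_to_ticks(period_ns / 2);

	return ticks ? ticks : 1;
}

/* Stand-in for a helper that divides by cycle_time. */
static uint64_t toy_cycles_elapsed(uint64_t now, uint64_t base,
				   uint64_t cycle_time)
{
	assert(cycle_time != 0);	/* zero here would divide by zero */
	return (now - base) / cycle_time;
}
```

With these assumptions a 10 ns period gives a 5 ns half-period, which
truncates to 0 ticks. The clamp turns that into 1 tick, so the later
division is well defined.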

* Re: [PATCH net v3] net: airoha: Fix skb->priority underflow in airoha_dev_select_queue()
From: Wayen Yan @ 2026-06-18 14:10 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Lorenzo Bianconi, netdev, horms, pabeni, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek
In-Reply-To: <20260617161951.52abe413@kernel.org>

On Wed, Jun 18, 2026 at 08:19:51AM +0200, Jakub Kicinski wrote:
> Hi Lorenzo, is there a reason we're subtracting 1 here in the first
> place? Could be just me, but may be worth adding a comment here.
>
> Please respin with some sort of an explanation..

Hi Jakub,

The (priority - 1) mapping predates my involvement — I only addressed
the underflow bug when skb->priority is 0, where the unsigned
subtraction wraps and routes best-effort packets to the highest-priority
queue.

Lorenzo, could you clarify the intended priority-to-queue mapping so
I can add a proper comment in the respin?

Regards,
Wayen

^ permalink raw reply
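
The wrap-around described in the reply above is easy to reproduce in
userspace. This is a hedged sketch, not the airoha driver code: the queue
count of 8, the modulo mapping and the function names are assumptions for
illustration, and the guarded variant is one possible fix rather than the
one in the patch under review.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM_QUEUES 8u	/* hypothetical queue count */

/* Unsigned subtraction wraps: priority 0 becomes UINT32_MAX. */
static uint32_t toy_select_queue_wrapping(uint32_t priority)
{
	return (priority - 1u) % TOY_NUM_QUEUES;
}

/* One possible guard: map priority 0 to the lowest queue. */
static uint32_t toy_select_queue_guarded(uint32_t priority)
{
	uint32_t queue;

	if (priority == 0)
		return 0;

	queue = (priority - 1u) % TOY_NUM_QUEUES;
	assert(queue < TOY_NUM_QUEUES);
	return queue;
}
```

With priority 0 the wrapping version evaluates UINT32_MAX % 8, which selects
queue 7, the highest index, while the guarded version keeps best-effort
traffic on queue 0.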

* Re: (subset) [PATCH net-next 1/3] leds: trigger: netdev: Extend speeds up to 100G
From: Lee Jones @ 2026-06-18 11:01 UTC (permalink / raw)
  To: Lee Jones, Pavel Machek, Alexander Duyck, Jakub Kicinski,
	kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Russell King, Daniel Golle, Kees Cook, Simon Horman,
	Dimitri Daskalakis, Jacob Keller, Lee Trager, Mohsin Bashir,
	Alok Tiwari, Chengfeng Ye, Andy Shevchenko, Andrew Lunn,
	mike.marciniszyn
  Cc: linux-kernel, linux-leds, netdev
In-Reply-To: <20260520200337.204431-2-mike.marciniszyn@gmail.com>

On Wed, 20 May 2026 16:03:35 -0400, mike.marciniszyn@gmail.com wrote:
> Add 25G, 40G, 50G, and 100G as available speeds to the netdev LED trigger.

Applied, thanks!

[1/3] leds: trigger: netdev: Extend speeds up to 100G
      commit: bbd8b5bdb88bf15006b078f6a2a3b452ffaa10b4

--
Lee Jones [李琼斯]


^ permalink raw reply

* Re: [PATCH net-next 1/3] net: bcmgenet: collapse TX priority queues to a single queue
From: Florian Fainelli @ 2026-06-18 10:50 UTC (permalink / raw)
  To: Nicolai Buchwitz, Doug Berger, bcm-kernel-feedback-list,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Justin Chen, Ovidiu Panait, netdev, linux-kernel
In-Reply-To: <20260612205915.3156127-2-nb@tipi-net.de>



On 6/12/2026 1:59 PM, Nicolai Buchwitz wrote:
> The strict-priority TX queues can starve under multi-queue load and
> trip NETDEV_WATCHDOG. Justin's earlier series [1] worked around the
> symptom but kept the design.
> 
> The multi-queue design was originally used for STB use cases that are
> no longer needed, as confirmed by Justin. v1 hw_params already
> exercises a single-queue path. Point v2-v4 at the same configuration:
> ring 0 takes the full BD pool, every per-ring loop collapses to one
> iteration, and netif_set_real_num_tx_queues drops to 1 via the
> existing tx_queues + 1 arithmetic.
> 
> Tested on Raspberry Pi CM4 (BCM2711). The baseline kernel trips
> NETDEV_WATCHDOG within seconds under iperf3 UDP saturation
> (-u -b0 -P16 -t60). After the change the same test completes
> without a watchdog, and a single-stream 60 s UDP run sustains
> 956 Mbit/s with 0/4952890 datagrams lost. Single-stream TCP
> throughput is unchanged at 943 Mbit/s.
> 
> [1] https://lore.kernel.org/netdev/20260406175756.134567-1-justin.chen@broadcom.com/
> 
> Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


^ permalink raw reply

* Re: [PATCH net-next 2/3] net: bcmgenet: remove dead priority queue plumbing
From: Florian Fainelli @ 2026-06-18 10:50 UTC (permalink / raw)
  To: Nicolai Buchwitz, Doug Berger, bcm-kernel-feedback-list,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Justin Chen, Ovidiu Panait, netdev, linux-kernel
In-Reply-To: <20260612205915.3156127-3-nb@tipi-net.de>



On 6/12/2026 1:59 PM, Nicolai Buchwitz wrote:
> With a single TX ring there is nothing left to prioritize. Drop the
> unused register writes, enum entries, helper macros, and the dead
> "flow period for ring != 0" branch in bcmgenet_init_tx_ring().
> 
> The DMA_ARBITER_{RR,WRR,SP} and DMA_RING_BUF_PRIORITY_* HW defines
> are kept as register documentation.
> 
> No functional change.
> 
> Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


^ permalink raw reply

* Re: [PATCH net-next 3/3] net: bcmgenet: allocate a single-queue netdev
From: Florian Fainelli @ 2026-06-18 10:50 UTC (permalink / raw)
  To: Nicolai Buchwitz, Doug Berger, bcm-kernel-feedback-list,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Justin Chen, Ovidiu Panait, netdev, linux-kernel
In-Reply-To: <20260612205915.3156127-4-nb@tipi-net.de>



On 6/12/2026 1:59 PM, Nicolai Buchwitz wrote:
> The driver only uses TX ring 0 and RX ring 0, so allocating a netdev
> with GENET_MAX_MQ_CNT + 1 = 5 TX and 5 RX slots leaves four of each
> unused. Switch to alloc_etherdev() which allocates exactly one queue
> of each kind.
> 
> No functional change: netif_set_real_num_{tx,rx}_queues() already
> clamps the visible queue count to 1.
> 
> Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


^ permalink raw reply

* Re: [PATCH bpf v3 2/2] selftests/bpf: Cover partial copy of non-linear test_run output
From: sun jian @ 2026-06-18 10:45 UTC (permalink / raw)
  To: Paul Chaignon
  Cc: bot+bpf-ci, bpf, netdev, linux-kselftest, linux-kernel, ast,
	daniel, andrii, martin.lau, eddyz87, memxor, song, yonghong.song,
	jolsa, davem, edumazet, kuba, pabeni, horms, shuah, hawk,
	john.fastabend, sdf, toke, lorenzo, martin.lau, clm,
	ihor.solodrai
In-Reply-To: <ajOv_oOd1zInaW1b@mail.gmail.com>

On Thu, Jun 18, 2026 at 4:44 PM Paul Chaignon <paul.chaignon@gmail.com> wrote:
>
> On Wed, Jun 17, 2026 at 10:19:52PM +0800, sun jian wrote:
> > On Wed, Jun 17, 2026 at 6:31 PM <bot+bpf-ci@kernel.org> wrote:
> > >
> > > > diff --git a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
> > > > index 01f1d1b6715a..9cc898e6a9f7 100644
> > > > --- a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
> > > > +++ b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
> > > > @@ -4,6 +4,10 @@
> > > >
> > > >  #include "test_pkt_access.skel.h"
> > > >
> > > > +#define NONLINEAR_PKT_LEN 9000
> > > > +#define NONLINEAR_LINEAR_DATA_LEN 64
> > > > +#define SHORT_OUT_LEN 100
> > > > +
> > >
> > > [ ... ]
> > >
> > > > @@ -20,6 +24,69 @@ static void check_run_cnt(int prog_fd, __u64 run_cnt)
> > > >             "incorrect number of repetitions, want %llu have %llu\n", run_cnt, info.run_cnt);
> > > >  }
> > > >
> > > > +static void init_pkt(__u8 *pkt, size_t len)
> > > > +{
> > > > +     size_t i;
> > > > +
> > > > +     for (i = 0; i < len; i++)
> > > > +             pkt[i] = i & 0xff;
> > > > +}
> > >
> > > A question was raised on v2 about whether pkt_v4 could be reused by
> > > reducing the linear area to ETH_HLEN, rather than introducing a custom
> > > init_pkt() with a 9000-byte stack packet.
> > >
> > > Can't we reuse pkt_v4 here by reducing the linear area to ETH_HLEN?
> > > The v3 still adds init_pkt() and the NONLINEAR_PKT_LEN packet, so this
> > > doesn't seem to have been picked up.
> > >
> > > > +
> > > > +static void test_skb_nonlinear_data_out_partial(struct test_pkt_access *skel)
> > > > +{
> > > > +     LIBBPF_OPTS(bpf_test_run_opts, topts);
> > > > +     __u8 pkt[NONLINEAR_PKT_LEN];
> > > > +     __u8 out[SHORT_OUT_LEN];
> > > > +     struct __sk_buff skb = {};
> > > > +     int prog_fd, err;
> > > > +
> > > > +     init_pkt(pkt, sizeof(pkt));
> > > > +
> > > > +     skb.data_end = NONLINEAR_LINEAR_DATA_LEN;
> > > > +
> > > > +     topts.data_in = pkt;
> > > > +     topts.data_size_in = sizeof(pkt);
> > > > +     topts.data_out = out;
> > > > +     topts.data_size_out = sizeof(out);
> > > > +     topts.ctx_in = &skb;
> > > > +     topts.ctx_size_in = sizeof(skb);
> > > > +
> > > > +     prog_fd = bpf_program__fd(skel->progs.tc_pass_prog);
> > >
> > > [ ... ]
> > >
> > > > diff --git a/tools/testing/selftests/bpf/progs/test_pkt_access.c b/tools/testing/selftests/bpf/progs/test_pkt_access.c
> > > > index bce7173152c6..cd284401eebd 100644
> > > > --- a/tools/testing/selftests/bpf/progs/test_pkt_access.c
> > > > +++ b/tools/testing/selftests/bpf/progs/test_pkt_access.c
> > > > @@ -150,3 +150,15 @@ int test_pkt_access(struct __sk_buff *skb)
> > > >
> > > >       return TC_ACT_UNSPEC;
> > > >  }
> > > > +
> > > > +SEC("tc")
> > > > +int tc_pass_prog(struct __sk_buff *skb)
> > > > +{
> > > > +     return TC_ACT_OK;
> > > > +}
> > > > +
> > > > +SEC("xdp.frags")
> > > > +int xdp_frags_pass_prog(struct xdp_md *ctx)
> > > > +{
> > > > +     return XDP_PASS;
> > > > +}
> > >
> > > A related suggestion on v2 was that, once pkt_v4 is reused, the existing
> > > BPF program could be reused instead of adding new pass-through programs.
> > >
> > > Could tc_pass_prog and xdp_frags_pass_prog be dropped in favour of the
> > > existing program? The v3 still adds both of these, so this point also
> > > seems to be open.
> > >
> > >
> > > ---
> > > AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> > > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> > >
> > > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27680511802
> >
> > Hi,
> >
> > Thanks for checking this.
>
> Hi Sun Jian,
>
> It would help if you could reply inline instead of at the end of the
> messages, especially when there are multiple comments. See [1] for an
> explanation of how that works.
>
> 1: https://kernelnewbies.org/FirstKernelPatch#Responding_inline

Acknowledged, I will reply inline.

>
> >
> > I tried reusing pkt_v4 and the existing TC program, but they do not fit
> > the skb case this test is trying to cover.
> >
> > For skb test_run, IPv4/IPv6 inputs with a too-short L3 header in the
> > linear area are rejected before bpf_test_finish(). With pkt_v4 and a
> > linear area of ETH_HLEN, the test fails with -EINVAL before reaching the
> > partial copy-out path. If the linear area is increased enough to pass the
> > IPv4 check, pkt_v4 is too small to both trigger the old
> > copy_size - frag_size path and verify that the copied prefix spans the
> > linear data and the first fragment. pkt_v6 has the same issue: after
> > making the IPv6 header linear, only 20 bytes remain in frags.
> >
> > The existing test_pkt_access program has its own packet-access coverage
> > goals and is not just a pass-through carrier. With such a short linear
> > area or small packet fixture, it can fail before the test reaches the
> > partial copy-out path in bpf_test_finish(). A pass-through TC program is
> > therefore a better fit, because it keeps the test focused on the
> > bpf_test_finish() copy-out semantics.
>
> If we're keeping tc_pass_prog() then can't we use pkt_v4 and get rid of
> init_pkt?
>

pkt_v4 is too small to construct a meaningful nonlinear skb with a stable
linear/frag split while still exercising the partial copy-out boundary in
bpf_test_finish().

With pkt_v4, we either do not reach a fragmented layout, or lose control over
the linear/frag boundary needed to exercise the regression path.

This test uses a 9000-byte packet so it does not depend on small-packet
allocation details. Smaller packets might work depending on allocation
state, but 9000 bytes reliably gives us a non-linear skb with page frags
and a stable linear/frag boundary for the copy-out regression.

init_pkt() is needed to ensure deterministic byte content across both linear
and fragmented regions so that the memcmp-based validation is stable.

Thanks,
Sun Jian


> >
> > For XDP, this object does not have an existing xdp.frags pass-through
> > program, so the small XDP frags program is needed to cover the other
> > caller of the shared bpf_test_finish() path.
> >
> > Thanks,
> > Sun Jian
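
The copy-out behaviour discussed in this thread can be sketched in plain
userspace C. This is only an illustration under stated assumptions, not the
kernel's bpf_test_finish(): copy_prefix() and partial_copy_matches() are
hypothetical names, and the packet is modelled as one linear area followed
by a single fragment. The constants and the init_pkt() fill pattern are
taken from the quoted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NONLINEAR_PKT_LEN 9000
#define NONLINEAR_LINEAR_DATA_LEN 64
#define SHORT_OUT_LEN 100

/* Same deterministic fill pattern as init_pkt() in the quoted patch. */
static void init_pkt(unsigned char *pkt, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		pkt[i] = i & 0xff;
}

/*
 * Hypothetical helper (not a kernel function): copy at most out_len bytes,
 * taking them from the linear area first and from the fragment afterwards.
 * Returns the number of bytes written to out.
 */
static size_t copy_prefix(unsigned char *out, size_t out_len,
			  const unsigned char *linear, size_t linear_len,
			  const unsigned char *frag, size_t frag_len)
{
	size_t from_linear = out_len < linear_len ? out_len : linear_len;
	size_t left = out_len - from_linear;
	size_t from_frag = left < frag_len ? left : frag_len;

	memcpy(out, linear, from_linear);
	memcpy(out + from_linear, frag, from_frag);
	return from_linear + from_frag;
}

/* Returns 1 when the short output buffer holds the expected packet prefix. */
static int partial_copy_matches(void)
{
	static unsigned char pkt[NONLINEAR_PKT_LEN];
	unsigned char out[SHORT_OUT_LEN];
	size_t n;

	init_pkt(pkt, sizeof(pkt));
	n = copy_prefix(out, sizeof(out),
			pkt, NONLINEAR_LINEAR_DATA_LEN,
			pkt + NONLINEAR_LINEAR_DATA_LEN,
			sizeof(pkt) - NONLINEAR_LINEAR_DATA_LEN);

	/* The 100-byte prefix spans 64 linear bytes plus 36 fragment bytes. */
	return n == sizeof(out) && memcmp(out, pkt, sizeof(out)) == 0;
}
```

With a 64-byte linear area and a 100-byte output buffer, the expected prefix
necessarily crosses the linear/frag boundary, which is the property the
memcmp-based validation relies on.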

^ permalink raw reply

* [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Yun Zhou @ 2026-06-18 10:43 UTC (permalink / raw)
  To: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, bigeasy
  Cc: netdev, linux-kernel, yun.zhou

On Armada XP (non-armada3700), mvneta uses percpu interrupts where
the ISR (mvneta_percpu_isr) calls disable_percpu_irq() to mask the
MPIC percpu IRQ, then schedules NAPI. NAPI poll completion calls
enable_percpu_irq() to unmask.

If suspend occurs while NAPI is actively polling (between
disable_percpu_irq in the ISR and enable_percpu_irq in
napi_complete_done), the MPIC percpu interrupt remains masked.
mvneta_stop_dev/mvneta_start_dev do not manage the percpu IRQ
enable state -- they only control mvneta's own INTR_NEW_MASK register.

After resume, the MPIC percpu IRQ stays masked permanently: the
network hardware generates interrupts (INTR_NEW_CAUSE != 0) but the
CPU never receives them (irq count stops incrementing), causing a
complete loss of network connectivity.

Fix by calling on_each_cpu(mvneta_percpu_enable) after
mvneta_start_dev() in the resume path, ensuring the MPIC percpu
IRQ is always unmasked regardless of the pre-suspend state.

Fixes: 12bb03b436da ("net: mvneta: Handle per-cpu interrupts")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3:
  - Dropped the free_irq/request_irq approach (incorrect root cause).
  - Instead, call on_each_cpu(mvneta_percpu_enable) in the resume path
    to ensure the MPIC percpu IRQ is unmasked, matching mvneta_open().
  - Updated commit message with correct root cause analysis.

v2:
  - Move request_irq before cpuhp registration in resume (matching
    mvneta_open ordering) so that failure does not leave cpuhp
    callbacks registered on a non-functional device.
  - On request_irq failure, call netif_device_detach() to prevent
    further traffic on the dead interface.

 drivers/net/ethernet/marvell/mvneta.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b4a845f04c05..5ef79e70e319 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -5907,6 +5907,9 @@ static int mvneta_resume(struct device *device)
 	rtnl_unlock();
 	mvneta_set_rx_mode(dev);

+	if (!pp->neta_armada3700)
+		on_each_cpu(mvneta_percpu_enable, pp, true);
+
 	return 0;
 }
 #endif
-- 
2.43.0
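
The failure sequence described in the commit message can be modelled as a
small state machine. This is a userspace sketch under simplifying
assumptions, not driver code: struct model_port and every model_* function
are hypothetical, one flag stands in for the MPIC percpu mask, and another
stands in for the device's own interrupt mask register.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two independent interrupt gates. */
struct model_port {
	bool percpu_irq_masked;	/* models the MPIC percpu IRQ mask */
	bool dev_intr_enabled;	/* models the device's own interrupt mask */
};

/* ISR: masks the percpu IRQ and hands the work over to NAPI. */
static void model_isr(struct model_port *p)
{
	p->percpu_irq_masked = true;
}

/* NAPI poll completion: unmasks the percpu IRQ again. */
static void model_napi_complete(struct model_port *p)
{
	p->percpu_irq_masked = false;
}

/* Stop/start only touch the device-side mask, never the percpu mask. */
static void model_stop_dev(struct model_port *p)
{
	p->dev_intr_enabled = false;
}

static void model_start_dev(struct model_port *p)
{
	p->dev_intr_enabled = true;
}

/* Resume; with_fix models the unconditional unmask added by the patch. */
static bool model_resume(struct model_port *p, bool with_fix)
{
	model_start_dev(p);
	if (with_fix)
		p->percpu_irq_masked = false;
	/* Interrupts reach the CPU only if both gates are open. */
	return p->dev_intr_enabled && !p->percpu_irq_masked;
}

/* Suspend lands between the ISR and the NAPI completion. */
static bool suspend_mid_poll_then_resume(bool with_fix)
{
	struct model_port p = { false, true };

	model_isr(&p);
	model_stop_dev(&p);	/* model_napi_complete() never runs */
	return model_resume(&p, with_fix);
}

/* Normal case: NAPI completes before suspend. */
static bool suspend_idle_then_resume(bool with_fix)
{
	struct model_port p = { false, true };

	model_isr(&p);
	model_napi_complete(&p);
	model_stop_dev(&p);
	return model_resume(&p, with_fix);
}
```

In this model only the mid-poll suspend without the fix ends with the
interrupt path blocked; the idle case works either way, which is consistent
with the problem being timing dependent.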

^ permalink raw reply related

* [PATCH net] ipv6: ioam: fix type confusion of dst_entry
From: Jiayuan Chen @ 2026-06-18 10:43 UTC (permalink / raw)
  To: netdev
  Cc: Jiayuan Chen, Justin Iurman, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, linux-kernel

IOAM uses a dummy dst_entry (null_dst) to mark that the destination should
not be changed after the transformation. This dst is stored in the IOAM lwt
state and may be passed to dst_cache_set_ip6().

However, the IPv6 dst cache path eventually calls rt6_get_cookie(), which
treats the dst_entry as part of a struct rt6_info. Since the null_dst was
embedded directly as a struct dst_entry in struct ioam6_lwt, this resulted
in an invalid cast and rt6_get_cookie() reading fields from the wrong
object.

In practice, the wrong cookie is not used while dst->obsolete is zero, but
rt6_get_cookie() may also access a per-cpu value when rt->sernum is
zero. In this case, rt->sernum aliases ioam6_lwt::cache::reset_ts, which
can become zero, making this a potential invalid pointer access.

Fix this by embedding a full struct rt6_info for the dummy IPv6 route and
passing its dst member to the dst APIs.

Fixes: 47ce7c854563 ("net: ipv6: ioam6: fix double reallocation")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 net/ipv6/ioam6_iptunnel.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ioam6_iptunnel.c b/net/ipv6/ioam6_iptunnel.c
index b9f6d892a566..cfb2c41634a0 100644
--- a/net/ipv6/ioam6_iptunnel.c
+++ b/net/ipv6/ioam6_iptunnel.c
@@ -35,7 +35,7 @@ struct ioam6_lwt_freq {
 };

 struct ioam6_lwt {
-	struct dst_entry null_dst;
+	struct rt6_info null_rt;
 	struct dst_cache cache;
 	struct ioam6_lwt_freq freq;
 	atomic_t pkt_cnt;
@@ -176,7 +176,7 @@ static int ioam6_build_state(struct net *net, struct nlattr *nla,
 	 * it is stored in the cache. Then, +1/-1 each time we read the cache
 	 * and release it. Long story short, we're fine.
 	 */
-	dst_init(&ilwt->null_dst, NULL, NULL, DST_OBSOLETE_NONE, DST_NOCOUNT);
+	dst_init(&ilwt->null_rt.dst, NULL, NULL, DST_OBSOLETE_NONE, DST_NOCOUNT);

 	atomic_set(&ilwt->pkt_cnt, 0);
 	ilwt->freq.k = freq_k;
@@ -360,7 +360,7 @@ static int ioam6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	/* This is how we notify that the destination does not change after
 	 * transformation and that we need to use orig_dst instead of the cache
 	 */
-	if (dst == &ilwt->null_dst) {
+	if (dst == &ilwt->null_rt.dst) {
 		dst_release(dst);

 		dst = orig_dst;
@@ -429,7 +429,7 @@ static int ioam6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 		local_bh_disable();
 		if (orig_dst->lwtstate == dst->lwtstate)
 			dst_cache_set_ip6(&ilwt->cache,
-					  &ilwt->null_dst, &fl6.saddr);
+					  &ilwt->null_rt.dst, &fl6.saddr);
 		else
 			dst_cache_set_ip6(&ilwt->cache, dst, &fl6.saddr);
 		local_bh_enable();
-- 
2.43.0
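
The layout problem behind this fix can be illustrated with offsets alone.
The structs below are deliberately simplified stand-ins, not the kernel
definitions: model_dst, model_rt6, model_cache, old_lwt and new_lwt are all
hypothetical, and only the relative placement of the fields matters. The
sketch checks whether a field reached by treating the embedded dst as the
start of a larger route object still lies inside the embedded member.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; field sets are assumptions for illustration. */
struct model_dst {
	int obsolete;
	unsigned long expires;
};

struct model_rt6 {
	struct model_dst dst;	/* first member, as the cast assumes */
	unsigned int sernum;
};

struct model_cache {
	unsigned long reset_ts;
};

/* Before the fix: only a bare dst is embedded. */
struct old_lwt {
	struct model_dst null_dst;
	struct model_cache cache;
};

/* After the fix: a full route object is embedded. */
struct new_lwt {
	struct model_rt6 null_rt;
	struct model_cache cache;
};

/* End offset of sernum, measured from the start of the embedded dst. */
static size_t sernum_end(void)
{
	return offsetof(struct model_rt6, sernum) + sizeof(unsigned int);
}

/* 1 if sernum stays inside the embedded member of the old layout. */
static int sernum_inside_old(void)
{
	return sernum_end() <= sizeof(((struct old_lwt *)0)->null_dst);
}

/* 1 if sernum stays inside the embedded member of the new layout. */
static int sernum_inside_new(void)
{
	return sernum_end() <= sizeof(((struct new_lwt *)0)->null_rt);
}
```

In the old layout the computed range falls past the end of null_dst, so a
reader that assumes a route object ends up looking at whatever follows it
in the containing struct. Embedding the full object keeps the access in
bounds.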

^ permalink raw reply related

* Re: [PATCH net] net/sched: act_ct: fix nf_connlabels leak on two error paths
From: Jamal Hadi Salim @ 2026-06-18 10:41 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: Jiri Pirko, Pablo Neira Ayuso, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20260617215708.1115818-1-michael.bommarito@gmail.com>

On Wed, Jun 17, 2026 at 5:57 PM Michael Bommarito
<michael.bommarito@gmail.com> wrote:
>
> tcf_ct_fill_params() calls nf_connlabels_get() (setting put_labels) when
> TCA_CT_LABELS is present, but two later error sites use a bare return
> instead of "goto err", skipping the err: nf_connlabels_put() cleanup.
> They also precede the "p->put_labels = put_labels" assignment, so the
> tcf_ct_params_free() fallback does not release the count either. Each
> failed RTM_NEWACTION on these paths leaks one nf_connlabels reference:
> net->ct.labels_used is incremented and never released. The action is
> reachable with CAP_NET_ADMIN over the netns, i.e. from an unprivileged
> user namespace on default-userns kernels.
>
> Impact: an unprivileged user with CAP_NET_ADMIN over a network namespace
> (e.g. via user namespaces) leaks one nf_connlabels reference per failed
> RTM_NEWACTION on the two error paths; net->ct.labels_used is never
> released.
>
> The err: label is safe to reach from both sites: p->tmpl is still NULL
> there (kzalloc'd, not yet assigned) and nf_ct_put(NULL) is a no-op, so
> no inline release is needed.
>
> Fixes: 70f06c115bcc ("sched: act_ct: switch to per-action label counting")
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
> Testing: refcount/counter leak (CWE-772); no sanitizer for this class, so
> the oracle is the nf_connlabels accounting counter net->ct.labels_used.
>
> Reproduction (UML, before/after, same trigger): CONFIG_NET_ACT_CT=y,
> NF_CONNTRACK_LABELS=y, NF_CONNTRACK_ZONES=n (forces the zone-disabled
> path). A raw RTM_NEWACTION trigger adds "action ct label 0x1/0x1 zone 1"
> 20 times; each returns -EOPNOTSUPP.
>   stock:   net->ct.labels_used climbs 1,2,...,20 (get, then bare return,
>            no put) -- 20 leaked counts, never recovered.
>   patched: counter stays balanced (get then goto err -> put); baseline.
> Control: the same loop without "label" (no nf_connlabels_get) leaves the
> counter unchanged on both trees -- the trigger reached the labels path and
> the synthesis is not itself the cause.
>
> Conditions: reachable via RTM_NEWACTION with CAP_NET_ADMIN over the netns,
> i.e. an unprivileged user in a fresh user+net namespace on default-userns
> distros. The easy path needs CONFIG_NF_CONNTRACK_ZONES=n; the
> nf_ct_tmpl_alloc ENOMEM path leaks on any config under memory pressure.
>
> Mitigations: restrict unprivileged user namespaces; otherwise none short of
> the fix. Harness (trigger.c, init) available on request.
>
>  net/sched/act_ct.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> index 6158e13c98d35..f5866a364a74a 100644
> --- a/net/sched/act_ct.c
> +++ b/net/sched/act_ct.c
> @@ -1295,7 +1295,8 @@ static int tcf_ct_fill_params(struct net *net,
>         if (tb[TCA_CT_ZONE]) {
>                 if (!IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES)) {
>                         NL_SET_ERR_MSG_MOD(extack, "Conntrack zones isn't enabled.");
> -                       return -EOPNOTSUPP;
> +                       err = -EOPNOTSUPP;
> +                       goto err;
>                 }
>
>                 tcf_ct_set_key_val(tb,
> @@ -1308,7 +1309,8 @@ static int tcf_ct_fill_params(struct net *net,
>         tmpl = nf_ct_tmpl_alloc(net, &zone, GFP_KERNEL);
>         if (!tmpl) {
>                 NL_SET_ERR_MSG_MOD(extack, "Failed to allocate conntrack template");
> -               return -ENOMEM;
> +               err = -ENOMEM;
> +               goto err;
>         }
>         p->tmpl = tmpl;
>         if (tb[TCA_CT_HELPER_NAME]) {

Looks sane to me.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal
> --
> 2.53.0
>
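
The accounting difference between the bare return and the goto-based
cleanup can be reproduced with a counter in userspace. This is a sketch
under stated assumptions, not the act_ct code: labels_used, labels_get(),
labels_put(), fill_params() and leaked_after() are hypothetical stand-ins,
and only the zones-disabled error path is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the per-netns label user counter. */
static int labels_used;

static void labels_get(void)
{
	labels_used++;
}

static void labels_put(void)
{
	labels_used--;
}

/*
 * Hypothetical model of the parameter-filling step. use_goto selects the
 * patched behaviour (unwind through the err label) or the old bare return.
 */
static int fill_params(bool want_labels, bool zones_enabled, bool use_goto)
{
	bool put_labels = false;
	int err;

	if (want_labels) {
		labels_get();
		put_labels = true;
	}

	if (!zones_enabled) {
		if (!use_goto)
			return -EOPNOTSUPP;	/* skips the cleanup below */
		err = -EOPNOTSUPP;
		goto err;
	}

	/* Success: the reference stays with the new parameters. */
	return 0;

err:
	if (put_labels)
		labels_put();
	return err;
}

/* Run the failing request repeatedly and report the leftover count. */
static int leaked_after(int attempts, bool use_goto)
{
	int i;

	labels_used = 0;
	for (i = 0; i < attempts; i++)
		fill_params(true, false, use_goto);
	return labels_used;
}
```

Running the failing request 20 times mirrors the reproduction in the patch
notes: the bare-return variant leaves the counter at 20, while the goto
variant returns it to the baseline.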

^ permalink raw reply

* [PATCH bpf v3 2/2] selftests/bpf: test rejection of a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-18 10:27 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski
  Cc: Simon Horman, Bobby Eshleman, Jiayuan Chen, netdev, bpf,
	linux-kernel
In-Reply-To: <20260618102718.2331468-1-rhkrqnwk98@gmail.com>

Verify that attaching an SK_SKB stream parser that can modify the packet
is rejected, while a read-only parser still attaches.

Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 .../selftests/bpf/prog_tests/sockmap_strp.c   | 31 +++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_strp.c   |  7 +++++
 2 files changed, 38 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
index 621b3b71888e..1d7231728eaf 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
@@ -431,6 +431,35 @@ static void test_sockmap_strp_verdict(int family, int sotype)
 	test_sockmap_strp__destroy(strp);
 }
 
+static void test_sockmap_strp_parser_reject(void)
+{
+	struct test_sockmap_strp *strp = NULL;
+	int parser_mod, parser_ro, link;
+	int err, map;
+
+	strp = test_sockmap_strp__open_and_load();
+	if (!ASSERT_OK_PTR(strp, "test_sockmap_strp__open_and_load"))
+		return;
+
+	map = bpf_map__fd(strp->maps.sock_map);
+	parser_mod = bpf_program__fd(strp->progs.prog_skb_parser_resize);
+	parser_ro = bpf_program__fd(strp->progs.prog_skb_parser);
+
+	err = bpf_prog_attach(parser_mod, map, BPF_SK_SKB_STREAM_PARSER, 0);
+	ASSERT_ERR(err, "bpf_prog_attach parser_mod");
+
+	link = bpf_link_create(parser_ro, map, BPF_SK_SKB_STREAM_PARSER, NULL);
+	if (!ASSERT_GE(link, 0, "bpf_link_create parser_ro"))
+		goto out;
+
+	err = bpf_link_update(link, parser_mod, NULL);
+	ASSERT_ERR(err, "bpf_link_update parser_mod");
+out:
+	if (link >= 0)
+		close(link);
+	test_sockmap_strp__destroy(strp);
+}
+
 void test_sockmap_strp(void)
 {
 	if (test__start_subtest("sockmap strp tcp pass"))
@@ -451,4 +480,6 @@ void test_sockmap_strp(void)
 		test_sockmap_strp_multiple_pkt(AF_INET, SOCK_STREAM);
 	if (test__start_subtest("sockmap strp tcp dispatch"))
 		test_sockmap_strp_dispatch_pkt(AF_INET, SOCK_STREAM);
+	if (test__start_subtest("sockmap strp parser reject pkt mod"))
+		test_sockmap_strp_parser_reject();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_strp.c b/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
index dde3d5bec515..fe88fa6d40bc 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
@@ -50,4 +50,11 @@ int prog_skb_parser_partial(struct __sk_buff *skb)
 	return 10;
 }
 
+SEC("sk_skb/stream_parser")
+int prog_skb_parser_resize(struct __sk_buff *skb)
+{
+	bpf_skb_change_tail(skb, skb->len, 0);
+	return skb->len;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.43.0


^ permalink raw reply related

* [PATCH bpf v3 1/2] bpf, sockmap: fix use-after-free when the stream parser resizes the skb
From: Sechang Lim @ 2026-06-18 10:27 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski
  Cc: Simon Horman, Bobby Eshleman, Jiayuan Chen, netdev, bpf,
	linux-kernel
In-Reply-To: <20260618102718.2331468-1-rhkrqnwk98@gmail.com>

sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
to find the length of the next message. strparser assembles a message out
of several received skbs by chaining them onto the head's frag_list and
recording where to append the next one in strp->skb_nextp:

	*strp->skb_nextp = skb;
	strp->skb_nextp = &skb->next;

and then calls the parser on the head:

	len = (*strp->cb.parse_msg)(strp, head);

The parser is only meant to inspect the skb, but the program may call
bpf_skb_change_tail() -- or the sibling bpf_skb_pull_data(),
bpf_skb_change_head(), bpf_skb_adjust_room(), all allowed for SK_SKB.
Once the head carries a frag_list these go

	... -> skb_ensure_writable -> pskb_may_pull -> __pskb_pull_tail

and __pskb_pull_tail() frees the frag_list skbs that strparser still
tracks through skb_nextp:

	while ((list = skb_shinfo(skb)->frag_list) != insp) {
		skb_shinfo(skb)->frag_list = list->next;
		consume_skb(list);
	}

strp->skb_nextp now points into a freed sk_buff. The next segment of
the same message arrives in __strp_recv(), which links it with
*strp->skb_nextp = skb, an 8-byte write into the freed skb. The free
and the write happen in different __strp_recv() calls, so the message
has to span at least three segments before it triggers.

  BUG: KASAN: slab-use-after-free in __strp_recv+0x447/0xda0
  Write of size 8 at addr ffff88810db86140 by task repro/349

  Call Trace:
   <IRQ>
   __strp_recv+0x447/0xda0
   __tcp_read_sock+0x13d/0x590
   tcp_bpf_strp_read_sock+0x195/0x320
   strp_data_ready+0x267/0x340
   sk_psock_strp_data_ready+0x1ce/0x350
   tcp_data_queue+0x1364/0x2fd0
   tcp_rcv_established+0xe07/0x1640
   [...]

  Allocated by task 349:
   skb_clone+0x17b/0x210
   __strp_recv+0x2c3/0xda0
   __tcp_read_sock+0x13d/0x590
   [...]

  Freed by task 349:
   kmem_cache_free+0x150/0x570
   __pskb_pull_tail+0x57b/0xc20
   skb_ensure_writable+0x236/0x260
   __bpf_skb_change_tail+0x1d4/0x590
   sk_skb_change_tail+0x2a/0x40
   bpf_prog_1b285dcd6c41373e+0x27/0x30
   bpf_prog_run_pin_on_cpu+0xf3/0x260
   sk_psock_strp_parse+0x118/0x1e0
   __strp_recv+0x4f6/0xda0
   [...]

The same resize also leaves the head's length inconsistent with its
frags, so a later __pskb_pull_tail() can instead hit the
BUG_ON(skb_copy_bits(...)) in net/core/skbuff.c.

A stream parser is only meant to measure the next message, not to modify
the packet. Reject a parser whose program can change packet data
(prog->aux->changes_pkt_data) at attach time. The check is shared by
sock_map_prog_update() and sock_map_link_update_prog(), which between them
cover prog attach, link create and link update. Verdict programs are
unaffected and may still modify the skb.

Fixes: 8a31db561566 ("bpf: add access to sock fields and pkt data from sk_skb programs")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 net/core/sock_map.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 99e3789492a0..c60ba6d292f9 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1515,6 +1515,17 @@ static int sock_map_prog_link_lookup(struct bpf_map *map, struct bpf_prog ***ppr
 	return 0;
 }

+static int sock_map_prog_attach_check(enum bpf_attach_type attach_type,
+				      struct bpf_prog *prog)
+{
+	/* A stream parser must not modify the skb, only measure it. */
+	if (prog && attach_type == BPF_SK_SKB_STREAM_PARSER &&
+	    prog->aux->changes_pkt_data)
+		return -EINVAL;
+
+	return 0;
+}
+
 /* Handle the following four cases:
  * prog_attach: prog != NULL, old == NULL, link == NULL
  * prog_detach: prog == NULL, old != NULL, link == NULL
@@ -1533,6 +1544,10 @@ static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
 	if (ret)
 		return ret;

+	ret = sock_map_prog_attach_check(which, prog);
+	if (ret)
+		return ret;
+
 	/* for prog_attach/prog_detach/link_attach, return error if a bpf_link
 	 * exists for that prog.
 	 */
@@ -1776,6 +1791,11 @@ static int sock_map_link_update_prog(struct bpf_link *link,
 		ret = -EINVAL;
 		goto out;
 	}
+
+	ret = sock_map_prog_attach_check(link->attach_type, prog);
+	if (ret)
+		goto out;
+
 	if (!sockmap_link->map) {
 		ret = -ENOLINK;
 		goto out;
-- 
2.43.0
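
The dangling skb_nextp described above can be traced with a small model.
This is a userspace sketch, not strparser code: model_skb, model_strp and
the model_* functions are hypothetical, the head setup is simplified, and a
"freed" flag replaces real freeing so the model itself never touches
released memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_skb {
	struct model_skb *next;
	struct model_skb *frag_list;
	bool freed;			/* set instead of really freeing */
};

struct model_strp {
	struct model_skb *head;
	struct model_skb **skb_nextp;	/* where the next segment is linked */
	struct model_skb *nextp_owner;	/* skb that contains *skb_nextp */
};

/* Receive one segment: the first becomes the head, later ones are chained. */
static void model_recv(struct model_strp *s, struct model_skb *skb)
{
	if (!s->head) {
		s->head = skb;
		s->skb_nextp = &skb->frag_list;
		s->nextp_owner = skb;
		return;
	}
	*s->skb_nextp = skb;
	s->skb_nextp = &skb->next;
	s->nextp_owner = skb;
}

/* Parser resize: drops every skb on the head's frag_list. */
static void model_parser_resize(struct model_strp *s)
{
	struct model_skb *list;

	while ((list = s->head->frag_list) != NULL) {
		s->head->frag_list = list->next;
		list->freed = true;
	}
}

/* True if linking a third segment would write into a freed skb. */
static bool model_next_append_hits_freed(bool parser_resizes)
{
	struct model_skb seg[2] = { { NULL, NULL, false } };
	struct model_strp s = { NULL, NULL, NULL };

	model_recv(&s, &seg[0]);	/* segment 1 becomes the head */
	model_recv(&s, &seg[1]);	/* segment 2 goes onto frag_list */
	if (parser_resizes)
		model_parser_resize(&s);
	/* Segment 3 would be stored through skb_nextp, inside its owner. */
	return s.nextp_owner->freed;
}
```

The model needs two received segments before the resize and a third for
the write, which matches the observation that the message has to span at
least three segments.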

^ permalink raw reply related

* [PATCH bpf v3 0/2] bpf, sockmap: reject a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-18 10:27 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski
  Cc: Simon Horman, Bobby Eshleman, Jiayuan Chen, netdev, bpf,
	linux-kernel

A BPF_PROG_TYPE_SK_SKB stream parser runs on strparser's message head,
which can chain skbs through frag_list. A parser that resizes the skb
frees the frag_list segments that strparser still tracks through
skb_nextp, leading to a use-after-free.

A stream parser is only meant to measure the next message, not to modify
the packet, so reject a packet-modifying parser at attach time rather
than working around the resize at runtime.

v3:
 - reject the parser at attach time instead of cloning the skb at
   runtime (Kuniyuki Iwashima, Jiayuan Chen)
 - add a selftest (Bobby Eshleman)

v2:
 - https://lore.kernel.org/all/20260612123553.2724240-1-rhkrqnwk98@gmail.com/

v1:
 - https://lore.kernel.org/all/20260609112316.3685738-1-rhkrqnwk98@gmail.com/

Sechang Lim (2):
  bpf, sockmap: fix use-after-free when the stream parser resizes the
    skb
  selftests/bpf: test rejection of a packet-modifying SK_SKB stream
    parser

 net/core/sock_map.c                           | 20 ++++++++++++
 .../selftests/bpf/prog_tests/sockmap_strp.c   | 31 +++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_strp.c   |  7 +++++
 3 files changed, 58 insertions(+)

-- 
2.43.0


^ permalink raw reply

* [PATCH bpf-next v2 2/2] selftests/bpf: Cover small conntrack opts error writes
From: Yiyang Chen @ 2026-06-18 10:18 UTC (permalink / raw)
  To: bpf, netfilter-devel
  Cc: Yiyang Chen, pablo, fw, phil, davem, edumazet, kuba, pabeni,
	horms, andrii, eddyz87, ast, daniel, memxor, martin.lau, song,
	yonghong.song, jolsa, emil, shuah, kartikey406, coreteam, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <cover.1781765747.git.chenyy23@mails.tsinghua.edu.cn>

Add a conntrack kfunc regression check for opts__sz values that do not
cover opts->error. The BPF program initializes opts->error with a guard
value, calls the lookup and allocation kfuncs with opts__sz set to
sizeof(opts->netns_id), and verifies that the guard is still intact
after the kfunc returns NULL.

Without the conntrack wrapper guard, the kfunc error path overwrites
that guard with -EINVAL even though the verifier checked only the first
four bytes of the options object.

Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF")
Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT")
Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn>
---
 .../testing/selftests/bpf/prog_tests/bpf_nf.c |  6 +++++
 .../testing/selftests/bpf/progs/test_bpf_nf.c | 26 +++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_nf.c b/tools/testing/selftests/bpf/prog_tests/bpf_nf.c
index b33dba4b126e2..14d4c1793aed5 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_nf.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_nf.c
@@ -5,6 +5,8 @@
 #include "test_bpf_nf.skel.h"
 #include "test_bpf_nf_fail.skel.h"
 
+#define CT_OPTS_ERROR_GUARD 0x12345678
+
 static char log_buf[1024 * 1024];
 
 struct {
@@ -119,6 +121,10 @@ static void test_bpf_nf_ct(int mode)
 	ASSERT_EQ(skel->bss->test_einval_reserved_new, -EINVAL, "Test EINVAL for reserved in new struct not set to 0");
 	ASSERT_EQ(skel->bss->test_einval_netns_id, -EINVAL, "Test EINVAL for netns_id < -1");
 	ASSERT_EQ(skel->bss->test_einval_len_opts, -EINVAL, "Test EINVAL for len__opts != NF_BPF_CT_OPTS_SZ");
+	ASSERT_EQ(skel->bss->test_einval_len_opts_small_lookup, CT_OPTS_ERROR_GUARD,
+		  "Test no error write for lookup opts__sz before error field");
+	ASSERT_EQ(skel->bss->test_einval_len_opts_small_alloc, CT_OPTS_ERROR_GUARD,
+		  "Test no error write for alloc opts__sz before error field");
 	ASSERT_EQ(skel->bss->test_eproto_l4proto, -EPROTO, "Test EPROTO for l4proto != TCP or UDP");
 	ASSERT_EQ(skel->bss->test_enonet_netns_id, -ENONET, "Test ENONET for bad but valid netns_id");
 	ASSERT_EQ(skel->bss->test_enoent_lookup, -ENOENT, "Test ENOENT for failed lookup");
diff --git a/tools/testing/selftests/bpf/progs/test_bpf_nf.c b/tools/testing/selftests/bpf/progs/test_bpf_nf.c
index 076fbf03a1268..df43649ecb785 100644
--- a/tools/testing/selftests/bpf/progs/test_bpf_nf.c
+++ b/tools/testing/selftests/bpf/progs/test_bpf_nf.c
@@ -10,6 +10,8 @@
 #define EINVAL 22
 #define ENOENT 2
 
+#define CT_OPTS_ERROR_GUARD 0x12345678
+
 #define NF_CT_ZONE_DIR_ORIG (1 << IP_CT_DIR_ORIGINAL)
 #define NF_CT_ZONE_DIR_REPL (1 << IP_CT_DIR_REPLY)
 
@@ -19,6 +21,8 @@ int test_einval_reserved = 0;
 int test_einval_reserved_new = 0;
 int test_einval_netns_id = 0;
 int test_einval_len_opts = 0;
+int test_einval_len_opts_small_lookup = 0;
+int test_einval_len_opts_small_alloc = 0;
 int test_eproto_l4proto = 0;
 int test_enonet_netns_id = 0;
 int test_enoent_lookup = 0;
@@ -124,6 +128,28 @@ nf_ct_test(struct nf_conn *(*lookup_fn)(void *, struct bpf_sock_tuple *, u32,
 	else
 		test_einval_len_opts = opts_def.error;
 
+	opts_def.error = CT_OPTS_ERROR_GUARD;
+	ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+		       sizeof(opts_def.netns_id));
+	if (ct) {
+		bpf_ct_release(ct);
+		test_einval_len_opts_small_lookup = -EINVAL;
+	} else {
+		test_einval_len_opts_small_lookup = opts_def.error;
+	}
+
+	opts_def.error = CT_OPTS_ERROR_GUARD;
+	ct = alloc_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+		      sizeof(opts_def.netns_id));
+	if (ct) {
+		ct = bpf_ct_insert_entry(ct);
+		if (ct)
+			bpf_ct_release(ct);
+		test_einval_len_opts_small_alloc = -EINVAL;
+	} else {
+		test_einval_len_opts_small_alloc = opts_def.error;
+	}
+
 	opts_def.l4proto = IPPROTO_ICMP;
 	ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
 		       sizeof(opts_def));
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v2 0/2] bpf: Guard conntrack opts error writes
From: Yiyang Chen @ 2026-06-18 10:18 UTC (permalink / raw)
  To: bpf, netfilter-devel
  Cc: Yiyang Chen, pablo, fw, phil, davem, edumazet, kuba, pabeni,
	horms, andrii, eddyz87, ast, daniel, memxor, martin.lau, song,
	yonghong.song, jolsa, emil, shuah, kartikey406, coreteam, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <cover.1781586477.git.chenyy23@mails.tsinghua.edu.cn>

The conntrack lookup/allocation kfuncs expose an opts/opts__sz pair.
The verifier checks the caller-provided opts__sz range, but the wrappers
currently write opts->error after internal errors even when opts__sz is too
small to include that field.

Patch 1 writes opts->error only when opts__sz includes it, and uses a
single helper to fold ERR_PTR returns into the kfunc ABI result while keeping
the local nfct result variable in each wrapper.
Patch 2 adds a bpf_nf regression check that keeps a guard in opts->error
while passing opts__sz covering only netns_id.

The regression check follows the existing bpf_nf test shape.  Before the
fix, the guard is overwritten with -EINVAL even though opts__sz covers only
the first four bytes of the options object.  After the fix, the kfunc still
returns NULL for the invalid size, but the guard remains intact.

Validation, rebased and tested on bpf-next master e771677c937d
("Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd"):

  git diff --check origin/master..HEAD: OK
  scripts/checkpatch.pl --strict on 1/2 and 2/2: OK
  make O=/root/ebpf-verifier-bug-detection/kernel-build/bpf-next \
    net/netfilter/nf_conntrack_bpf.o: OK
  Focused QEMU direct-runner against XDP and TC lookup/alloc paths:
    unpatched bpf-next e771677c937d: guard overwritten with -EINVAL
    patched v2 007dfd0341cd: guard preserved as 0x12345678
  QEMU upstream bpf_nf selftest with CONFIG_NF_CONNTRACK_MARK,
  CONFIG_NF_CONNTRACK_ZONES, and legacy iptables enabled:
    ./test_progs -t bpf_nf -vv: OK
  git am of exported 1/2 and 2/2 on a fresh worktree at base: OK
  range-diff between branch commits and git-am result: equivalent

Changes in v2:
  - Rebased onto current bpf-next master.
  - Reworked patch 1 to use bpf_ct_opts_result() for the ERR_PTR-to-NULL
    conversion and guarded opts->error write, as suggested by Alexei.
  - Kept the local nfct result variable in each wrapper before returning
    through bpf_ct_opts_result().
  - Added matching Fixes tags to the selftest patch so the regression test
    can be backported with the fix.

v1: https://lore.kernel.org/bpf/cover.1781586477.git.chenyy23@mails.tsinghua.edu.cn/

Yiyang Chen (2):
  bpf: Guard conntrack opts error writes
  selftests/bpf: Cover small conntrack opts error writes

 net/netfilter/nf_conntrack_bpf.c              | 35 +++++++------------
 .../testing/selftests/bpf/prog_tests/bpf_nf.c |  6 ++++
 .../testing/selftests/bpf/progs/test_bpf_nf.c | 26 ++++++++++++++
 3 files changed, 45 insertions(+), 22 deletions(-)

base-commit: e771677c937da5808f7b6c1f0e4a97ec1a84f8a8
-- 
2.34.1
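
The size guard described for patch 1 can be sketched as follows. This is an
illustration only: the body of bpf_ct_opts_result() is not shown in this
series excerpt, so model_ct_opts, model_set_error() and
error_after_failed_call() are hypothetical, and the struct keeps only the
two leading fields the cover letter mentions (netns_id followed by error).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define CT_OPTS_ERROR_GUARD 0x12345678

/* Simplified stand-in for the options object; layout is an assumption. */
struct model_ct_opts {
	int32_t netns_id;
	int32_t error;
};

/*
 * Hypothetical helper: report err through opts->error only when the
 * caller-declared size reaches the end of that field.
 */
static void model_set_error(struct model_ct_opts *opts, uint32_t opts__sz,
			    int err)
{
	size_t need = offsetof(struct model_ct_opts, error) +
		      sizeof(opts->error);

	if (opts__sz >= need)
		opts->error = err;
}

/* Simulate a failed call and return what the caller sees in opts.error. */
static int32_t error_after_failed_call(uint32_t opts__sz)
{
	struct model_ct_opts opts = {
		.netns_id = -1,
		.error = CT_OPTS_ERROR_GUARD,
	};

	model_set_error(&opts, opts__sz, -EINVAL);
	return opts.error;
}
```

Passing a size that covers only netns_id leaves the guard value in place,
while a size covering the whole object still receives -EINVAL, which is the
pair of outcomes the selftest in patch 2 checks for.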

^ permalink raw reply
