Netdev List


* Re: [PATCH net-next] netlink: clean up failed initial dump-start state
From: Jakub Kicinski @ 2026-04-20 17:37 UTC
  To: Michael Bommarito
  Cc: David S . Miller, Eric Dumazet, Paolo Abeni, netdev, Simon Horman,
	Kuniyuki Iwashima, Kees Cook, Feng Yang, linux-kernel
In-Reply-To: <20260420162734.854587-1-michael.bommarito@gmail.com>

On Mon, 20 Apr 2026 12:27:34 -0400 Michael Bommarito wrote:
> When __netlink_dump_start() has already installed cb->skb, taken the
> module reference and set cb_running, a failure from the first
> netlink_dump(sk, true) call returns via errout_skb without unwinding the
> callback lifetime. That leaves cb_running set and defers module_put()
> and consume_skb(cb->skb) until userspace drains the socket or closes it.

On a quick look I can't see which path clears the dump state in case we
keep failing to allocate an skb. Could you add more info on that?

> Share the normal callback teardown in a helper and use it on successful
> completion and on the initial lock_taken=true failure path. Keep the
> lock_taken=false continuation path unchanged, because recvmsg()-driven
> retries legitimately preserve cb_running when they run out of receive
> room.
> 
> Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.")
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: Codex:gpt-5-4
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
> Validation inside a UML guest on current mainline:
> 
>   - An unprivileged local task (uid=65534, no CAP_NET_ADMIN) opens a
>     plain NETLINK_ROUTE socket, preloads sk_rmem_alloc with echoed
>     NLMSG_ERROR replies from an unsupported rtnetlink type, then issues
>     RTM_GETLINK | NLM_F_DUMP | NLM_F_ACK.
>   - Stock kernel: the initial __netlink_dump_start() hits the rmem gate
>     and returns via errout_skb with cb_running stuck at 1 until
>     recvmsg() or close() drives forward progress.
>   - Patched kernel: the same probe leaves cb_running clear immediately
>     on the lock_taken=true failure, and the larger-rcvbuf continuation
>     path (legitimate dump in progress) is unchanged.
> 
> A scaling pass on 3500 such wedged sockets in a 256M UML guest shows
> about 3.8-3.9 MiB of extra unreclaimable slab (/proc/meminfo
> SUnreclaim) beyond the visible queued rmem on the vulnerable kernel,
> roughly 1.1 KiB/socket. Real accumulation, but the test hits
> RLIMIT_NOFILE long before the guest approaches OOM, so this still
> looks like a local availability cleanup rather than an exhaustion
> primitive.

This should be part of the commit message, it's useful for understanding
the problem. Actually more so than the current commit msg TBH.

> No Cc: stable@ on the theory that the bug self-heals on
> recvmsg()/close and the accumulation is mild. Happy to add it and
> route to net if you'd rather see it backported.
> 
>  net/netlink/af_netlink.c | 30 +++++++++++++++++++-----------
>  1 file changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 4d609d5cf406..7019c17e6879 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2250,6 +2250,20 @@ static int netlink_dump_done(struct netlink_sock *nlk, struct sk_buff *skb,
>  	return 0;
>  }
>  
> +static void netlink_dump_cleanup(struct netlink_sock *nlk)
> +{
> +	struct module *module = nlk->cb.module;
> +	struct sk_buff *skb = nlk->cb.skb;
> +
> +	if (nlk->cb.done)
> +		nlk->cb.done(&nlk->cb);
> +
> +	WRITE_ONCE(nlk->cb_running, false);
> +	mutex_unlock(&nlk->nl_cb_mutex);
> +	module_put(module);
> +	consume_skb(skb);
> +}

It's probably better to create a helper that shares the code with 
the release path as well. And try not to switch the skb freeing 
to consume_skb().

>  static int netlink_dump(struct sock *sk, bool lock_taken)
>  {
>  	struct netlink_sock *nlk = nlk_sk(sk);
> @@ -2258,7 +2272,6 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
>  	struct sk_buff *skb = NULL;
>  	unsigned int rmem, rcvbuf;
>  	size_t max_recvmsg_len;
> -	struct module *module;
>  	int err = -ENOBUFS;
>  	int alloc_min_size;
>  	int alloc_size;
> @@ -2366,19 +2379,14 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
>  	else
>  		__netlink_sendskb(sk, skb);
>  
> -	if (cb->done)
> -		cb->done(cb);
> -
> -	WRITE_ONCE(nlk->cb_running, false);
> -	module = cb->module;
> -	skb = cb->skb;
> -	mutex_unlock(&nlk->nl_cb_mutex);
> -	module_put(module);
> -	consume_skb(skb);
> +	netlink_dump_cleanup(nlk);
>  	return 0;
>  
>  errout_skb:
> -	mutex_unlock(&nlk->nl_cb_mutex);
> +	if (lock_taken)
> +		netlink_dump_cleanup(nlk);
> +	else
> +		mutex_unlock(&nlk->nl_cb_mutex);
>  	kfree_skb(skb);
>  	return err;
>  }

If you're planning to repost - please wait until tomorrow, we ask that
revisions are at least 24h apart so that people across the timezones
have a chance to chime in.


* Re: [PATCH net-next] net: hns: use u32 for register offset in RCB TX coalescing
From: Jakub Kicinski @ 2026-04-20 17:38 UTC
  To: Agalakov Daniil
  Cc: Jian Shen, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, netdev, linux-kernel, lvc-project, Roman Razov
In-Reply-To: <20260420144047.2846673-1-ade@amicon.ru>

On Mon, 20 Apr 2026 17:40:19 +0300 Agalakov Daniil wrote:
> The local variable reg in hns_rcb_get_tx_coalesced_frames() and
> hns_rcb_set_tx_coalesced_frames() holds a register offset passed to
> dsaf_read_dev()/dsaf_write_dev(). Register offsets on this hardware
> are 32-bit values; using u64 was misleading.

net-next is closed during the merge window.

If you repost, please improve the "why". As is, I don't think this patch
is worth merging.


* Re: [PATCH-next v2 0/2] ipvs: Fix incorrect use of HK_TYPE_KTHREAD housekeeping cpumask
From: Pablo Neira Ayuso @ 2026-04-20 17:45 UTC
  To: Julian Anastasov
  Cc: Waiman Long, Simon Horman, David S. Miller, David Ahern,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Florian Westphal,
	Phil Sutter, Frederic Weisbecker, Chen Ridong, Phil Auld,
	linux-kernel, netdev, lvs-devel, netfilter-devel, coreteam,
	sheviks
In-Reply-To: <097db82c-c9d1-4532-694a-b7ecbdd67532@ssi.bg>

On Mon, Apr 20, 2026 at 08:24:56PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Fri, 3 Apr 2026, Pablo Neira Ayuso wrote:
> 
> > On Fri, Apr 03, 2026 at 05:15:50PM +0300, Julian Anastasov wrote:
> > > 
> > > 	Hello,
> > > 
> > > On Tue, 31 Mar 2026, Waiman Long wrote:
> > > 
> > > >  v2:
> > > >   - Rebased on top of linux-next
> > > > 
> > > > Since commit 041ee6f3727a ("kthread: Rely on HK_TYPE_DOMAIN for preferred
> > > > affinity management"), the HK_TYPE_KTHREAD housekeeping cpumask may no
> > > > longer be correct in showing the actual CPU affinity of kthreads that
> > > > have no predefined CPU affinity. As the ipvs networking code is still
> > > > using HK_TYPE_KTHREAD, we need to make HK_TYPE_KTHREAD reflect the
> > > > reality.
> > > > 
> > > > This patch series makes HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
> > > > and uses RCU to protect access to the HK_TYPE_KTHREAD housekeeping
> > > > cpumask.
> > > > 
> > > > Waiman Long (2):
> > > >   sched/isolation: Make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
> > > >   ipvs: Guard access of HK_TYPE_KTHREAD cpumask with RCU
> > > 
> > > 	The patchset looks good to me for nf-next, thanks!
> > > 
> > > Acked-by: Julian Anastasov <ja@ssi.bg>
> > > 
> > > 	Pablo, Florian, as a bugfix this patchset missed
> > > the chance to be applied before the changes that are in
> > > nf-next in ip_vs.h, there is little fuzz there. If there
> > > is no chance to resolve it somehow, we can apply it
> > > on top of nf-next where it now applies successfully.
> > 
> > One way to handle this is to follow up with nf-next as you suggest,
> > then send a backport that applies cleanly for -stable once it is
> > released.
> > 
> > Else, let me know if I am misunderstanding.
> 
> 	This patchset is now material for the net tree. To help it,
> I just posted patch "ipvs: fix races around est_mutex and est_cpulist"
> that can be applied before this patchset to the net tree.
> Can we get this patchset for the net tree?

Yes, I am preparing a PR.

BTW, did you get a look at the report provided by the AI assistant?

https://sashiko.dev/#/?list=org.kernel.vger.netfilter-devel

If not, please repost to get initial feedback from it.

Thanks.


* Re: [PATCH 1/9] bitfield: add FIELD_GET_SIGNED()
From: Yury Norov @ 2026-04-20 17:54 UTC
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Jonathan Cameron, David Lechner,
	Nuno Sá, Andy Shevchenko, Ping-Ke Shih, Richard Cochran,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexandre Belloni, Yury Norov, Rasmus Villemoes,
	Hans de Goede, Linus Walleij, Sakari Ailus, Salah Triki,
	Achim Gratz, Ben Collins, linux-kernel, linux-iio, linux-wireless,
	netdev, linux-rtc
In-Reply-To: <20260420111940.GE3102624@noisy.programming.kicks-ass.net>

On Mon, Apr 20, 2026 at 01:19:40PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 17, 2026 at 01:36:12PM -0400, Yury Norov wrote:
> > The bitfields are designed on the assumption that fields contain unsigned
> > integer values, thus extracting the values from the field implies
> > zero-extending.
> > 
> > Some drivers need to sign-extend their fields, and currently do it like:
> > 
> > 	dc_re += sign_extend32(FIELD_GET(0xfff000, tmp), 11);
> > 	dc_im += sign_extend32(FIELD_GET(0xfff, tmp), 11);
> > 
> > It's error-prone because it relies on the user to provide the correct
> > index of the most significant bit and the proper 32 vs 64 function flavor.
> > 
> > Thus, introduce a FIELD_GET_SIGNED() macro, which is more
> > convenient and compiles (on x86_64) to just a couple of instructions:
> > shl and sar.
> > 
> > Signed-off-by: Yury Norov <ynorov@nvidia.com>
> > ---
> >  include/linux/bitfield.h | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/include/linux/bitfield.h b/include/linux/bitfield.h
> > index 54aeeef1f0ec..35ef63972810 100644
> > --- a/include/linux/bitfield.h
> > +++ b/include/linux/bitfield.h
> > @@ -178,6 +178,22 @@
> >  		__FIELD_GET(_mask, _reg, "FIELD_GET: ");		\
> >  	})
> >  
> > +/**
> > + * FIELD_GET_SIGNED() - extract a signed bitfield element
> > + * @mask: shifted mask defining the field's length and position
> > + * @reg:  value of entire bitfield
> > + *
> > > + * Returns the sign-extended field specified by @mask from the
> > > + * bitfield passed in as @reg by masking and shifting it down.
> > + */
> > +#define FIELD_GET_SIGNED(mask, reg)					\
> > +	({								\
> > +		__BF_FIELD_CHECK(mask, reg, 0U, "FIELD_GET_SIGNED: ");	\
> > +		 ((__signed_scalar_typeof(mask))((long long)(reg) <<	\
> > +		 __builtin_clzll(mask) >> (__builtin_clzll(mask) +	\
> > +						__builtin_ctzll(mask))));\
> > +	})
> 
> IIRC clz is count-leading-zeros and ctz is count-trailing-zeros. Most of
> the other FIELD things use __bf_shf() which is defined in terms of ffs -
> 1 (which is another way of writing ctz).
> 
> So how about you start by redefining __bf_shf() in ctz, and then add
> another helper for the clz and write the thing something like:
> 
> 	((long long)(reg) << __bf_clz(mask)) >> (__bf_clz(mask) + __bf_shf(mask));

So...

I like the shorter form, but whatever we add in the bitfield.h - we'll
have to support it.

For example, __bf_shf() wasn't intended to be used outside of the
header, thus the double underscore. But there are over 100 external users
now. And to make it worse, it's broken for GCC 14 and earlier:

https://lore.kernel.org/all/20260409-field-prep-fix-v1-1-f0e9ae64f63c@imgtec.com/

So it needs to get fixed.

The bitfield.h has two __bf macros: __bf_shf() and __bf_cast_unsigned().
They are thin wrappers, but they do at least do something with the output
of the corresponding builtins. The __bf_clz() would be a pure renaming.
I'm OK with that, but some people aren't:

https://lore.kernel.org/all/20260303182845.250bb2de@kernel.org/

That's why I didn't make the FIELD_GET_SIGNED() implementation look nicer.
If you strongly prefer the shorter version, I can do that in v2.
 
> Also, since the order of the shifts is rather important, I think it
> makes sense to add this extra pair of (), even when not strictly needed,
> just to make it easier to read.

Sure, will do.
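
For readers following the thread, the shift sequence discussed above can be
modelled in user space. This is a Python sketch of the arithmetic only, not
kernel code: clz64(), ctz64(), to_signed64() and field_get_signed() are
hypothetical helper names, and the 64-bit wraparound is emulated by hand
because Python integers are unbounded.

```python
MASK64 = (1 << 64) - 1


def clz64(x):
    """Count leading zeros of a non-zero 64-bit value (models __builtin_clzll)."""
    return 64 - x.bit_length()


def ctz64(x):
    """Count trailing zeros of a non-zero 64-bit value (models __builtin_ctzll)."""
    return (x & -x).bit_length() - 1


def to_signed64(x):
    """Reinterpret the low 64 bits of x as a two's complement signed value."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def field_get_signed(mask, reg):
    """Model of ((long long)reg << clz(mask)) >> (clz(mask) + ctz(mask))."""
    lead = clz64(mask)
    trail = ctz64(mask)
    # The left shift pushes the field's top bit into bit 63 and drops
    # everything above the field.
    shifted = to_signed64(reg << lead)
    # Python's >> on a negative int is an arithmetic shift, like sar.
    return shifted >> (lead + trail)


tmp = 0xFFF001  # upper 12-bit field is all ones, lower 12-bit field holds 1
print(field_get_signed(0xFFF000, tmp))  # -1
print(field_get_signed(0x000FFF, tmp))  # 1
```

The point of the model is that the field width never has to be passed in: it
falls out of the mask, which is the error the sign_extend32() call sites are
exposed to.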


* Re: [PATCH net-next] netlink: clean up failed initial dump-start state
From: Michael Bommarito @ 2026-04-20 17:56 UTC
  To: Jakub Kicinski
  Cc: David S . Miller, Eric Dumazet, Paolo Abeni, netdev, Simon Horman,
	Kuniyuki Iwashima, Kees Cook, Feng Yang, linux-kernel
In-Reply-To: <20260420103715.347fbd4a@kernel.org>

On Mon, Apr 20, 2026 at 1:37 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On a quick look I can't see which path clears the dump state in case we
> keep failing to allocate an skb. Could you add more info on that?
> ...
> This should be part of the commit message, it's useful for understanding
> the problem. Actually more so than the current commit msg TBH.
> ...
> If you're planning to repost - please wait until tomorrow, we ask that
> revisions are at least 24h apart so that people across the timezones
> have a chance to chime in.

Thanks, good points.  I'll set a reminder and follow up tomorrow with
your ideas if we don't hear from others.

Thanks,
Mike Bommarito


* Re: pre-boot plugged SFP autoneg advertisement
From: Andrew Lunn @ 2026-04-20 17:57 UTC
  To: markus.stockhausen
  Cc: linux, hkallweit1, netdev, 'Jonas Jelonek', jan, nbd,
	'Daniel Golle'
In-Reply-To: <007701dcd0e1$11c45210$354cf630$@gmx.de>

On Mon, Apr 20, 2026 at 06:16:34PM +0200, markus.stockhausen@gmx.de wrote:
> > From: markus.stockhausen@gmx.de <markus.stockhausen@gmx.de>
> > Sent: Sunday, 19 April 2026 10:49
> > To: 'Andrew Lunn' <andrew@lunn.ch>
> > Subject: RE: pre-boot plugged SFP autoneg advertisement
> >
> > Took that hint/question and dug deeper. Added further debug
> > to each and every linkmode_copy. I think I found the culprit in 
> > a userspace ethtool call. For now I assume OpenWrt netifd.
> 
> Hi Andrew,
> 
> Once again, thanks for your help. After further investigation I can hopefully
> add more details. I think I have the whole picture now. Here is some
> additional background information about the environment.
> 
> - Realtek RTL930x devices with SFP+ module slots
> - These are driven directly by a SerDes (controlled by a downstream PCS
>   driver)
> - The DTS reads
> 
> 	port11: port@11 { 
> 		reg = <11>;
> 		label = "lan12" ;
> 		pcs-handle = <&serdes8>;
> 		phy-mode = "1000base-x";
> 		sfp = <&sfp1>;
> 		managed = "in-band-status";
> 	};
> 
> Sequence of events during boot is as follows:
> 
> - SFP module is already inserted (in my case 1G)
> - phylink_sfp_config_phy() runs long before any network config starts
> - OpenWrt netifd daemon starts and wants to configure the network interfaces
> - It reads current settings via ethtool ioctl and gets autoneg=off
> - It writes basic config values via ethtool ioctl including autoneg=off
> - Later on it starts the interface and phylink_start() is issued

I would say netifd is not optimal. I'm not sure we ever agreed to
return the full ksettings on an interface which is admin down. Many
drivers don't even connect to the PHY until open is called, and so are
likely to return -ENODEV. See phy_ethtool_set_link_ksettings().

Could you look into the behaviour of netifd, especially if it gets
-ENODEV during the first read. Does it try again after setting the
interface up?

Could you disable netifd and manually configure the interface up. Does
it get autoneg correct then?

Now, I think it is useful to be able to configure an interface when it
is admin down. So if ksettings_get does not return -ENODEV, it probably
should return the full and correct information. However, I'm not sure
your change is sufficient to do that, since what an interface can
actually do is the common subset of what the MAC, PCS and SFP can
do. So just taking the value from the SFP does not feel correct to me,
at least not without having a deeper understanding of what phylink is
doing. And Russell King is busy with other things at the moment.

So I think we are looking at multiple problems/solutions here:

netifd should do a second ksettings_get after setting the interface
admin up, and reevaluate how the interface should be configured.

If we know phylink is going to return a subset of the correct
information when the interface is admin down, maybe it should return
-ENODEV?

Is it possible in general to make phylink return the full correct
ksettings when start() has not been called? We need to think about
multiple use cases here, not just an SFP, but also a PHY, a fixed link
and a BASE-T PHY inside an SFP module. Maybe it needs to sometimes
return -ENODEV, other times it can return correct information?

       Andrew


* Re: [PATCH 7/9] wifi: rtw89: switch to using FIELD_GET_SIGNED()
From: Yury Norov @ 2026-04-20 17:59 UTC
  To: Ping-Ke Shih
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H. Peter Anvin, Andy Lutomirski, Peter Zijlstra,
	Jonathan Cameron, David Lechner, Nuno Sá, Andy Shevchenko,
	Richard Cochran, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexandre Belloni, Yury Norov,
	Rasmus Villemoes, Hans de Goede, Linus Walleij, Sakari Ailus,
	Salah Triki, Achim Gratz, Ben Collins,
	linux-kernel@vger.kernel.org, linux-iio@vger.kernel.org,
	linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-rtc@vger.kernel.org
In-Reply-To: <5fea4ea146404b55919037594ab85f1a@realtek.com>

On Mon, Apr 20, 2026 at 07:49:19AM +0000, Ping-Ke Shih wrote:
> Yury Norov <ynorov@nvidia.com> wrote:
> > --- a/drivers/net/wireless/realtek/rtw89/rtw8852b_common.c
> > +++ b/drivers/net/wireless/realtek/rtw89/rtw8852b_common.c
> > @@ -206,9 +206,9 @@ static void rtw8852bx_efuse_parsing_tssi(struct rtw89_dev *rtwdev,
> >  static bool _decode_efuse_gain(u8 data, s8 *high, s8 *low)
> >  {
> >         if (high)
> > -               *high = sign_extend32(FIELD_GET(GENMASK(7,  4), data), 3);
> > +               *high = FIELD_GET_SIGNED(GENMASK(7,  4), data);
> >         if (low)
> > -               *low = sign_extend32(FIELD_GET(GENMASK(3,  0), data), 3);
> > +               *low = FIELD_GET(GENMASK(3,  0), data);
> 
> FIELD_GET_SIGNED()?
> 
> > 
> >         return data != 0xff;
> >  }
 
Ah sorry. Will fix in v2
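
To make the effect of the missed conversion concrete, here is a small
user-space sketch in Python. It is an illustration under assumptions, not
driver code: sign_extend() and decode_efuse_gain() are hypothetical stand-ins
for sign_extend32() and _decode_efuse_gain(), and the input byte is made up.

```python
def sign_extend(value, msb_index):
    """Sign-extend value, treating bit msb_index as the sign bit."""
    sign = 1 << msb_index
    value &= (sign << 1) - 1      # keep only the field's bits
    return (value ^ sign) - sign  # fold the sign bit into a negative weight


def decode_efuse_gain(data):
    """Return (high, low_signed, low_unsigned) for one efuse byte."""
    high = sign_extend((data >> 4) & 0xF, 3)  # bits 7..4, sign-extended
    low_signed = sign_extend(data & 0xF, 3)   # bits 3..0, sign-extended
    low_unsigned = data & 0xF                 # bits 3..0, plain FIELD_GET
    return high, low_signed, low_unsigned


# Hypothetical efuse byte: high nibble 0x2, low nibble 0xE.
print(decode_efuse_gain(0x2E))  # (2, -2, 14)
```

With a plain unsigned extract, any low nibble with bit 3 set ends up as a
positive value in the s8 output instead of a negative one.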


* [ANN] netdev call - Apr 21st
From: Jakub Kicinski @ 2026-04-20 18:02 UTC
  To: netdev

Hi!

The bi-weekly call is scheduled for tomorrow at 8:30 am (PT) / 
5:30 pm (~EU), at https://bbb.lwn.net/rooms/ldm-chf-zxx-we7/join

I'd like to discuss the evolution of the process which would prepare
us for the "AI age" (read: an influx of plausible-looking yet entirely
computer-generated patches).


* Re: [PATCH-next v2 0/2] ipvs: Fix incorrect use of HK_TYPE_KTHREAD housekeeping cpumask
From: Julian Anastasov @ 2026-04-20 18:13 UTC
  To: Pablo Neira Ayuso
  Cc: Waiman Long, Simon Horman, David S. Miller, David Ahern,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Florian Westphal,
	Phil Sutter, Frederic Weisbecker, Chen Ridong, Phil Auld,
	linux-kernel, netdev, lvs-devel, netfilter-devel, coreteam,
	sheviks
In-Reply-To: <aeZmVMaMymU6ZS5S@chamomile>


	Hello,

On Mon, 20 Apr 2026, Pablo Neira Ayuso wrote:

> On Mon, Apr 20, 2026 at 08:24:56PM +0300, Julian Anastasov wrote:
> > 
> > 	Hello,
> > 
> > On Fri, 3 Apr 2026, Pablo Neira Ayuso wrote:
> > 
> > > On Fri, Apr 03, 2026 at 05:15:50PM +0300, Julian Anastasov wrote:
> > > > 
> > > > 	Hello,
> > > > 
> > > > On Tue, 31 Mar 2026, Waiman Long wrote:
> > > > 
> > > > >  v2:
> > > > >   - Rebased on top of linux-next
> > > > > 
> > > > > Since commit 041ee6f3727a ("kthread: Rely on HK_TYPE_DOMAIN for preferred
> > > > > affinity management"), the HK_TYPE_KTHREAD housekeeping cpumask may no
> > > > > longer be correct in showing the actual CPU affinity of kthreads that
> > > > > have no predefined CPU affinity. As the ipvs networking code is still
> > > > > using HK_TYPE_KTHREAD, we need to make HK_TYPE_KTHREAD reflect the
> > > > > reality.
> > > > > 
> > > > > This patch series makes HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
> > > > > and uses RCU to protect access to the HK_TYPE_KTHREAD housekeeping
> > > > > cpumask.
> > > > > 
> > > > > Waiman Long (2):
> > > > >   sched/isolation: Make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
> > > > >   ipvs: Guard access of HK_TYPE_KTHREAD cpumask with RCU
> > > > 
> > > > 	The patchset looks good to me for nf-next, thanks!
> > > > 
> > > > Acked-by: Julian Anastasov <ja@ssi.bg>
> > > > 
> > > > 	Pablo, Florian, as a bugfix this patchset missed
> > > > the chance to be applied before the changes that are in
> > > > nf-next in ip_vs.h, there is little fuzz there. If there
> > > > is no chance to resolve it somehow, we can apply it
> > > > on top of nf-next where it now applies successfully.
> > > 
> > > One way to handle this is to follow up with nf-next as you suggest,
> > > then send a backport that applies cleanly for -stable once it is
> > > released.
> > > 
> > > Else, let me know if I am misunderstanding.
> > 
> > 	This patchset is now material for the net tree. To help it,
> > I just posted patch "ipvs: fix races around est_mutex and est_cpulist"
> > that can be applied before this patchset to the net tree.
> > Can we get this patchset for the net tree?
> 
> Yes, I am preparing a PR.
> 
> BTW, did you get a look at the report provided by the AI assistant?
> 
> https://sashiko.dev/#/?list=org.kernel.vger.netfilter-devel

	Yes, I monitor it. I'm waiting for the review of
my 3+1 patches from today. And I hope the review for
this HK_TYPE_KTHREAD patchset is addressed too with
"ipvs: fix races around est_mutex and est_cpulist".

> If not, please repost to get initial feedback from it.
> 
> Thanks.

Regards

--
Julian Anastasov <ja@ssi.bg>



* Re: [PATCH net-next v9 10/10] test: Add networking selftest for eh limits
From: Tom Herbert @ 2026-04-20 18:23 UTC
  To: Simon Horman
  Cc: davem, kuba, netdev, justin.iurman, willemdebruijn.kernel, pabeni
In-Reply-To: <20260317153222.GD1710951@horms.kernel.org>

On Tue, Mar 17, 2026 at 8:32 AM Simon Horman <horms@kernel.org> wrote:
>
> On Sat, Mar 14, 2026 at 10:51:24AM -0700, Tom Herbert wrote:
> > Add a networking selftest for Extension Header limits. The
> > limits to test are in sysctls:
> >
> >       net.ipv6.enforce_ext_hdr_order
> >       net.ipv6.max_dst_opts_number
> >       net.ipv6.max_hbh_opts_number
> >       net.ipv6.max_hbh_length
> >       net.ipv6.max_dst_opts_length
> >
> > The basic idea of the test is to fabricate ICMPv6 Echo Request
> > packets with various combinations of Extension Headers. The packets
> > are sent to a host in another namespace. If an ICMPv6 Echo Reply
> > is received then the packet wasn't dropped due to a limit being
> > exceeded, and if it was dropped then we assume that a limit was
> > exceeded. For each test packet we derive an expectation as to
> > whether the packet will be dropped or not. Test success depends
> > on whether our expectation is matched. i.e. if we expect a reply
> > then the test succeeds if we see a reply, and if we don't expect a
> > reply then the test succeeds if we don't see a reply.
> >
> > The test is divided into a frontend bash script (eh_limits.sh) and a
> > backend Python script (eh_limits.py).
> >
> > The frontend sets up two network namespaces with IPv6 addresses
> > configured on veth's. We then invoke the backend to send the
> > test packets. This first pass is done with default sysctl settings.
> > On a second pass we change the various sysctl settings and run
> > again.
> >
> > The backend runs through the various test cases described in the
> > Make_Test_Packets function. This function calls Make_Packet for
> > a test case where arguments provide the Extension Header chain to
> > be tested. The Run_Test function loops through the various packets
> > and tests if a reply is received versus the expectation. If a test
> > case fails then an error status is returned by the backend.
> >
> > The backend script can also be run with the "-w <pcap_file>" to
> > write the created packets to a pcap file instead of running the
> > test.
> >
> > Signed-off-by: Tom Herbert <tom@herbertland.com>
> > ---
> >  tools/testing/selftests/net/Makefile     |   1 +
> >  tools/testing/selftests/net/eh_limits.py | 349 +++++++++++++++++++++++
> >  tools/testing/selftests/net/eh_limits.sh | 205 +++++++++++++
> >  3 files changed, 555 insertions(+)
> >  create mode 100755 tools/testing/selftests/net/eh_limits.py
> >  create mode 100755 tools/testing/selftests/net/eh_limits.sh
>
> Hi Tom,
>
> Shellcheck flags several instances of the following:
>
> - https://www.shellcheck.net/wiki/SC2154 -- ns1 is referenced but not assigned.
> - https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ...
>
> In the case of SC2086 I think this can be trivially addressed by adding
> double quotes.
>
> While I think SC2154 should probably be ignored using
>
> # shellcheck disable=SC2154
>
> We're trying to make new scripts shellcheck clean, so I'd appreciate it if
> you could look into this.

Okay, I'll fix the double quotes.

>
>
> Also, pylint has something to say about this patch.

I don't believe those warnings are critical, and the score is as
high as other Python files in the directory.

Tom



>
> ...


* Re: [net-next,v9,09/10] test: Add ext_hdr.py in networking selftests
From: Tom Herbert @ 2026-04-20 18:25 UTC
  To: Simon Horman
  Cc: netdev, justin.iurman, willemdebruijn.kernel, pabeni, willemb,
	davem, kuba, justin.iurman
In-Reply-To: <20260317152456.1742468-3-horms@kernel.org>

On Tue, Mar 17, 2026 at 8:26 AM Simon Horman <horms@kernel.org> wrote:
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
>
> For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
>
> Simon says: pylint also has quite a lot to say about this patch. I'm
> assuming it is all minor stuff. But could you look into addressing the
> issues it flags?

Yes, it's all minor and the score is good.

>
> ---
> test: Add ext_hdr.py in networking selftests
>
> This commit adds a new Python helper file that contains Extension Header
> format definitions and related functions for building IPv6 extension
> header chains in networking selftests.
>
> > diff --git a/tools/testing/selftests/net/ext_hdr.py b/tools/testing/selftests/net/ext_hdr.py
> > new file mode 100755
> > index 000000000000..bfb7da4a7c88
> > --- /dev/null
> > +++ b/tools/testing/selftests/net/ext_hdr.py
> > @@ -0,0 +1,385 @@
> > +#!/usr/bin/env python3
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +# Helper functions for creating extension headers using scapy
> > +
> > +import ctypes
> > +import shlex
> > +import socket
> > +import sys
> > +import subprocess
> > +import scapy
> > +import proto_nums
> > +
> > +
> > +# Read a sysctl
> > +def sysctl_read(name):
> > +    try:
> > +        # shlex.split helps handle arguments correctly
> > +        command = shlex.split(f"sysctl -n {name}")
> > +        # Use check=True to raise an exception if the command fails
> > +        result = subprocess.run(command, check=True,
> > +                    capture_output=True, text=True)
> > +        value = result.stdout.strip()
> > +    except subprocess.CalledProcessError as ex:
> > +        print(f"Error reading sysctl: {ex.stderr}")
> > +    except FileNotFoundError:
> > +        print("The 'sysctl' command was not found. "
> > +              "Check your system's PATH.")
> > +
> > +    return int(value)
>
> What happens when an exception is caught here? When either
> subprocess.CalledProcessError or FileNotFoundError occurs, the exception
> handlers print an error message but don't set 'value'. The function then
> reaches the 'return int(value)' statement, which will raise
> UnboundLocalError.
>
> This will crash the test in several places:
>
> tools/testing/selftests/net/ext_hdr.py:check_eh_order() uses the return:
>     do_check = sysctl_read("net.ipv6.enforce_ext_hdr_order")
>
> tools/testing/selftests/net/ext_hdr.py:make_eh_chain() uses the return:
>     pair = make_full_hbh_dst_options_list(nnext_hdr, args,
>         sysctl_read("net.ipv6.max_hbh_opts_number"),
>         sysctl_read("net.ipv6.max_hbh_length"))
>
> and:
>     pair = make_full_hbh_dst_options_list(nnext_hdr, args,
>         sysctl_read("net.ipv6.max_dst_opts_number"),
>         sysctl_read("net.ipv6.max_dst_opts_length"))
>
> Should the exception handlers return a default value or re-raise to let
> callers handle it?
>
> > +
> > +# Common definitions for Destination and Hop-by-Hop options
>
> [ ... ]
>
> Should ext_hdr.py be added to TEST_FILES in the Makefile? The selftests
> build system documentation requires that any file imported by test scripts
> must be added to TEST_FILES. Without this, tests will work in the source
> tree but fail after 'make install' with ModuleNotFoundError.

Yes, I will add missing files

>
> The file is imported at line 12 ('import proto_nums'), and appears to be
> a library meant for import rather than direct execution. Both ext_hdr.py
> and proto_nums.py need to be added to TEST_FILES in
> tools/testing/selftests/net/Makefile.

Yes, will do
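
On the sysctl_read() question raised in the review above (the handlers print
but never set 'value'), one possible shape for a fix is sketched below. This
is an assumption about how it could be handled, not what the next revision
will do: the 'default' and 'run' parameters are illustrative additions that
are not in the patch, with 'run' there only so the function can be exercised
without a real sysctl binary.

```python
import subprocess


def sysctl_read(name, default=None, run=subprocess.run):
    """Return the integer value of sysctl `name`.

    On failure, return `default` if the caller supplied one; otherwise
    re-raise so the test fails loudly instead of hitting UnboundLocalError.
    """
    try:
        # Passing an argument list avoids the need for shlex.split().
        result = run(["sysctl", "-n", name], check=True,
                     capture_output=True, text=True)
        return int(result.stdout.strip())
    except (subprocess.CalledProcessError, FileNotFoundError,
            ValueError) as ex:
        if default is None:
            raise
        print(f"Error reading sysctl {name}: {ex}")
        return default
```

Re-raising by default keeps a missing sysctl visible as a test failure, while
callers that can tolerate an absent knob can opt in to a fallback value.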


* Re: [PATCH net v3 1/1] net: l3mdev: Reject non-L3 uppers in slave helpers
From: Ido Schimmel @ 2026-04-20 18:26 UTC
  To: Ren Wei
  Cc: netdev, idosch, dsahern, davem, edumazet, kuba, pabeni, horms,
	jiri, yifanwucs, tomapufckgml, yuantan098, bird, royenheart
In-Reply-To: <20260420113208.GA972415@shredder>

On Mon, Apr 20, 2026 at 02:32:08PM +0300, Ido Schimmel wrote:
> On Sun, Apr 19, 2026 at 10:53:32PM +0800, Ren Wei wrote:
> > From: Haoze Xie <royenheart@gmail.com>
> > 
> > Several l3mdev slave-side helpers resolve an upper device and then use
> > l3mdev_ops without first proving that the resolved device is still a
> > valid L3 master.
> > 
> > During slave transition, an RCU reader can transiently observe an upper
> > that is not an L3 master. Guard the affected slave-resolved paths by
> > requiring the resolved upper to still be an L3 master before using
> > l3mdev_ops, while keeping existing L3 RX handler providers intact.
> > 
> > Fixes: fdeea7be88b1 ("net: vrf: Set slave's private flag before linking")
> > Cc: stable@kernel.org
> > Reported-by: Yifan Wu <yifanwucs@gmail.com>
> > Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> > Co-developed-by: Yuan Tan <yuantan098@gmail.com>
> > Signed-off-by: Yuan Tan <yuantan098@gmail.com>
> > Suggested-by: Xin Liu <bird@lzu.edu.cn>
> > Tested-by: Haoze Xie <royenheart@gmail.com>
> > Signed-off-by: Haoze Xie <royenheart@gmail.com>
> > Signed-off-by: Ao Zhou <n05ec@lzu.edu.cn>
> 
> I think it's fine for net:
> 
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>

Thought about this again. I would like to check another approach
(synchronize_net() after clearing IFF_L3MDEV_SLAVE). Will update
tomorrow.


* Re: [PATCH net] tcp: make probe0 timer handle expired user timeout
From: Jakub Kicinski @ 2026-04-20 18:33 UTC
  To: Eric Dumazet, Neal Cardwell
  Cc: Altan Hacigumus, Kuniyuki Iwashima, David S . Miller, David Ahern,
	Paolo Abeni, Simon Horman, netdev, Enke Chen
In-Reply-To: <20260414013634.43997-1-ahacigu.linux@gmail.com>

On Mon, 13 Apr 2026 18:36:34 -0700 Altan Hacigumus wrote:
> tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies
> using subtraction with an unsigned lvalue.  If elapsed probing time
> already exceeds the configured TCP_USER_TIMEOUT, the subtraction
> underflows and yields a large value.
> 
> Handle this expiration case similarly to tcp_clamp_rto_to_user_timeout().
> 
> Fixes: 344db93ae3ee ("tcp: make TCP_USER_TIMEOUT accurate for zero window probes")
> Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com>

Hi Eric, Neal, does this make sense?

> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 5a14a53a3c9e..4a43356a4e06 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -50,7 +50,8 @@ static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
>  u32 tcp_clamp_probe0_to_user_timeout(const struct sock *sk, u32 when)
>  {
>  	const struct inet_connection_sock *icsk = inet_csk(sk);
> -	u32 remaining, user_timeout;
> +	u32 user_timeout;
> +	s32 remaining;
>  	s32 elapsed;
>  
>  	user_timeout = READ_ONCE(icsk->icsk_user_timeout);
> @@ -61,6 +62,8 @@ u32 tcp_clamp_probe0_to_user_timeout(const struct sock *sk, u32 when)
>  	if (unlikely(elapsed < 0))
>  		elapsed = 0;
>  	remaining = msecs_to_jiffies(user_timeout) - elapsed;
> +	if (remaining <= 0)
> +		return 1;
>  	remaining = max_t(u32, remaining, TCP_TIMEOUT_MIN);
>  
>  	return min_t(u32, remaining, when);
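
The underflow described in the commit message can be reproduced outside
the kernel. What follows is a minimal userspace sketch, not the kernel
code: clamp_old() and clamp_new() are hypothetical stand-ins for the
function before and after the quoted diff, TIMEOUT_MIN is an assumed
placeholder for TCP_TIMEOUT_MIN, and all values are plain ticks rather
than jiffies.

```c
#include <stdint.h>

#define TIMEOUT_MIN 2U /* assumed placeholder for TCP_TIMEOUT_MIN */

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Before: 'remaining' is unsigned, so timeout < elapsed wraps around. */
static uint32_t clamp_old(uint32_t timeout, int32_t elapsed, uint32_t when)
{
	uint32_t remaining;

	if (elapsed < 0)
		elapsed = 0;
	remaining = timeout - (uint32_t)elapsed;
	remaining = max_u32(remaining, TIMEOUT_MIN);
	return min_u32(remaining, when);
}

/* After: 'remaining' is signed and an expired timeout fires right away. */
static uint32_t clamp_new(uint32_t timeout, int32_t elapsed, uint32_t when)
{
	int32_t remaining;

	if (elapsed < 0)
		elapsed = 0;
	remaining = (int32_t)(timeout - (uint32_t)elapsed);
	if (remaining <= 0)
		return 1;
	return min_u32(max_u32((uint32_t)remaining, TIMEOUT_MIN), when);
}
```

With a 100-tick timeout and 150 ticks elapsed, clamp_old() hands back
the caller's 'when' unchanged because the wrapped value is huge, while
clamp_new() returns 1.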


^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix MSG_ZEROCOPY pinned-pages accounting
From: Bobby Eshleman @ 2026-04-20 18:34 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, Eric Dumazet, Simon Horman, kvm, Arseniy Krasnov,
	David S. Miller, Paolo Abeni, Jakub Kicinski, Michael S. Tsirkin,
	Jason Wang, virtualization, linux-kernel, Eugenio Pérez,
	Xuan Zhuo, Stefan Hajnoczi, Yiming Qian
In-Reply-To: <20260420132051.217589-1-sgarzare@redhat.com>

On Mon, Apr 20, 2026 at 03:20:51PM +0200, Stefano Garzarella wrote:
> From: Stefano Garzarella <sgarzare@redhat.com>
> 
> virtio_transport_init_zcopy_skb() uses iter->count as the size argument
> for msg_zerocopy_realloc(), which in turn passes it to
> mm_account_pinned_pages() for RLIMIT_MEMLOCK accounting. However, this
> function is called after virtio_transport_fill_skb() has already consumed
> the iterator via __zerocopy_sg_from_iter(), so on the last skb, iter->count
> will be 0, skipping the RLIMIT_MEMLOCK enforcement.
> 
> Pass pkt_len (the total bytes being sent) as an explicit parameter to
> virtio_transport_init_zcopy_skb() instead of reading the already-consumed
> iter->count.
> 
> This matches TCP and UDP, which both call msg_zerocopy_realloc() with
> the original message size.
> 
> Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
> Reported-by: Yiming Qian <yimingqian591@gmail.com>
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
>  net/vmw_vsock/virtio_transport_common.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 0742091beae7..416d533f493d 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -73,6 +73,7 @@ static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops,
>  static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>  					   struct sk_buff *skb,
>  					   struct msghdr *msg,
> +					   size_t pkt_len,
>  					   bool zerocopy)
>  {
>  	struct ubuf_info *uarg;
> @@ -81,12 +82,10 @@ static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>  		uarg = msg->msg_ubuf;
>  		net_zcopy_get(uarg);
>  	} else {
> -		struct iov_iter *iter = &msg->msg_iter;
>  		struct ubuf_info_msgzc *uarg_zc;
>  
>  		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> -					    iter->count,
> -					    NULL, false);
> +					    pkt_len, NULL, false);
>  		if (!uarg)
>  			return -1;
>  
> @@ -398,11 +397,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  		 * each iteration. If this is last skb for this buffer
>  		 * and MSG_ZEROCOPY mode is in use - we must allocate
>  		 * completion for the current syscall.
> +		 *
> +		 * Pass pkt_len because msg iter is already consumed
> +		 * by virtio_transport_fill_skb(), so iter->count
> +		 * can not be used for RLIMIT_MEMLOCK pinned-pages
> +		 * accounting done by msg_zerocopy_realloc().
>  		 */
>  		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>  		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>  			if (virtio_transport_init_zcopy_skb(vsk, skb,
>  							    info->msg,
> +							    pkt_len,
>  							    can_zcopy)) {
>  				kfree_skb(skb);
>  				ret = -ENOMEM;
> -- 
> 2.53.0
> 

Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
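
The ordering problem in the commit message can be shown with a small
userspace model. This is only a sketch under stated assumptions:
struct iter_model, fill_chunk() and the two accounted_*() helpers are
hypothetical names, and the model keeps nothing of the real iov_iter
beyond a byte count.

```c
#include <stddef.h>

struct iter_model {
	size_t count; /* bytes not yet consumed, like iov_iter.count */
};

/* Consume up to 'max' bytes from the iterator, as a fill step would. */
static size_t fill_chunk(struct iter_model *it, size_t max)
{
	size_t n = it->count < max ? it->count : max;

	it->count -= n;
	return n;
}

/*
 * Buggy order: the size used for accounting is read from the iterator
 * after the last chunk has drained it. 'chunk' must be non-zero.
 */
static size_t accounted_after_fill(size_t pkt_len, size_t chunk)
{
	struct iter_model it = { .count = pkt_len };

	while (it.count > chunk)
		fill_chunk(&it, chunk);
	fill_chunk(&it, chunk);	/* last chunk empties the iterator */
	return it.count;	/* nothing left to account */
}

/* Fixed order: the caller passes the total length explicitly. */
static size_t accounted_with_len(size_t pkt_len, size_t chunk)
{
	struct iter_model it = { .count = pkt_len };

	while (it.count)
		fill_chunk(&it, chunk);
	return pkt_len;
}
```

For a 10000-byte send split into 4096-byte chunks, the first helper
accounts 0 bytes and the second accounts all 10000.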

^ permalink raw reply

* Re: [PATCH bpf] bpf: Fix NULL pointer dereference in bpf_sk_storage_clone and diag paths
From: Martin KaFai Lau @ 2026-04-20 18:36 UTC (permalink / raw)
  To: Weiming Shi
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Martin KaFai Lau, Alexei Starovoitov, Amery Hung,
	Leon Hwang, Kees Cook, Fushuai Wang, Menglong Dong, netdev, bpf,
	Xiang Mei
In-Reply-To: <20260420161432.3919396-2-bestswngs@gmail.com>

On Mon, Apr 20, 2026 at 09:14:33AM -0700, Weiming Shi wrote:
> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> index f8338acebf077..3b487280f50fa 100644
> --- a/net/core/bpf_sk_storage.c
> +++ b/net/core/bpf_sk_storage.c
> @@ -172,7 +172,7 @@ int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
>  		struct bpf_map *map;
>  
>  		smap = rcu_dereference(SDATA(selem)->smap);
> -		if (!(smap->map.map_flags & BPF_F_CLONE))
> +		if (!smap || !(smap->map.map_flags & BPF_F_CLONE))
>  			continue;
>  
>  		/* Note that for lockless listeners adding new element
> @@ -547,6 +547,8 @@ static int diag_get(struct bpf_local_storage_data *sdata, struct sk_buff *skb)
>  		return -EMSGSIZE;
>  
>  	smap = rcu_dereference(sdata->smap);
> +	if (!smap)
> +		goto errout;

You need to study it more thoroughly and the code around it
instead of rushing to fix a problem discovered by AI/bot (?).

This is now treated as an -EMSGSIZE error by diag_get().

>  	if (nla_put_u32(skb, SK_DIAG_BPF_STORAGE_MAP_ID, smap->map.id))
>  		goto errout;
>  
> @@ -599,6 +601,8 @@ static int bpf_sk_storage_diag_put_all(struct sock *sk, struct sk_buff *skb,
>  	saved_len = skb->len;
>  	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
>  		smap = rcu_dereference(SDATA(selem)->smap);
> +		if (!smap)
> +			continue;
>  		diag_size += nla_value_size(smap->map.value_size);
>  
>  		if (nla_stgs && diag_get(SDATA(selem), skb))

... and here it will eventually return an -EMSGSIZE to the user space
which is incorrect. Pass the smap to diag_get instead.

pw-bot: cr
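
One way to read the "pass the smap to diag_get" suggestion, as a
userspace sketch only: the caller reads the map pointer once, skips
entries whose map is gone, and hands the validated pointer to the
callee, so -EMSGSIZE is kept for the real "no room" case. All names
below (map_model, sdata_model, diag_get_model, diag_put_all_model) are
hypothetical and do not reflect the real kernel signatures.

```c
#include <errno.h>
#include <stddef.h>

struct map_model {
	unsigned int id;
};

struct sdata_model {
	struct map_model *smap;	/* may be cleared concurrently */
};

/* The callee gets a validated map pointer and never re-reads sdata->smap. */
static int diag_get_model(const struct map_model *smap,
			  unsigned int *out, size_t room)
{
	if (!room)
		return -EMSGSIZE;	/* genuine lack of space only */
	*out = smap->id;
	return 0;
}

/* Walk n elements; entries without a map are skipped, not treated as errors. */
static int diag_put_all_model(const struct sdata_model *elems, size_t n,
			      unsigned int *out, size_t room, size_t *written)
{
	size_t i;

	*written = 0;
	for (i = 0; i < n; i++) {
		/* Read once (rcu_dereference() in the kernel). */
		const struct map_model *smap = elems[i].smap;
		int err;

		if (!smap)
			continue;
		err = diag_get_model(smap, &out[*written], room - *written);
		if (err)
			return err;
		(*written)++;
	}
	return 0;
}
```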

^ permalink raw reply

* Re: [PATCH] llc: Return -EINPROGRESS from llc_ui_connect()
From: Jakub Kicinski @ 2026-04-20 18:41 UTC (permalink / raw)
  To: Ernestas Kulik; +Cc: netdev, linux-kernel
In-Reply-To: <20260415063457.1008868-1-ernestas.k@iconn-networks.com>

On Wed, 15 Apr 2026 09:34:57 +0300 Ernestas Kulik wrote:
> Given a zero sk_sndtimeo, llc_ui_connect() skips waiting for state
> change and returns 0, confusing userspace applications that will assume
> the socket is connected, making e.g. getpeername() calls error out.
> 
> Set rc to -EINPROGRESS before considering blocking, akin to AF_INET
> sockets.

Please add a note on how you discovered this issue.
Including whether you're actively using this code or just scanning it
for bugs.

> diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
> index 59d593bb5d18..9317d092ba84 100644
> --- a/net/llc/af_llc.c
> +++ b/net/llc/af_llc.c
> @@ -515,10 +515,12 @@ static int llc_ui_connect(struct socket *sock, struct sockaddr_unsized *uaddr,
>  		sock->state  = SS_UNCONNECTED;
>  		sk->sk_state = TCP_CLOSE;
>  		goto out;
>  	}
>  
> +	rc = -EINPROGRESS;

Isn't this a bit of an odd placement? ..

>  	if (sk->sk_state == TCP_SYN_SENT) {
>  		const long timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
>  
>  		if (!timeo || !llc_ui_wait_for_conn(sk, timeo))
>  			goto out;

.. I suspect you mean to target this branch, right?
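
A rough userspace model of the placement being asked about, assuming
the intent is to report -EINPROGRESS only when the socket is still in
the SYN-sent state and the send timeout is zero. The enum and
connect_rc_model() are hypothetical, and the blocking wait is assumed
to succeed rather than modelled.

```c
#include <errno.h>

enum state_model { ST_SYN_SENT, ST_ESTABLISHED };

/*
 * 'timeo' is the send timeout; 0 means the caller does not want to block.
 * The blocking wait itself is not modelled and is assumed to succeed.
 */
static int connect_rc_model(enum state_model st, long timeo)
{
	if (st == ST_SYN_SENT && !timeo)
		return -EINPROGRESS;	/* handshake still pending */
	return 0;			/* connected, or waited successfully */
}
```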

^ permalink raw reply

* Re: [PATCH net] tcp: make probe0 timer handle expired user timeout
From: Eric Dumazet @ 2026-04-20 18:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Neal Cardwell, Altan Hacigumus, Kuniyuki Iwashima,
	David S . Miller, David Ahern, Paolo Abeni, Simon Horman, netdev,
	Enke Chen
In-Reply-To: <20260420113346.34735a1e@kernel.org>

On Mon, Apr 20, 2026 at 11:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 13 Apr 2026 18:36:34 -0700 Altan Hacigumus wrote:
> > tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies
> > using subtraction with an unsigned lvalue.  If elapsed probing time
> > already exceeds the configured TCP_USER_TIMEOUT, the subtraction
> > underflows and yields a large value.
> >
> > Handle this expiration case similarly to tcp_clamp_rto_to_user_timeout().
> >
> > Fixes: 344db93ae3ee ("tcp: make TCP_USER_TIMEOUT accurate for zero window probes")
> > Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com>
>
> Hi Eric, Neal, does this make sense?
>

I missed this patch. I will take a look asap.
Thanks.

^ permalink raw reply

* Re: [PATCH net v2] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Justin Iurman @ 2026-04-20 18:55 UTC (permalink / raw)
  To: Ido Schimmel, daniel
  Cc: kuba, edumazet, dsahern, tom, willemdebruijn.kernel, pabeni,
	netdev
In-Reply-To: <20260419143137.GA885197@shredder>

On 4/19/26 16:31, Ido Schimmel wrote:
> On Sun, Apr 19, 2026 at 12:37:35AM +0200, Justin Iurman wrote:
>> Nope. But if it happens, users would be confused as max_dst_opts_cnt would
>> not have the same meaning in two different code paths. OTOH, I agree that
>> such a situation would look suspicious. I guess it's fine to keep your patch
>> as is and to not over-complicate things unnecessarily.
> 
> I agree that it's weird to reuse max_dst_opts_cnt here:
> 
> 1. The meaning is different from the Rx path.
> 
> 2. We only enforce max_dst_opts_cnt, but not max_dst_opts_len.
> 
> 3. The default is derived from the initial netns, unlike in the Rx path.
> 
> Given the above and that:
> 
> 1. We believe that 8 options until the tunnel encapsulation limit option
> is liberal enough.
> 
> 2. We don't want to over-complicate things.
> 
> Can we go with a hard-coded 8 and see if anyone complains? In the
> unlikely case that someone complains we can at least gain some insight
> into how this option is actually used with tunnels.

In general, I'm not a big fan of hard-coded values, but I also think 
that in this context it would make sense to do so. This is not a strong 
+1, let's say it's more a "not against it".
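
For reference, a hard-coded bound could look roughly like the userspace
sketch below. It is not the kernel's ip6_tnl_parse_tlv_enc_lim():
find_encap_limit() and MAX_OPTS_SCANNED are hypothetical names, the
buffer holds only the TLV area of a destination options header, and
counting padding options toward the limit is an assumption made here
for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

#define TLV_PAD1         0	/* one-byte padding option */
#define TLV_ENCAP_LIMIT  4	/* tunnel encapsulation limit (RFC 2473) */
#define MAX_OPTS_SCANNED 8	/* the hard-coded bound discussed above */

/*
 * Return the offset of the tunnel encapsulation limit option inside
 * 'opts', or -1 if it is malformed, absent, or not among the first
 * MAX_OPTS_SCANNED options.
 */
static int find_encap_limit(const uint8_t *opts, size_t len)
{
	size_t off = 0;
	int seen = 0;

	while (off < len && seen < MAX_OPTS_SCANNED) {
		size_t optlen;

		if (opts[off] == TLV_PAD1) {
			optlen = 1;
		} else {
			if (len - off < 2)
				return -1;	/* truncated TLV header */
			optlen = (size_t)opts[off + 1] + 2;
			if (optlen > len - off)
				return -1;	/* truncated TLV body */
		}
		if (opts[off] == TLV_ENCAP_LIMIT && optlen == 3)
			return (int)off;
		seen++;
		off += optlen;
	}
	return -1;
}
```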

^ permalink raw reply

* Re: [PATCH net] netconsole: avoid out-of-bounds access on empty string in trim_newline()
From: Gustavo Luiz Duarte @ 2026-04-20 18:59 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Matthew Wood, netdev, linux-kernel, kernel-team,
	stable
In-Reply-To: <20260420-netcons_trim_newline-v1-1-dc35889aeedf@debian.org>

On Mon, Apr 20, 2026 at 11:34 AM Breno Leitao <leitao@debian.org> wrote:
>
> trim_newline() unconditionally dereferences s[len - 1] after computing
> len = strnlen(s, maxlen). When the string is empty, len is 0 and the
> expression underflows to s[(size_t)-1], reading (and potentially
> writing) one byte before the buffer.
>
> The two callers feed trim_newline() with the result of strscpy() from
> configfs store callbacks (dev_name_store, userdatum_value_store).
> configfs guarantees count >= 1 reaches the callback, but the byte
> itself can be NUL: a userspace write(fd, "\0", 1) leaves the
> destination empty after strscpy() and triggers the underflow. The OOB
> write only fires if the adjacent byte happens to be '\n', so this is
> not a security issue, but the access is undefined behaviour either way.
>
> This pattern is commonly flagged by LLM-based code reviewers. While it
> is not a security fix, the underlying access is undefined behaviour and
> the change is small and self-contained, so it is a reasonable candidate
> for the stable trees.
>
> Guard the dereference on a non-zero length.
>
> Fixes: ae001dc67907 ("net: netconsole: move newline trimming to function")
> Cc: stable@vger.kernel.org
> Signed-off-by: Breno Leitao <leitao@debian.org>

Reviewed-by: Gustavo Luiz Duarte <gustavold@gmail.com>

> ---
>  drivers/net/netconsole.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
> index 3c9acd6e49e86..205384dab89a6 100644
> --- a/drivers/net/netconsole.c
> +++ b/drivers/net/netconsole.c
> @@ -497,6 +497,8 @@ static void trim_newline(char *s, size_t maxlen)
>         size_t len;
>
>         len = strnlen(s, maxlen);
> +       if (!len)
> +               return;
>         if (s[len - 1] == '\n')
>                 s[len - 1] = '\0';
>  }
>
> ---
> base-commit: c7275b05bc428c7373d97aa2da02d3a7fa6b9f66
> change-id: 20260420-netcons_trim_newline-36f6ec3b9820
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
>
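
The guarded function is small enough to try in userspace. The sketch
below mirrors the patched trim_newline(); the only deviation is that
memchr() stands in for strnlen() so the example needs nothing beyond
standard C.

```c
#include <stddef.h>
#include <string.h>

/* Strip one trailing '\n' from s, looking at no more than maxlen bytes. */
static void trim_newline(char *s, size_t maxlen)
{
	const char *nul = memchr(s, '\0', maxlen);
	size_t len = nul ? (size_t)(nul - s) : maxlen; /* same as strnlen() */

	if (!len)
		return;	/* empty string: s[len - 1] would be s[-1] */
	if (s[len - 1] == '\n')
		s[len - 1] = '\0';
}
```

Calling it on an empty string that sits right after a '\n' byte leaves
that byte alone, which is the case the guard protects.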

^ permalink raw reply

* Re: [PATCH] gtp: disable BH before calling udp_tunnel_xmit_skb()
From: Justin Iurman @ 2026-04-20 19:02 UTC (permalink / raw)
  To: David Carlier, Pablo Neira Ayuso, Harald Welte, Andrew Lunn,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Weiming Shi, osmocom-net-gprs, netdev, linux-kernel, stable
In-Reply-To: <20260417055408.4667-1-devnexen@gmail.com>

On 4/17/26 07:54, David Carlier wrote:
> gtp_genl_send_echo_req() runs as a generic netlink doit handler in
> process context with BH not disabled. It calls udp_tunnel_xmit_skb(),
> which eventually invokes iptunnel_xmit() — that uses __this_cpu_inc/dec
> on softnet_data.xmit.recursion to track the tunnel xmit recursion level.
> 
> Without local_bh_disable(), the task may migrate between
> dev_xmit_recursion_inc() and dev_xmit_recursion_dec(), breaking the
> per-CPU counter pairing. The result is stale or negative recursion
> levels that can later produce false-positive
> SKB_DROP_REASON_RECURSION_LIMIT drops on either CPU.
> 
> The other udp_tunnel_xmit_skb() call sites in gtp.c are unaffected:
> the data path runs under ndo_start_xmit and the echo response handlers
> run from the UDP encap rx softirq, both with BH already disabled.
> 
> Fix it by disabling BH around the udp_tunnel_xmit_skb() call, mirroring
> commit 2cd7e6971fc2 ("sctp: disable BH before calling
> udp_tunnel_xmit_skb()").

Why not fix iptunnel_xmit() directly, rather than fixing all possible 
callers? Basically, just like we did for lwtunnel_{output|xmit}(). The 
advantage would be that we no longer have to worry about BHs in the 
callers, and BHs would only be disabled when necessary.
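
To make the counter pairing problem concrete, here is a toy userspace
model, not kernel code. struct xmit_model and xmit_once() are
hypothetical; the two "CPUs" are just array slots, and migration is
simulated by passing a different index for the decrement.

```c
#define NR_CPUS_MODEL 2

struct xmit_model {
	int recursion[NR_CPUS_MODEL];	/* one recursion counter per CPU */
};

/*
 * One transmit: increment on the CPU where the call starts and
 * decrement on the CPU where it ends. With BH disabled both indexes
 * are the same; without it the task may have migrated in between.
 */
static void xmit_once(struct xmit_model *m, int cpu_at_inc, int cpu_at_dec)
{
	m->recursion[cpu_at_inc]++;
	/* ... tunnel transmit would happen here ... */
	m->recursion[cpu_at_dec]--;
}
```

A single migrated call leaves one slot at 1 and the other at -1, and
nothing corrects either value later. Disabling BH inside the helper
that owns the counter would keep both indexes equal for every caller.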

^ permalink raw reply

* Re: pre-boot plugged SFP autoneg advertisement
From: markus.stockhausen @ 2026-04-20 19:10 UTC (permalink / raw)
  To: 'Andrew Lunn'
  Cc: linux, hkallweit1, netdev, 'Jonas Jelonek', jan, nbd,
	'Daniel Golle'
In-Reply-To: <664e6e24-4a94-43fa-8769-773a37a01c66@lunn.ch>

> From: Andrew Lunn <andrew@lunn.ch> 
> Sent: Monday, April 20, 2026 19:58
> To: markus.stockhausen@gmx.de
> 
> > Sequence of events during boot is as follows:
> > 
> > - SFP module is already inserted (in my case 1G)
> > - phylink_sfp_config_phy() runs long before any network config starts
> > - OpenWrt netifd daemon starts and wants to configure the network
> >   interfaces
> > - It reads current settings via ethtool ioctl and gets autoneg=off
> > - It writes basic config values via ethtool ioctl including autoneg=off
> > - Later on it starts the interface and phylink_start() is issued
>
> I would say netifd is not optimal. I'm not sure we ever agreed to
> return the full ksettings on an interface which is admin down. Many
> drivers don't even connect to the PHY until open is called, and so are
> likely to return -ENODEV. See phy_ethtool_set_link_ksettings().
>
> Could you look into the behaviour of netifd, especially if it gets
> -ENODEV during the first read. Does it try again after setting the
> interface up?

Netifd has no issues reading or writing link settings in admin-down
state. When it gets rc=0 it assumes that all values are filled in,
changes the needed attributes and writes them back. I retested and think
there might be a solution to avoid unneeded ioctl access (see [1]).

> If we know phylink is going to return a subset of the correct
> information when the interface is admin down, maybe it should return
> -ENODEV?

Or (stupid idea) phylink_ethtool_ksettings_set() should not accept all 
settings in this state. 

Thanks for your valuable input.

Markus

[1] https://github.com/openwrt/netifd/issues/76#issuecomment-4283478081


^ permalink raw reply

* Re: [PATCH net 1/1] net/rose: hold listener socket during call request handling
From: Yuan Tan @ 2026-04-20 19:11 UTC (permalink / raw)
  To: Simon Horman, Ren Wei
  Cc: yuantan098, linux-hams, netdev, davem, edumazet, kuba, pabeni,
	kees, takamitz, kuniyu, jiayuan.chen, mingo, stanksal, jlayton,
	yifanwucs, tomapufckgml, bird, tonanli66
In-Reply-To: <20260420162605.GV280379@horms.kernel.org>


On 4/20/2026 9:26 AM, Simon Horman wrote:
> On Fri, Apr 17, 2026 at 07:01:51PM +0800, Ren Wei wrote:
>> From: Nan Li <tonanli66@gmail.com>
>>
>> The call request receive path keeps using the listener socket after the
>> lookup lock has been dropped. Keep the listener alive across the
>> remaining validation and child socket setup by taking a reference in the
>> lookup path and releasing it once request handling is finished.
>>
>> This makes listener lifetime handling explicit and avoids races with
>> concurrent socket teardown.
>>
>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>> Cc: stable@kernel.org
>> Reported-by: Yifan Wu <yifanwucs@gmail.com>
>> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
>> Reported-by: Xin Liu <bird@lzu.edu.cn>
>> Co-developed-by: Yuan Tan <yuantan098@gmail.com>
>> Signed-off-by: Yuan Tan <yuantan098@gmail.com>
>> Signed-off-by: Nan Li <tonanli66@gmail.com>
>> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
>> ---
>>  net/rose/af_rose.c | 24 +++++++++++++++++++-----
>>  1 file changed, 19 insertions(+), 5 deletions(-)
> Reviewed-by: Simon Horman <horms@kernel.org>
>
> Sashiko has provided some feedback on this patch.
> I do not believe they relate to shortcomings in this patch,
> and I do not believe they should block progress of this patch.
> You may want to look over them for areas to investigate as follow-up
> (maybe you already did :)
>
> ...


Thanks for your review! Yes this module still has other issues that
haven't been fixed. We'll finish what we're currently working on and
then take a look :)


^ permalink raw reply

* Re: [PATCH 1/2] Bluetooth: ISO: Fix data-race on dst in iso_sock_connect()
From: Luiz Augusto von Dentz @ 2026-04-20 19:23 UTC (permalink / raw)
  To: SeungJu Cheon
  Cc: marcel, linux-bluetooth, netdev, linux-kernel, me, skhan,
	linux-kernel-mentees
In-Reply-To: <20260418053401.128483-2-suunj1331@gmail.com>

Hi SeungJu,

On Sat, Apr 18, 2026 at 1:34 AM SeungJu Cheon <suunj1331@gmail.com> wrote:
>
> iso_sock_connect() copies the destination address into
> iso_pi(sk)->dst under lock_sock, then releases the lock and reads
> it back with bacmp() to decide between the CIS and BIS connect
> paths:
>
>     lock_sock(sk);
>     bacpy(&iso_pi(sk)->dst, &sa->iso_bdaddr);
>     iso_pi(sk)->dst_type = sa->iso_bdaddr_type;
>     release_sock(sk);
>
>     if (bacmp(&iso_pi(sk)->dst, BDADDR_ANY))  // <- no lock held
>
> This read after release_sock() races with any concurrent write to
> iso_pi(sk)->dst on the same socket.
>
> Fix by performing the bacmp() inside the lock_sock critical section
> and caching the result in a local variable.
>
> This patch addresses only the bacmp() race in iso_sock_connect();
> other unprotected iso_pi(sk) accesses are fixed separately in the
> next patch.
>
> KCSAN report:
>
> BUG: KCSAN: data-race in memcmp+0x39/0xb0
>
> race at unknown origin, with read to 0xffff8f96ea66dde3 of 1 bytes by task 549 on cpu 1:
>  memcmp+0x39/0xb0
>  iso_sock_connect+0x275/0xb40
>  __sys_connect_file+0xbd/0xe0
>  __sys_connect+0xe0/0x110
>  __x64_sys_connect+0x40/0x50
>  x64_sys_call+0xcad/0x1c60
>  do_syscall_64+0x133/0x590
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> value changed: 0x00 -> 0xee
>
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 1 UID: 0 PID: 549 Comm: iso_race_combin Not tainted 7.0.0-08391-g1d51b370a0f8 #40 PREEMPT(lazy)
>
> Fixes: ccf74f2390d6 ("Bluetooth: Add BTPROTO_ISO socket type")
> Signed-off-by: SeungJu Cheon <suunj1331@gmail.com>
> ---
>  net/bluetooth/iso.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/bluetooth/iso.c b/net/bluetooth/iso.c
> index be145e2736b7..14963ba68597 100644
> --- a/net/bluetooth/iso.c
> +++ b/net/bluetooth/iso.c
> @@ -1169,6 +1169,7 @@ static int iso_sock_connect(struct socket *sock, struct sockaddr_unsized *addr,
>         struct sockaddr_iso *sa = (struct sockaddr_iso *)addr;
>         struct sock *sk = sock->sk;
>         int err;
> +       bool bcast;
>
>         BT_DBG("sk %p", sk);
>
> @@ -1191,9 +1192,11 @@ static int iso_sock_connect(struct socket *sock, struct sockaddr_unsized *addr,
>         bacpy(&iso_pi(sk)->dst, &sa->iso_bdaddr);
>         iso_pi(sk)->dst_type = sa->iso_bdaddr_type;
>
> +       bcast = !bacmp(&iso_pi(sk)->dst, BDADDR_ANY);
> +
>         release_sock(sk);
>
> -       if (bacmp(&iso_pi(sk)->dst, BDADDR_ANY))
> +       if (!bcast)
>                 err = iso_connect_cis(sk);
>         else
>                 err = iso_connect_bis(sk);
> --
> 2.52.0
>

https://sashiko.dev/#/patchset/20260418053401.128483-1-suunj1331%40gmail.com

Seems valid, so we might just use sa in place of iso_pi(sk) to
avoid using it without sk being locked. Other problems it may reveal
need to be addressed in separate patches.
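
A sketch of that idea in userspace terms, under the assumption that the
CIS/BIS decision only needs the address the caller passed in. The types
and helpers below (bdaddr_model, iso_sock_model, lock_model(),
unlock_model(), connect_is_bcast()) are hypothetical, and the lock is a
plain flag rather than a real lock_sock().

```c
#include <stdbool.h>
#include <string.h>

typedef struct {
	unsigned char b[6];
} bdaddr_model;

struct iso_sock_model {
	bdaddr_model dst;	/* shared state, read only under the lock */
	bool locked;
};

static void lock_model(struct iso_sock_model *sk)   { sk->locked = true; }
static void unlock_model(struct iso_sock_model *sk) { sk->locked = false; }

/* Returns true when the broadcast (BIS) path should be taken. */
static bool connect_is_bcast(struct iso_sock_model *sk,
			     const bdaddr_model *sa_addr)
{
	static const bdaddr_model any;	/* all zeroes, like BDADDR_ANY */

	lock_model(sk);
	sk->dst = *sa_addr;		/* publish under the lock */
	unlock_model(sk);

	/* Decide from the caller's copy, never from sk->dst after unlock. */
	return memcmp(sa_addr, &any, sizeof(any)) == 0;
}
```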

-- 
Luiz Augusto von Dentz

^ permalink raw reply

* [PATCH net] ipv4: clamp MCAST_MSFILTER getsockopt to optlen, not gf_numsrc
From: Greg Kroah-Hartman @ 2026-04-20 19:26 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Greg Kroah-Hartman, David S. Miller, David Ahern,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, stable

ip_get_mcast_msfilter() and its compat sibling read gf_numsrc from
the user's buffer header and pass it to ip_mc_gsfget(), which writes:

    min(actual_sources, gf_numsrc) * sizeof(struct sockaddr_storage)

bytes back into the user's optval starting at the gf_slist_flex offset.
The only optlen check is len >= size0 (the header), so a user can pass
optlen = 144 (header only) with gf_numsrc = 4.  If the socket has at
least 4 sources joined, the kernel writes 4*128 = 512 bytes via
copy_to_sockptr_offset() past the end of the user buffer.

This is a kernel-driven userspace heap overflow: the user told the
kernel their buffer size via optlen, the kernel ignored it and used a
field inside the buffer instead.  On a real system the writes go into
adjacent userspace heap and copy_to_user does not fault on mapped heap
pages.

Clamp gf_numsrc to (len - size0) / sizeof(sockaddr_storage) before the
call so the kernel never writes past what the user provided.  The
setsockopt path already has the equivalent check
(GROUP_FILTER_SIZE(gf_numsrc) > optlen at line 790).

Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Reported-by: Anthropic
Assisted-by: gkh_clanker_t1000
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/ipv4/ip_sockglue.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index a55ef327ec93..c9bf5d223f21 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -1456,6 +1456,11 @@ static int ip_get_mcast_msfilter(struct sock *sk, sockptr_t optval,
 		return -EFAULT;
 
 	num = gsf.gf_numsrc;
+
+	if (num > (len - size0) / sizeof(struct sockaddr_storage))
+		num = (len - size0) / sizeof(struct sockaddr_storage);
+	gsf.gf_numsrc = num;
+
 	err = ip_mc_gsfget(sk, &gsf, optval,
 			   offsetof(struct group_filter, gf_slist_flex));
 	if (err)
@@ -1486,8 +1491,12 @@ static int compat_ip_get_mcast_msfilter(struct sock *sk, sockptr_t optval,
 	gf.gf_interface = gf32.gf_interface;
 	gf.gf_fmode = gf32.gf_fmode;
 	num = gf.gf_numsrc = gf32.gf_numsrc;
-	gf.gf_group = gf32.gf_group;
 
+	if (num > (len - size0) / sizeof(struct sockaddr_storage))
+		num = (len - size0) / sizeof(struct sockaddr_storage);
+	gf.gf_numsrc = num;
+
+	gf.gf_group = gf32.gf_group;
 	err = ip_mc_gsfget(sk, &gf, optval,
 			   offsetof(struct compat_group_filter, gf_slist_flex));
 	if (err)
-- 
2.53.0
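
The arithmetic in the commit message can be checked with a few lines of
userspace C. This is a sketch only: clamp_numsrc() and bytes_written()
are hypothetical helpers, SS_SIZE is the 128-byte sockaddr_storage size
used in the message, and the extra len < size0 guard is an addition for
the standalone example.

```c
#include <stddef.h>

#define SS_SIZE 128u	/* sizeof(struct sockaddr_storage), as in the message */

/* Bytes the kernel would write for 'num' requested source slots. */
static size_t bytes_written(unsigned int actual_sources, unsigned int num)
{
	unsigned int n = actual_sources < num ? actual_sources : num;

	return (size_t)n * SS_SIZE;
}

/* Clamp the user-supplied source count to what fits after the header. */
static unsigned int clamp_numsrc(size_t len, size_t size0, unsigned int num)
{
	size_t room;

	if (len < size0)
		return 0;	/* extra guard for the standalone example */
	room = (len - size0) / SS_SIZE;
	return num > room ? (unsigned int)room : num;
}
```

With optlen = 144 and gf_numsrc = 4, the unclamped write is 512 bytes,
while the clamped count is 0 and nothing is written past the header.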


^ permalink raw reply related

* [PATCH net] ipv6: rpl: expand skb head when recompressed SRH grows, not only on last segment
From: Greg Kroah-Hartman @ 2026-04-20 19:32 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Greg Kroah-Hartman, David S. Miller, David Ahern,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, stable

ipv6_rpl_srh_rcv() processes a Routing Protocol for LLNs Source Routing
Header by decompressing it, swapping the next segment address into
ipv6_hdr->daddr, recompressing, and pushing the new header back. The
recompressed header can be larger than the original when the
address-elision opportunities are worse after the swap.

The function pulls (hdr->hdrlen + 1) << 3 bytes (the old header) and
pushes (chdr->hdrlen + 1) << 3 + sizeof(ipv6hdr) bytes (the new header
plus the IPv6 header).  pskb_expand_head() is called to guarantee
headroom only when segments_left == 0.

A crafted SRH that loops back to the local host (each segment is a local
address, so ip6_route_input() delivers it back to ipv6_rpl_srh_rcv())
with chdr growing on each pass exhausts headroom over several
iterations.  When skb_push() lands skb->data exactly at skb->head,
skb_reset_network_header() stores 0, and skb_mac_header_rebuild()'s
skb_set_mac_header(skb, -skb->mac_len) computes 0 + (u16)(-14) = 65522.
The subsequent memmove writes 14 bytes at skb->head + 65522.

Expand the head whenever there is insufficient room for the push, not
only on the final segment.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Reported-by: Anthropic
Cc: stable <stable@kernel.org>
Assisted-by: gkh_clanker_t1000
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/ipv6/exthdrs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index 95558fd6f447..d866ab011e0a 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -592,7 +592,9 @@ static int ipv6_rpl_srh_rcv(struct sk_buff *skb)
 	skb_pull(skb, ((hdr->hdrlen + 1) << 3));
 	skb_postpull_rcsum(skb, oldhdr,
 			   sizeof(struct ipv6hdr) + ((hdr->hdrlen + 1) << 3));
-	if (unlikely(!hdr->segments_left)) {
+	if (unlikely(!hdr->segments_left ||
+		     skb_headroom(skb) < sizeof(struct ipv6hdr) +
+					 ((chdr->hdrlen + 1) << 3))) {
 		if (pskb_expand_head(skb, sizeof(struct ipv6hdr) + ((chdr->hdrlen + 1) << 3), 0,
 				     GFP_ATOMIC)) {
 			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_OUTDISCARDS);
-- 
2.53.0
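
Both numbers in the commit message, the 65522 offset and the headroom
condition, can be reproduced in userspace. The sketch below uses
hypothetical helpers (push_len(), needs_expand(), mac_header_offset())
and a fixed 40-byte IPv6 header; it models only the arithmetic, not the
skb itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IPV6_HDR_LEN 40u	/* sizeof(struct ipv6hdr) */

/* Bytes pushed back: IPv6 header plus the recompressed SRH. */
static size_t push_len(uint8_t chdr_hdrlen)
{
	return IPV6_HDR_LEN + (((size_t)chdr_hdrlen + 1) << 3);
}

/* The patched condition: last segment, or not enough headroom. */
static bool needs_expand(size_t headroom, uint8_t chdr_hdrlen,
			 uint8_t segments_left)
{
	return !segments_left || headroom < push_len(chdr_hdrlen);
}

/* Header offsets are 16 bits wide, so a negative result wraps. */
static uint16_t mac_header_offset(uint16_t network_header, uint16_t mac_len)
{
	return (uint16_t)(network_header - mac_len);
}
```

mac_header_offset(0, 14) gives 65522, matching the value in the message.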

^ permalink raw reply related
