Netdev List
* Re: [PATCH v7 0/5] netem: bug fixes
From: Simon Horman @ 2026-04-17 16:02 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20260415142822.133241-1-stephen@networkplumber.org>

On Wed, Apr 15, 2026 at 07:27:03AM -0700, Stephen Hemminger wrote:
> These bugs were found during AI-assisted review of sch_netem.c
> during investigation of the packet duplication recursion problem
> addressed in Jamal's series.
> 
> The fixes cover:
> 
>  - probability gaps in the 4-state Markov loss model
>  - queue limit not accounting for reordered packets
>  - PRNG reseeded on every tc change, breaking reproducibility
>  - slot delay configuration not validated for inverted ranges
>  - slot delay arithmetic overflow for ranges above ~2.1 seconds
> 
> v7 - queue limit check Fixes: goes back further to earlier change
>    - use NL_SET_ERR_MSG_ATTR
> 
> Stephen Hemminger (5):
>   net/sched: netem: fix probability gaps in 4-state loss model
>   net/sched: netem: fix queue limit check to include reordered packets
>   net/sched: netem: only reseed PRNG when seed is explicitly provided
>   net/sched: netem: check for invalid slot range
>   net/sched: netem: fix slot delay calculation overflow

To the maintainers: I'd like to ask for more time to complete review of this.
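
For readers following the last item in the list above, here is a minimal
userspace sketch of how such an overflow can arise. It assumes (the thread
does not confirm this) that the slot delay is derived by multiplying a 32-bit
random value by the range in nanoseconds and shifting right by 32. All
function names are hypothetical, and __builtin_mul_overflow is a GCC/Clang
builtin. Under that assumption the signed 64-bit product can overflow once
the range exceeds 2^31 ns, roughly 2.147 s, which is consistent with the
"~2.1 seconds" figure quoted.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when rnd * range_ns does not fit in a
 * signed 64-bit value (uses a GCC/Clang builtin).
 */
static bool scaled_product_overflows(uint32_t rnd, int64_t range_ns)
{
	int64_t product;

	return __builtin_mul_overflow((int64_t)rnd, range_ns, &product);
}

/*
 * Hypothetical overflow-free form of (rnd * range_ns) >> 32.
 * Splitting the range into 32-bit halves keeps every partial
 * product below 2^64, and the result is exact because the high
 * half contributes a multiple of 2^32 before the shift.
 */
static uint64_t scale_range(uint32_t rnd, uint64_t range_ns)
{
	uint64_t hi = range_ns >> 32;
	uint64_t lo = range_ns & 0xffffffffu;

	return (uint64_t)rnd * hi + (((uint64_t)rnd * lo) >> 32);
}
```

With the largest random value, a 2 s range still fits in the signed product
while a 3 s range does not; scale_range() handles both.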


* Re: [PATCH net v3 2/4] nfc: llcp: fix TLV parsing in parse_gb_tlv and parse_connection_tlv
From: Simon Horman @ 2026-04-17 16:04 UTC (permalink / raw)
  To: Lekë Hapçiu
  Cc: netdev, davem, edumazet, kuba, pabeni, linux-kernel, stable,
	Lekë Hapçiu
In-Reply-To: <20260414233534.55973-3-snowwlake@icloud.com>

On Wed, Apr 15, 2026 at 01:35:31AM +0200, Lekë Hapçiu wrote:
> From: Lekë Hapçiu <framemain@outlook.com>
> 
> nfc_llcp_parse_gb_tlv() and nfc_llcp_parse_connection_tlv() walk TLV
> arrays whose length and content come from a peer-supplied frame.  The
> parsing loop has three weaknesses:
> 
>  1. `offset` is declared u8 while `tlv_array_len` is u16.  In
>     parse_connection_tlv() the TLV array can reach ~2173 bytes (MIUX
>     up to 0x7FF), so 128 zero-length TLVs wrap `offset` back to 0 and
>     the loop never terminates while `tlv` advances past the buffer.
> 
>  2. The guard `offset < tlv_array_len` only proves one byte is
>     available, but the body reads tlv[0] (type) and tlv[1] (length).
>     When one byte remains, tlv[1] is out of bounds.
> 
>  3. `length` is read from peer data and used to advance `tlv` without
>     being checked against the remaining array space.  A crafted length
>     walks `tlv` past the buffer; the next iteration reads tlv[0]/tlv[1]
>     from adjacent memory.
> 
> The llcp_tlv8() and llcp_tlv16() accessors additionally read tlv[2]
> and tlv[2..3]; a zero-length TLV makes those reads out of bounds.
> 
> Fix: promote `offset` to u16; add two per-iteration guards, one for
> the TLV header and one for the TLV value; require length >= 1 for all
> TLVs before the type dispatch and length >= 2 for the llcp_tlv16()
> accessors (MIUX, WKS).  Return -EINVAL on malformed input.
> 
> Reached on ATR_RES (parse_gb_tlv) and on CONNECT/CC PDUs before a
> connection is established (parse_connection_tlv).  Both are
> triggerable from any NFC peer within ~4 cm, without authentication.

As per my comment on patch 1/4, I don't understand the relationship
between the last sentence above and this patch.

> 
> Reported-by: Simon Horman <horms@kernel.org>
> Fixes: d646960f7986 ("NFC: Add LLCP sockets")

I think the hash in the Fixes line is correct, but the subject is not.
IOW, I think this should be:

Fixes: d646960f7986 ("NFC: Initial LLCP support")

> Cc: stable@vger.kernel.org
> Signed-off-by: Lekë Hapçiu <framemain@outlook.com>

Otherwise, looks good to me.


While looking over this I noticed that nfc_llcp_connect_sn() seems
to have the same kind of problem. You may wish to address that as
a follow-up.
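
For reference, the guards described in the commit message can be sketched in
userspace roughly as below. This is an illustration only, not the patch
itself: walk_tlvs() is a hypothetical helper that just counts well-formed
TLVs, and it applies the "length >= 1" rule to every TLV as the commit
message describes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: walk a peer-supplied TLV array and return the
 * number of well-formed TLVs, or -EINVAL on malformed input.
 */
static int walk_tlvs(const uint8_t *tlv_array, uint16_t tlv_array_len)
{
	uint16_t offset = 0;	/* same width as the array length, no u8 wrap */
	int count = 0;

	while (offset < tlv_array_len) {
		const uint8_t *tlv = tlv_array + offset;
		uint8_t length;

		/* Header guard: both the type and the length byte must exist. */
		if (tlv_array_len - offset < 2)
			return -EINVAL;

		length = tlv[1];

		/* Value guard: non-empty and fully inside the array. */
		if (length < 1 || length > tlv_array_len - offset - 2)
			return -EINVAL;

		offset += 2 + length;
		count++;
	}

	return count;
}
```

Because the value guard bounds length by the remaining space, offset can
never exceed tlv_array_len, so the u16 addition cannot wrap.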

...


* RE: [PATCH] fixup! net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Sai Krishna Gajula @ 2026-04-17 16:10 UTC (permalink / raw)
  To: Fidelio Lawson, netdev@vger.kernel.org
  Cc: Marek Vasut, Andrew Lunn, Woojung Huh, Fidelio Lawson
In-Reply-To: <20260417155025.488290-1-fidelio.lawson@exotec.com>

> -----Original Message-----
> From: Fidelio Lawson <lawson.fidelio@gmail.com>
> Sent: Friday, April 17, 2026 9:20 PM
> To: netdev@vger.kernel.org
> Cc: Marek Vasut <marex@nabladev.com>; Andrew Lunn <andrew@lunn.ch>;
> Woojung Huh <woojung.huh@microchip.com>; Fidelio Lawson
> <fidelio.lawson@exotec.com>
> Subject: [PATCH] fixup! net: dsa: microchip: implement KSZ87xx
> Module 3 low-loss cable errata

Since this erratum fix is targeted at "net", adding a Fixes tag may be required.

> 
> ---
>  drivers/net/dsa/microchip/ksz8.c     | 6 ++++++
>  drivers/net/dsa/microchip/ksz8_reg.h | 3 +++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/net/dsa/microchip/ksz8.c
> b/drivers/net/dsa/microchip/ksz8.c
> index 0f2b8acee80f..62fc59c3da7e 100644
> --- a/drivers/net/dsa/microchip/ksz8.c
> +++ b/drivers/net/dsa/microchip/ksz8.c
> @@ -1297,6 +1297,9 @@ int ksz8_w_phy(struct ksz_device *dev, u16 phy, u16 reg, u16 val)
>  	case PHY_REG_KSZ87XX_LPF_BW:
>  		if (!ksz_is_ksz87xx(dev))
>  			return -EOPNOTSUPP;
> +		/* Only accept LPF bandwidth bits [7:6] */
> +		if (val & ~KSZ87XX_LPF_VALID_MASK)
> +			return -EINVAL;
>  		ret = ksz8_ind_write8(dev, TABLE_LINK_MD, KSZ87XX_REG_PHY_LPF, (u8)val);
>  		if (ret)
>  			return ret;
> @@ -1305,6 +1308,9 @@ int ksz8_w_phy(struct ksz_device *dev, u16 phy, u16 reg, u16 val)
>  	case PHY_REG_KSZ87XX_EQ_INIT:
>  		if (!ksz_is_ksz87xx(dev))
>  			return -EOPNOTSUPP;
> +		/* Only accept DSP EQ initial value bits [5:0] */
> +		if (val & ~KSZ87XX_DSP_EQ_VALID_MASK)
> +			return -EINVAL;
>  		ret = ksz8_ind_write8(dev, TABLE_LINK_MD, KSZ87XX_REG_DSP_EQ, (u8)val);
>  		if (ret)
>  			return ret;
> diff --git a/drivers/net/dsa/microchip/ksz8_reg.h
> b/drivers/net/dsa/microchip/ksz8_reg.h
> index 5df17c463f7c..cd41214f874e 100644
> --- a/drivers/net/dsa/microchip/ksz8_reg.h
> +++ b/drivers/net/dsa/microchip/ksz8_reg.h
> @@ -206,6 +206,9 @@
>  #define KSZ87XX_REG_DSP_EQ			0x08   /* DSP EQ initial value */
>  #define KSZ87XX_REG_PHY_LPF			0x4C   /* RX LPF bandwidth */
> 
> +#define KSZ87XX_DSP_EQ_VALID_MASK	GENMASK(5, 0)
> +#define KSZ87XX_LPF_VALID_MASK		GENMASK(7, 6)
> +
>  /* For KSZ8765. */
>  #define PORT_REMOTE_ASYM_PAUSE		BIT(5)
>  #define PORT_REMOTE_SYM_PAUSE		BIT(4)
> --
> 2.53.0
> 



* [PATCH 1/2 nf] netfilter: nfnetlink_osf: fix out-of-bounds read on option matching
From: Fernando Fernandez Mancera @ 2026-04-17 16:20 UTC (permalink / raw)
  To: netfilter-devel
  Cc: netdev, coreteam, pablo, fw, phil, Fernando Fernandez Mancera

In nf_osf_match(), the nf_osf_hdr_ctx structure is initialized once
and passed by reference to nf_osf_match_one() for each fingerprint
checked. During TCP option parsing, nf_osf_match_one() advances the
shared ctx->optp pointer.

If a fingerprint perfectly matches, the function returns early without
restoring ctx->optp to its initial state. If the user has configured
NF_OSF_LOGLEVEL_ALL, the loop continues to the next fingerprint.
However, because ctx->optp was not restored, the next call to
nf_osf_match_one() starts parsing from the end of the options buffer.
This causes subsequent matches to read garbage data and fail
immediately, making it impossible to log more than one match, or
causing incorrect matches to be logged.

Instead of using a shared ctx->optp pointer, pass the context as a
constant pointer and use a local pointer (optp) for TCP option
traversal. This makes nf_osf_match_one() strictly stateless from the
caller's perspective, ensuring every fingerprint check starts at the
correct option offset.

Fixes: 1a6a0951fc00 ("netfilter: nfnetlink_osf: add missing fmatch check")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/netfilter/nfnetlink_osf.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c
index 45d9ad231a92..f58267986453 100644
--- a/net/netfilter/nfnetlink_osf.c
+++ b/net/netfilter/nfnetlink_osf.c
@@ -64,9 +64,9 @@ struct nf_osf_hdr_ctx {
 static bool nf_osf_match_one(const struct sk_buff *skb,
 			     const struct nf_osf_user_finger *f,
 			     int ttl_check,
-			     struct nf_osf_hdr_ctx *ctx)
+			     const struct nf_osf_hdr_ctx *ctx)
 {
-	const __u8 *optpinit = ctx->optp;
+	const __u8 *optp = ctx->optp;
 	unsigned int check_WSS = 0;
 	int fmatch = FMATCH_WRONG;
 	int foptsize, optnum;
@@ -95,17 +95,17 @@ static bool nf_osf_match_one(const struct sk_buff *skb,
 	check_WSS = f->wss.wc;
 
 	for (optnum = 0; optnum < f->opt_num; ++optnum) {
-		if (f->opt[optnum].kind == *ctx->optp) {
+		if (f->opt[optnum].kind == *optp) {
 			__u32 len = f->opt[optnum].length;
-			const __u8 *optend = ctx->optp + len;
+			const __u8 *optend = optp + len;
 
 			fmatch = FMATCH_OK;
 
-			switch (*ctx->optp) {
+			switch (*optp) {
 			case OSFOPT_MSS:
-				mss = ctx->optp[3];
+				mss = optp[3];
 				mss <<= 8;
-				mss |= ctx->optp[2];
+				mss |= optp[2];
 
 				mss = ntohs((__force __be16)mss);
 				break;
@@ -113,7 +113,7 @@ static bool nf_osf_match_one(const struct sk_buff *skb,
 				break;
 			}
 
-			ctx->optp = optend;
+			optp = optend;
 		} else
 			fmatch = FMATCH_OPT_WRONG;
 
@@ -156,9 +156,6 @@ static bool nf_osf_match_one(const struct sk_buff *skb,
 		}
 	}
 
-	if (fmatch != FMATCH_OK)
-		ctx->optp = optpinit;
-
 	return fmatch == FMATCH_OK;
 }
 
-- 
2.53.0
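
As a standalone illustration of the idea in the commit message (a const
context plus a local cursor), here is a small userspace sketch. The struct
and function names are hypothetical stand-ins rather than the kernel's, and
the bounds checks are added only so the example is safe to run on its own.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared header context. */
struct opt_ctx {
	const uint8_t *optp;	/* start of the TCP options */
	size_t optsize;		/* number of option bytes */
};

/* Hypothetical stand-in for one fingerprint option entry. */
struct opt_fp {
	uint8_t kind;
	uint8_t length;
};

/*
 * Match one fingerprint against the options. The context is const and
 * traversal uses a local cursor, so every call starts at the first
 * option regardless of what earlier calls did.
 */
static bool match_one(const struct opt_ctx *ctx,
		      const struct opt_fp *fp, size_t n)
{
	const uint8_t *optp = ctx->optp;
	const uint8_t *end = ctx->optp + ctx->optsize;
	size_t i;

	for (i = 0; i < n; i++) {
		if (optp >= end || *optp != fp[i].kind)
			return false;
		if (fp[i].length == 0 || fp[i].length > (size_t)(end - optp))
			return false;
		optp += fp[i].length;
	}

	return true;
}
```

Calling match_one() twice with the same context gives the same answer both
times, which is the property the "log all matches" loop relies on.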



* Re: [Intel-wired-lan] [PATCH iwl-net v2] igc: fix potential skb leak in igc_fpe_xmit_smd_frame()
From: Kohei Enju @ 2026-04-17 16:20 UTC (permalink / raw)
  To: Simon Horman
  Cc: intel-wired-lan, netdev, Tony Nguyen, Przemek Kitszel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Faizal Rahim, kohei.enju, stable
In-Reply-To: <20260417115122.GA31784@horms.kernel.org>

On 04/17 12:51, Simon Horman wrote:
> On Wed, Apr 15, 2026 at 02:52:18AM +0000, Kohei Enju wrote:
> > When igc_fpe_init_tx_descriptor() fails, no one takes care of an
> > allocated skb, leaking it. [1]
> > Use dev_kfree_skb_any() on failure.
> > 
> > Tested on an I226 adapter with the following command, while injecting
> > faults in igc_fpe_init_tx_descriptor() to trigger the error path.
> >  # ethtool --set-mm $DEV verify-enabled on tx-enabled on pmac-enabled on
> > 
> > [1]
> > unreferenced object 0xffff888113c6cdc0 (size 224):
> > ...
> >   backtrace (crc be3d3fda):
> >     kmem_cache_alloc_node_noprof+0x3b1/0x410
> >     __alloc_skb+0xde/0x830
> >     igc_fpe_xmit_smd_frame.isra.0+0xad/0x1b0
> >     igc_fpe_send_mpacket+0x37/0x90
> >     ethtool_mmsv_verify_timer+0x15e/0x300
> > 
> > Cc: stable@vger.kernel.org
> > Fixes: 5422570c0010 ("igc: add support for frame preemption verification")
> > Signed-off-by: Kohei Enju <kohei@enjuk.jp>
> > ---
> > Changes:
> >   v2:
> >     - change to idiomatic style with goto (Simon)
> >     - add Cc to stable (Alex)
> > >     - add reproduction steps (Alex)
> >   v1: https://lore.kernel.org/all/20260329145122.126040-1-kohei@enjuk.jp/
> 
> Thanks for the update.
> 
> Reviewed-by: Simon Horman <horms@kernel.org>
> 
> Sashiko has comments about a potential existing bug in the same code path.
> I'd appreciate it if, as a follow-up, you could look over that.

Thanks for the heads-up. I'll look into it.

> 
> Thanks!
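
For readers unfamiliar with the goto-based cleanup style mentioned in the v2
changelog, a generic userspace sketch follows. It is not the igc code: every
name is hypothetical, malloc()/free() stand in for skb allocation and
dev_kfree_skb_any(), and a counter is used only to make a leak observable.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Buffers allocated but not yet freed; non-zero after a call means a leak. */
static int live_buffers;

static void *buf_alloc(size_t len)
{
	void *p = malloc(len);

	if (p)
		live_buffers++;
	return p;
}

static void buf_free(void *p)
{
	if (p) {
		free(p);
		live_buffers--;
	}
}

/* Hypothetical setup step that can be made to fail, like fault injection. */
static int init_descriptor(void *buf, bool inject_fault)
{
	(void)buf;
	return inject_fault ? -ENOMEM : 0;
}

static int xmit_frame(bool inject_fault)
{
	void *buf;
	int err;

	buf = buf_alloc(64);
	if (!buf)
		return -ENOMEM;

	err = init_descriptor(buf, inject_fault);
	if (err)
		goto err_free;

	/* Success: ownership would pass on here; the sketch just releases it. */
	buf_free(buf);
	return 0;

err_free:
	/* Single exit point for failures so the buffer is never leaked. */
	buf_free(buf);
	return err;
}
```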


* [PATCH 2/2 nf] netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check
From: Fernando Fernandez Mancera @ 2026-04-17 16:20 UTC (permalink / raw)
  To: netfilter-devel
  Cc: netdev, coreteam, pablo, fw, phil, Fernando Fernandez Mancera,
	Kito Xu (veritas501)
In-Reply-To: <20260417162057.3732-1-fmancera@suse.de>

The nf_osf_ttl() function accessed skb->dev to perform a local interface
address lookup without verifying that the device pointer was valid.

Additionally, the implementation utilized an in_dev_for_each_ifa_rcu
loop to match the packet source address against local interface
addresses. It assumed that packets from the same subnet should not see a
decrement on the initial TTL. However, a packet might appear to come
from the same subnet when it actually does not, especially in modern
environments with containers and virtual switching.

Remove the device dereference and interface loop. Replace the logic with
a switch statement that evaluates the TTL according to the ttl_check.

Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Kito Xu (veritas501) <hxzene@gmail.com>
Closes: https://lore.kernel.org/netfilter-devel/20260414074556.2512750-1-hxzene@gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
Note: if some help is needed during the backport I can assist.
---
 net/netfilter/nfnetlink_osf.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c
index f58267986453..f0d1e596e146 100644
--- a/net/netfilter/nfnetlink_osf.c
+++ b/net/netfilter/nfnetlink_osf.c
@@ -31,26 +31,18 @@ EXPORT_SYMBOL_GPL(nf_osf_fingers);
 static inline int nf_osf_ttl(const struct sk_buff *skb,
 			     int ttl_check, unsigned char f_ttl)
 {
-	struct in_device *in_dev = __in_dev_get_rcu(skb->dev);
 	const struct iphdr *ip = ip_hdr(skb);
-	const struct in_ifaddr *ifa;
-	int ret = 0;
 
-	if (ttl_check == NF_OSF_TTL_TRUE)
+	switch (ttl_check) {
+	case NF_OSF_TTL_TRUE:
 		return ip->ttl == f_ttl;
-	if (ttl_check == NF_OSF_TTL_NOCHECK)
-		return 1;
-	else if (ip->ttl <= f_ttl)
+		break;
+	case NF_OSF_TTL_NOCHECK:
 		return 1;
-
-	in_dev_for_each_ifa_rcu(ifa, in_dev) {
-		if (inet_ifa_match(ip->saddr, ifa)) {
-			ret = (ip->ttl == f_ttl);
-			break;
-		}
+	case NF_OSF_TTL_LESS:
+	default:
+		return ip->ttl <= f_ttl;
 	}
-
-	return ret;
 }
 
 struct nf_osf_hdr_ctx {
-- 
2.53.0
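
The resulting TTL decision can be summarised with the following userspace
sketch. The enum and function names are hypothetical stand-ins for the
NF_OSF_TTL_* modes and nf_osf_ttl(); the enum values here are illustrative
and not taken from the UAPI header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three TTL check modes. */
enum ttl_check {
	TTL_TRUE,	/* packet TTL must equal the fingerprint TTL */
	TTL_LESS,	/* packet TTL must not exceed the fingerprint TTL */
	TTL_NOCHECK,	/* TTL is ignored */
};

/* Decide whether a packet TTL is acceptable for a fingerprint TTL. */
static bool ttl_matches(uint8_t pkt_ttl, enum ttl_check check, uint8_t fp_ttl)
{
	switch (check) {
	case TTL_TRUE:
		return pkt_ttl == fp_ttl;
	case TTL_NOCHECK:
		return true;
	case TTL_LESS:
	default:
		return pkt_ttl <= fp_ttl;
	}
}
```

No device or interface lookup is involved, so the decision depends only on
the packet header and the fingerprint.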



* Re: [PATCH for-7.1-fixes 1/2] rhashtable: add no_sync_grow option
From: Tejun Heo @ 2026-04-17 16:25 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Vernet, Andrea Righi, Changwoo Min,
	Emil Tsalapatis, linux-crypto, sched-ext, linux-kernel,
	Florian Westphal, netdev
In-Reply-To: <aeHmeAz-Z-Rx2MqX@gondor.apana.org.au>

Hello,

On Fri, Apr 17, 2026 at 03:51:20PM +0800, Herbert Xu wrote:
> rhashtable originated in networking where it tries very hard to
> stop the hash table from ever degenerating into a linked list.

I see.

> If your use-case is not as adversarial as that, and you're happy
> for the hash table to degenerate into a linked-list in the worst
> case, then yes it's absolutely fine to not grow the table (or
> try to grow it and fail with kmalloc_nolock).

My use case is a bit different. I want a resizable hashtable which can be
used under raw spinlock and doesn't fail unnecessarily. My only adversary is
memory pressure, and operation failures can be harmful. That is, if the system is
under severe memory pressure, hashtable becoming temporarily slower is not a
big problem as long as it restores reasonable operation once the system
recovers. However, if the insertion operation fails under e.g. sudden
network rx burst that drains atomic reserve, that can lead to fatal failure
- e.g. forks failing out of the blue on a busy but mostly okay system. I think
this pretty much requires all hashtable growths to be asynchronous.

> It's just that we haven't had any users like this until now and
> the feature that you want got removed because of that.
> 
> I'm more than happy to bring it back (commit 5f8ddeab10ce).

That'd be great but looking at the commit, I'm not sure it reliably avoids
allocation in the synchronous path.

Thanks.

-- 
tejun


* Re: [PATCH bpf v3 2/2] selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks
From: Martin KaFai Lau @ 2026-04-17 16:25 UTC (permalink / raw)
  To: KaFai Wan
  Cc: daniel, john.fastabend, sdf, ast, andrii, eddyz87, memxor, song,
	yonghong.song, jolsa, davem, edumazet, kuba, pabeni, horms, shuah,
	jiayuan.chen, bpf, netdev, linux-kernel, linux-kselftest
In-Reply-To: <20260417092035.2299913-3-kafai.wan@linux.dev>

On Fri, Apr 17, 2026 at 05:20:35PM +0800, KaFai Wan wrote:
> diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
> index 56685fc03c7e..7b9dbbb84316 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
> @@ -461,7 +461,7 @@ static void misc(void)
>  	const unsigned int nr_data = 2;
>  	struct bpf_link *link;
>  	struct sk_fds sk_fds;
> -	int i, ret;
> +	int i, ret, true_val = 1;
>  
>  	lport_linum_map_fd = bpf_map__fd(misc_skel->maps.lport_linum_map);
>  
> @@ -477,6 +477,10 @@ static void misc(void)
>  		return;
>  	}
>  
> +	ret = setsockopt(sk_fds.active_fd, SOL_TCP, TCP_NODELAY, &true_val, sizeof(true_val));

Same comment as in v2. Why is this setsockopt needed?
The setsockopt in userspace is unnecessary. In the future,
we may need to understand why it is needed here in the first place.


* Re: [PATCH] fixup! net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Fidelio LAWSON @ 2026-04-17 16:30 UTC (permalink / raw)
  To: Sai Krishna Gajula, netdev@vger.kernel.org
  Cc: Marek Vasut, Andrew Lunn, Woojung Huh, Fidelio Lawson
In-Reply-To: <BYAPR18MB3735885B13017500E153FEA2A0202@BYAPR18MB3735.namprd18.prod.outlook.com>

On 4/17/26 18:10, Sai Krishna Gajula wrote:
>> -----Original Message-----
>> From: Fidelio Lawson <lawson.fidelio@gmail.com>
>> Sent: Friday, April 17, 2026 9:20 PM
>> To: netdev@vger.kernel.org
>> Cc: Marek Vasut <marex@nabladev.com>; Andrew Lunn <andrew@lunn.ch>;
>> Woojung Huh <woojung.huh@microchip.com>; Fidelio Lawson
>> <fidelio.lawson@exotec.com>
>> Subject: [PATCH] fixup! net: dsa: microchip: implement KSZ87xx
>> Module 3 low-loss cable errata
> 
> Since this erratum fix is targeted at "net", adding a Fixes tag may be required.
> 

Good point, thanks for spotting this.
I’ll add an appropriate Fixes tag referencing the commit that introduced
the KSZ87xx support, and follow up with an updated fixup.

Thanks



* [PATCH] fixup! net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Fidelio Lawson @ 2026-04-17 16:39 UTC (permalink / raw)
  To: netdev; +Cc: Marek Vasut, Andrew Lunn, Woojung Huh, Fidelio Lawson
In-Reply-To: <20260417-ksz87xx_errata_low_loss_connections-v4-1-6c7044ec4363@exotec.com>

Fixes: e66f840c08a2 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
---
 drivers/net/dsa/microchip/ksz8.c     | 6 ++++++
 drivers/net/dsa/microchip/ksz8_reg.h | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/drivers/net/dsa/microchip/ksz8.c b/drivers/net/dsa/microchip/ksz8.c
index 0f2b8acee80f..62fc59c3da7e 100644
--- a/drivers/net/dsa/microchip/ksz8.c
+++ b/drivers/net/dsa/microchip/ksz8.c
@@ -1297,6 +1297,9 @@ int ksz8_w_phy(struct ksz_device *dev, u16 phy, u16 reg, u16 val)
 	case PHY_REG_KSZ87XX_LPF_BW:
 		if (!ksz_is_ksz87xx(dev))
 			return -EOPNOTSUPP;
+		/* Only accept LPF bandwidth bits [7:6] */
+		if (val & ~KSZ87XX_LPF_VALID_MASK)
+			return -EINVAL;
 		ret = ksz8_ind_write8(dev, TABLE_LINK_MD, KSZ87XX_REG_PHY_LPF, (u8)val);
 		if (ret)
 			return ret;
@@ -1305,6 +1308,9 @@ int ksz8_w_phy(struct ksz_device *dev, u16 phy, u16 reg, u16 val)
 	case PHY_REG_KSZ87XX_EQ_INIT:
 		if (!ksz_is_ksz87xx(dev))
 			return -EOPNOTSUPP;
+		/* Only accept DSP EQ initial value bits [5:0] */
+		if (val & ~KSZ87XX_DSP_EQ_VALID_MASK)
+			return -EINVAL;
 		ret = ksz8_ind_write8(dev, TABLE_LINK_MD, KSZ87XX_REG_DSP_EQ, (u8)val);
 		if (ret)
 			return ret;
diff --git a/drivers/net/dsa/microchip/ksz8_reg.h b/drivers/net/dsa/microchip/ksz8_reg.h
index 5df17c463f7c..cd41214f874e 100644
--- a/drivers/net/dsa/microchip/ksz8_reg.h
+++ b/drivers/net/dsa/microchip/ksz8_reg.h
@@ -206,6 +206,9 @@
 #define KSZ87XX_REG_DSP_EQ			0x08   /* DSP EQ initial value */
 #define KSZ87XX_REG_PHY_LPF			0x4C   /* RX LPF bandwidth */
 
+#define KSZ87XX_DSP_EQ_VALID_MASK	GENMASK(5, 0)
+#define KSZ87XX_LPF_VALID_MASK		GENMASK(7, 6)
+
 /* For KSZ8765. */
 #define PORT_REMOTE_ASYM_PAUSE		BIT(5)
 #define PORT_REMOTE_SYM_PAUSE		BIT(4)
-- 
2.53.0
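
The mask checks added above can be tried out in userspace with the sketch
below. GENMASK_U16() is a simplified, hypothetical re-implementation of the
kernel's GENMASK() for 16-bit values, and check_reg_value() is an
illustrative helper, not a driver function.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified stand-in for the kernel GENMASK(): bits h..l set, h >= l. */
#define GENMASK_U16(h, l) \
	((uint16_t)(((1u << ((h) - (l) + 1)) - 1u) << (l)))

#define LPF_VALID_MASK		GENMASK_U16(7, 6)	/* 0x00C0 */
#define DSP_EQ_VALID_MASK	GENMASK_U16(5, 0)	/* 0x003F */

/* Reject any value that has bits set outside the valid mask. */
static int check_reg_value(uint16_t val, uint16_t valid_mask)
{
	return (val & ~valid_mask) ? -EINVAL : 0;
}
```

Since val is a u16 that is later truncated to u8, the check also rejects
values with any of the upper eight bits set.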



* [PATCH v2 0/6] selftests: net: multithread + rss_multiqueue support for iou-zcrx
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
  To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <20260408163816.2760-1-juanlu@fastmail.com>

Add multithread support to the iou-zcrx selftest, plus a new
rss_multiqueue Python variant that exercises multi-queue zero-copy
receive on a single listening socket with NAPI-ID-based dispatch.

v2:
 - merge iou-zcrx.c server changes, leaving iou-zcrx.py changes in the
   last patch (David)
 - Refactor server state into struct thread_ctx as a separate
   patch for a cleaner impl of the server side.
 - Rework server: main-thread epoll accepts an arbitrary number
   of connections; SO_INCOMING_NAPI_ID dispatches each to its worker.
   (David)
 - Drop unused thread_id field (David)
 - rss_multiqueue: use a single listening port with an RSS context
   spanning N queues; query NAPI IDs at runtime via netlink
   queue_get(); pass them to the binary via a new -n option

Link: https://lore.kernel.org/netdev/20260408163816.2760-1-juanlu@fastmail.com/

Juanlu Herrero (6):
  selftests: net: fix get_refill_ring_size() to use its local variable
  selftests: net: remove unused variable in process_recvzc()
  selftests: net: refactor server state into struct thread_ctx
  selftests: net: add multithread client support to iou-zcrx
  selftests: net: add multithread server support to iou-zcrx
  selftests: net: add rss_multiqueue test variant to iou-zcrx

 .../testing/selftests/drivers/net/hw/Makefile |   2 +-
 .../selftests/drivers/net/hw/iou-zcrx.c       | 379 ++++++++++++------
 .../selftests/drivers/net/hw/iou-zcrx.py      |  59 ++-
 3 files changed, 317 insertions(+), 123 deletions(-)

-- 
2.52.0
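
To make the NAPI-ID-based dispatch described in the cover letter concrete,
here is a minimal sketch of the lookup step only. It assumes the NAPI ID of
an accepted connection has already been read (for example with
getsockopt(SO_INCOMING_NAPI_ID)) and that each worker was configured with
one NAPI ID. The function name and array layout are hypothetical and not
taken from iou-zcrx.c.

```c
#include <stddef.h>

/*
 * Hypothetical lookup: return the index of the worker whose configured
 * NAPI ID equals napi_id, or -1 when no worker owns that ID.
 */
static int worker_for_napi_id(const unsigned int *worker_napi_ids,
			      size_t nr_workers, unsigned int napi_id)
{
	size_t i;

	for (i = 0; i < nr_workers; i++) {
		if (worker_napi_ids[i] == napi_id)
			return (int)i;
	}

	return -1;
}
```

A linear scan is enough for a selftest with a handful of queues; the caller
decides what to do with a connection whose NAPI ID matches no worker.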



* [PATCH v2 1/6] selftests: net: fix get_refill_ring_size() to use its local variable
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
  To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>

In preparation for multi-threaded RSS selftests, fix
get_refill_ring_size() to use its local `size` variable
instead of the global `ring_size`.

Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
 tools/testing/selftests/drivers/net/hw/iou-zcrx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 240d13dbc54e7..334985083f611 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -132,10 +132,10 @@ static inline size_t get_refill_ring_size(unsigned int rq_entries)
 {
 	size_t size;
 
-	ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe);
+	size = rq_entries * sizeof(struct io_uring_zcrx_rqe);
 	/* add space for the header (head/tail/etc.) */
-	ring_size += page_size;
-	return ALIGN_UP(ring_size, page_size);
+	size += page_size;
+	return ALIGN_UP(size, page_size);
 }
 
 static void setup_zcrx(struct io_uring *ring)
-- 
2.52.0
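
The size computation itself can be checked in isolation with the sketch
below. struct rqe_stub is a hypothetical stand-in for
struct io_uring_zcrx_rqe, ALIGN_UP() is written out here on the assumption
that the alignment is a power of two, and page_size is passed as a
parameter instead of being a global.

```c
#include <stddef.h>
#include <stdint.h>

/* Round v up to a multiple of align (align must be a power of two). */
#define ALIGN_UP(v, align) (((v) + (align) - 1) & ~((align) - 1))

/* Hypothetical stand-in for the refill queue entry. */
struct rqe_stub {
	uint64_t off;
	uint32_t len;
	uint32_t pad;
};

/* Refill ring bytes: the entries plus one page for the header, page aligned. */
static size_t refill_ring_size(unsigned int rq_entries, size_t page_size)
{
	size_t size;

	size = rq_entries * sizeof(struct rqe_stub);
	/* add space for the header (head/tail/etc.) */
	size += page_size;
	return ALIGN_UP(size, page_size);
}
```

Because the function touches only its arguments and a local, it can be
called from several threads without sharing state.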



* [PATCH v2 2/6] selftests: net: remove unused variable in process_recvzc()
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
  To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>

Remove the unused `sqe` variable in preparation for the multiqueue
RSS selftest changes to process_recvzc() in the following
commit.

Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
 tools/testing/selftests/drivers/net/hw/iou-zcrx.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 334985083f611..c15916311f0dd 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -269,7 +269,6 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
 	unsigned rq_mask = rq_ring.ring_entries - 1;
 	struct io_uring_zcrx_cqe *rcqe;
 	struct io_uring_zcrx_rqe *rqe;
-	struct io_uring_sqe *sqe;
 	uint64_t mask;
 	char *data;
 	ssize_t n;
-- 
2.52.0



* [PATCH v2 3/6] selftests: net: refactor server state into struct thread_ctx
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
  To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>

Move server-side state (io_uring ring, zcrx area, refill ring, receive
tracking) from global variables into a local struct thread_ctx. This is
a pure refactor with no behavior change: run_server still allocates a
single context on the stack and runs single-threaded, using io_uring
accept and recvzc as before.

This prepares the ground for the multithread server support in the
following commits, which spawn N worker threads, each with its own
struct thread_ctx.

Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
 .../selftests/drivers/net/hw/iou-zcrx.c       | 156 +++++++++---------
 1 file changed, 80 insertions(+), 76 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index c15916311f0dd..8dcb2f061f00a 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -87,14 +87,18 @@ static unsigned int cfg_rx_buf_len;
 static bool cfg_dry_run;
 
 static char *payload;
-static void *area_ptr;
-static void *ring_ptr;
-static size_t ring_size;
-static struct io_uring_zcrx_rq rq_ring;
-static unsigned long area_token;
-static int connfd;
-static bool stop;
-static size_t received;
+
+struct thread_ctx {
+	struct io_uring		ring;
+	void			*area_ptr;
+	void			*ring_ptr;
+	size_t			ring_size;
+	struct io_uring_zcrx_rq	rq_ring;
+	unsigned long		area_token;
+	int			connfd;
+	bool			stop;
+	size_t			received;
+};
 
 static unsigned long gettimeofday_ms(void)
 {
@@ -138,7 +142,7 @@ static inline size_t get_refill_ring_size(unsigned int rq_entries)
 	return ALIGN_UP(size, page_size);
 }
 
-static void setup_zcrx(struct io_uring *ring)
+static void setup_zcrx(struct thread_ctx *ctx)
 {
 	unsigned int ifindex;
 	unsigned int rq_entries = 4096;
@@ -149,44 +153,44 @@ static void setup_zcrx(struct io_uring *ring)
 		error(1, 0, "bad interface name: %s", cfg_ifname);
 
 	if (cfg_rx_buf_len && cfg_rx_buf_len != page_size) {
-		area_ptr = mmap(NULL,
-				AREA_SIZE,
-				PROT_READ | PROT_WRITE,
-				MAP_ANONYMOUS | MAP_PRIVATE |
-				MAP_HUGETLB | MAP_HUGE_2MB,
-				-1,
-				0);
-		if (area_ptr == MAP_FAILED) {
+		ctx->area_ptr = mmap(NULL,
+				     AREA_SIZE,
+				     PROT_READ | PROT_WRITE,
+				     MAP_ANONYMOUS | MAP_PRIVATE |
+				     MAP_HUGETLB | MAP_HUGE_2MB,
+				     -1,
+				     0);
+		if (ctx->area_ptr == MAP_FAILED) {
 			printf("Can't allocate huge pages\n");
 			exit(SKIP_CODE);
 		}
 	} else {
-		area_ptr = mmap(NULL,
-				AREA_SIZE,
-				PROT_READ | PROT_WRITE,
-				MAP_ANONYMOUS | MAP_PRIVATE,
-				0,
-				0);
-		if (area_ptr == MAP_FAILED)
+		ctx->area_ptr = mmap(NULL,
+				     AREA_SIZE,
+				     PROT_READ | PROT_WRITE,
+				     MAP_ANONYMOUS | MAP_PRIVATE,
+				     0,
+				     0);
+		if (ctx->area_ptr == MAP_FAILED)
 			error(1, 0, "mmap(): zero copy area");
 	}
 
-	ring_size = get_refill_ring_size(rq_entries);
-	ring_ptr = mmap(NULL,
-			ring_size,
-			PROT_READ | PROT_WRITE,
-			MAP_ANONYMOUS | MAP_PRIVATE,
-			0,
-			0);
+	ctx->ring_size = get_refill_ring_size(rq_entries);
+	ctx->ring_ptr = mmap(NULL,
+			     ctx->ring_size,
+			     PROT_READ | PROT_WRITE,
+			     MAP_ANONYMOUS | MAP_PRIVATE,
+			     0,
+			     0);
 
 	struct io_uring_region_desc region_reg = {
-		.size = ring_size,
-		.user_addr = (__u64)(unsigned long)ring_ptr,
+		.size = ctx->ring_size,
+		.user_addr = (__u64)(unsigned long)ctx->ring_ptr,
 		.flags = IORING_MEM_REGION_TYPE_USER,
 	};
 
 	struct io_uring_zcrx_area_reg area_reg = {
-		.addr = (__u64)(unsigned long)area_ptr,
+		.addr = (__u64)(unsigned long)ctx->area_ptr,
 		.len = AREA_SIZE,
 		.flags = 0,
 	};
@@ -200,7 +204,7 @@ static void setup_zcrx(struct io_uring *ring)
 		.rx_buf_len = cfg_rx_buf_len,
 	};
 
-	ret = io_uring_register_ifq(ring, (void *)&reg);
+	ret = io_uring_register_ifq(&ctx->ring, (void *)&reg);
 	if (cfg_rx_buf_len && (ret == -EINVAL || ret == -EOPNOTSUPP ||
 			       ret == -ERANGE)) {
 		printf("Large chunks are not supported %i\n", ret);
@@ -209,64 +213,64 @@ static void setup_zcrx(struct io_uring *ring)
 		error(1, 0, "io_uring_register_ifq(): %d", ret);
 	}
 
-	rq_ring.khead = (unsigned int *)((char *)ring_ptr + reg.offsets.head);
-	rq_ring.ktail = (unsigned int *)((char *)ring_ptr + reg.offsets.tail);
-	rq_ring.rqes = (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
-	rq_ring.rq_tail = 0;
-	rq_ring.ring_entries = reg.rq_entries;
+	ctx->rq_ring.khead = (unsigned int *)((char *)ctx->ring_ptr + reg.offsets.head);
+	ctx->rq_ring.ktail = (unsigned int *)((char *)ctx->ring_ptr + reg.offsets.tail);
+	ctx->rq_ring.rqes = (struct io_uring_zcrx_rqe *)((char *)ctx->ring_ptr + reg.offsets.rqes);
+	ctx->rq_ring.rq_tail = 0;
+	ctx->rq_ring.ring_entries = reg.rq_entries;
 
-	area_token = area_reg.rq_area_token;
+	ctx->area_token = area_reg.rq_area_token;
 }
 
-static void add_accept(struct io_uring *ring, int sockfd)
+static void add_accept(struct thread_ctx *ctx, int sockfd)
 {
 	struct io_uring_sqe *sqe;
 
-	sqe = io_uring_get_sqe(ring);
+	sqe = io_uring_get_sqe(&ctx->ring);
 
 	io_uring_prep_accept(sqe, sockfd, NULL, NULL, 0);
 	sqe->user_data = 1;
 }
 
-static void add_recvzc(struct io_uring *ring, int sockfd)
+static void add_recvzc(struct thread_ctx *ctx, int sockfd)
 {
 	struct io_uring_sqe *sqe;
 
-	sqe = io_uring_get_sqe(ring);
+	sqe = io_uring_get_sqe(&ctx->ring);
 
 	io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
 	sqe->ioprio |= IORING_RECV_MULTISHOT;
 	sqe->user_data = 2;
 }
 
-static void add_recvzc_oneshot(struct io_uring *ring, int sockfd, size_t len)
+static void add_recvzc_oneshot(struct thread_ctx *ctx, int sockfd, size_t len)
 {
 	struct io_uring_sqe *sqe;
 
-	sqe = io_uring_get_sqe(ring);
+	sqe = io_uring_get_sqe(&ctx->ring);
 
 	io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, len, 0);
 	sqe->ioprio |= IORING_RECV_MULTISHOT;
 	sqe->user_data = 2;
 }
 
-static void process_accept(struct io_uring *ring, struct io_uring_cqe *cqe)
+static void process_accept(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
 {
 	if (cqe->res < 0)
 		error(1, 0, "accept()");
-	if (connfd)
+	if (ctx->connfd)
 		error(1, 0, "Unexpected second connection");
 
-	connfd = cqe->res;
+	ctx->connfd = cqe->res;
 	if (cfg_oneshot)
-		add_recvzc_oneshot(ring, connfd, page_size);
+		add_recvzc_oneshot(ctx, ctx->connfd, page_size);
 	else
-		add_recvzc(ring, connfd);
+		add_recvzc(ctx, ctx->connfd);
 }
 
-static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
+static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
 {
-	unsigned rq_mask = rq_ring.ring_entries - 1;
+	unsigned rq_mask = ctx->rq_ring.ring_entries - 1;
 	struct io_uring_zcrx_cqe *rcqe;
 	struct io_uring_zcrx_rqe *rqe;
 	uint64_t mask;
@@ -275,7 +279,7 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
 	int i;
 
 	if (cqe->res == 0 && cqe->flags == 0 && cfg_oneshot_recvs == 0) {
-		stop = true;
+		ctx->stop = true;
 		return;
 	}
 
@@ -284,56 +288,56 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
 
 	if (cfg_oneshot) {
 		if (cqe->res == 0 && cqe->flags == 0 && cfg_oneshot_recvs) {
-			add_recvzc_oneshot(ring, connfd, page_size);
+			add_recvzc_oneshot(ctx, ctx->connfd, page_size);
 			cfg_oneshot_recvs--;
 		}
 	} else if (!(cqe->flags & IORING_CQE_F_MORE)) {
-		add_recvzc(ring, connfd);
+		add_recvzc(ctx, ctx->connfd);
 	}
 
 	rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
 
 	n = cqe->res;
 	mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
-	data = (char *)area_ptr + (rcqe->off & mask);
+	data = (char *)ctx->area_ptr + (rcqe->off & mask);
 
 	for (i = 0; i < n; i++) {
-		if (*(data + i) != payload[(received + i)])
+		if (*(data + i) != payload[(ctx->received + i)])
 			error(1, 0, "payload mismatch at %d", i);
 	}
-	received += n;
+	ctx->received += n;
 
-	rqe = &rq_ring.rqes[(rq_ring.rq_tail & rq_mask)];
-	rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | area_token;
+	rqe = &ctx->rq_ring.rqes[(ctx->rq_ring.rq_tail & rq_mask)];
+	rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | ctx->area_token;
 	rqe->len = cqe->res;
-	io_uring_smp_store_release(rq_ring.ktail, ++rq_ring.rq_tail);
+	io_uring_smp_store_release(ctx->rq_ring.ktail, ++ctx->rq_ring.rq_tail);
 }
 
-static void server_loop(struct io_uring *ring)
+static void server_loop(struct thread_ctx *ctx)
 {
 	struct io_uring_cqe *cqe;
 	unsigned int count = 0;
 	unsigned int head;
 	int i, ret;
 
-	io_uring_submit_and_wait(ring, 1);
+	io_uring_submit_and_wait(&ctx->ring, 1);
 
-	io_uring_for_each_cqe(ring, head, cqe) {
+	io_uring_for_each_cqe(&ctx->ring, head, cqe) {
 		if (cqe->user_data == 1)
-			process_accept(ring, cqe);
+			process_accept(ctx, cqe);
 		else if (cqe->user_data == 2)
-			process_recvzc(ring, cqe);
+			process_recvzc(ctx, cqe);
 		else
 			error(1, 0, "unknown cqe");
 		count++;
 	}
-	io_uring_cq_advance(ring, count);
+	io_uring_cq_advance(&ctx->ring, count);
 }
 
 static void run_server(void)
 {
+	struct thread_ctx ctx = {};
 	unsigned int flags = 0;
-	struct io_uring ring;
 	int fd, enable, ret;
 	uint64_t tstop;
 
@@ -359,19 +363,19 @@ static void run_server(void)
 	flags |= IORING_SETUP_SUBMIT_ALL;
 	flags |= IORING_SETUP_CQE32;
 
-	io_uring_queue_init(512, &ring, flags);
+	io_uring_queue_init(512, &ctx.ring, flags);
 
-	setup_zcrx(&ring);
+	setup_zcrx(&ctx);
 	if (cfg_dry_run)
 		return;
 
-	add_accept(&ring, fd);
+	add_accept(&ctx, fd);
 
 	tstop = gettimeofday_ms() + 5000;
-	while (!stop && gettimeofday_ms() < tstop)
-		server_loop(&ring);
+	while (!ctx.stop && gettimeofday_ms() < tstop)
+		server_loop(&ctx);
 
-	if (!stop)
+	if (!ctx.stop)
 		error(1, 0, "test failed\n");
 }
 
-- 
2.52.0



* [PATCH v2 4/6] selftests: net: add multithread client support to iou-zcrx
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
  To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>

Add pthreads to the iou-zcrx client so that multiple connections can be
established simultaneously. Each client thread connects to the server
and sends its payload independently.

Introduce the -t option to control the number of threads (default 1),
preserving backwards compatibility with existing tests.
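
A minimal sketch of the spawn-and-join pattern described above, for
illustration only. demo_worker() and run_workers() are hypothetical names
that do not appear in the patch; the real client_worker() connects to the
server and sends the payload instead of setting a flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for client_worker(): marks its slot as done. */
static void *demo_worker(void *arg)
{
	int *done = arg;

	*done = 1;
	return NULL;
}

/* Start nthreads workers, join them, return how many ran (-1 on error). */
static int run_workers(int nthreads)
{
	pthread_t *threads;
	int *done;
	int i, started = 0, ret = -1;

	assert(nthreads > 0);

	threads = calloc(nthreads, sizeof(*threads));
	done = calloc(nthreads, sizeof(*done));
	if (!threads || !done)
		goto out;

	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&threads[i], NULL, demo_worker, &done[i]))
			break;
		started++;
	}

	/* Join only the threads that were actually created. */
	for (i = 0; i < started; i++)
		pthread_join(threads[i], NULL);

	if (started == nthreads) {
		ret = 0;
		for (i = 0; i < nthreads; i++)
			ret += done[i];
	}
out:
	free(threads);
	free(done);
	return ret;
}
```

With one thread this degenerates to the old single-connection behaviour,
which is why a default of 1 keeps existing invocations working.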

Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
 .../testing/selftests/drivers/net/hw/Makefile |  2 +-
 .../selftests/drivers/net/hw/iou-zcrx.c       | 38 +++++++++++++++++--
 2 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 85ca4d1ecf9ec..4f8c3d0b6acdb 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -83,5 +83,5 @@ include ../../../net/ynl.mk
 include ../../../net/bpf.mk
 
 ifeq ($(HAS_IOURING_ZCRX),y)
-$(OUTPUT)/iou-zcrx: LDLIBS += -luring
+$(OUTPUT)/iou-zcrx: LDLIBS += -luring -lpthread
 endif
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 8dcb2f061f00a..6eb738ef4b5cc 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -4,6 +4,7 @@
 #include <error.h>
 #include <fcntl.h>
 #include <limits.h>
+#include <pthread.h>
 #include <stdbool.h>
 #include <stdint.h>
 #include <stdio.h>
@@ -85,6 +86,7 @@ static int cfg_send_size = SEND_SIZE;
 static struct sockaddr_in6 cfg_addr;
 static unsigned int cfg_rx_buf_len;
 static bool cfg_dry_run;
+static int cfg_num_threads = 1;
 
 static char *payload;
 
@@ -379,7 +381,7 @@ static void run_server(void)
 		error(1, 0, "test failed\n");
 }
 
-static void run_client(void)
+static void *client_worker(void *arg)
 {
 	ssize_t to_send = cfg_send_size;
 	ssize_t sent = 0;
@@ -405,12 +407,39 @@ static void run_client(void)
 	}
 
 	close(fd);
+	return NULL;
+}
+
+static void run_client(void)
+{
+	struct thread_ctx *ctxs;
+	pthread_t *threads;
+	int i, ret;
+
+	ctxs = calloc(cfg_num_threads, sizeof(*ctxs));
+	threads = calloc(cfg_num_threads, sizeof(*threads));
+	if (!ctxs || !threads)
+		error(1, 0, "calloc()");
+
+	for (i = 0; i < cfg_num_threads; i++) {
+		ret = pthread_create(&threads[i], NULL, client_worker,
+				     &ctxs[i]);
+		if (ret)
+			error(1, ret, "pthread_create()");
+	}
+
+	for (i = 0; i < cfg_num_threads; i++)
+		pthread_join(threads[i], NULL);
+
+	free(threads);
+	free(ctxs);
 }
 
 static void usage(const char *filepath)
 {
 	error(1, 0, "Usage: %s (-4|-6) (-s|-c) -h<server_ip> -p<port> "
-		    "-l<payload_size> -i<ifname> -q<rxq_id>", filepath);
+		    "-l<payload_size> -i<ifname> -q<rxq_id> -t<num_threads>",
+		    filepath);
 }
 
 static void parse_opts(int argc, char **argv)
@@ -428,7 +457,7 @@ static void parse_opts(int argc, char **argv)
 		usage(argv[0]);
 	cfg_payload_len = max_payload_len;
 
-	while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:d")) != -1) {
+	while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:dt:")) != -1) {
 		switch (c) {
 		case 's':
 			if (cfg_client)
@@ -469,6 +498,9 @@ static void parse_opts(int argc, char **argv)
 		case 'd':
 			cfg_dry_run = true;
 			break;
+		case 't':
+			cfg_num_threads = strtoul(optarg, NULL, 0);
+			break;
 		}
 	}
 
-- 
2.52.0



* [PATCH v2 5/6] selftests: net: add multithread server support to iou-zcrx
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
  To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>

Add a multithreaded server with a two-phase architecture: a main thread
runs an epoll loop on the listening socket and dispatches each accepted
connfd to a worker thread by direct array assignment. After the accept
loop ends, a barrier release lets each worker submit one
IORING_OP_RECV_ZC SQE per assigned connfd (tagged with a connection
index in user_data) and process completions in its own io_uring CQE
loop. Each per-worker connfd array has a single writer (main, before
barrier) and a single reader (the worker, after barrier), so no
eventfd, mutex, or queue is required.

With multiple queues, connections are dispatched to the correct worker
by SO_INCOMING_NAPI_ID using a NAPI-ID-to-thread lookup table populated
via a new -n option.

Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
 .../selftests/drivers/net/hw/iou-zcrx.c       | 238 +++++++++++++-----
 1 file changed, 171 insertions(+), 67 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 6eb738ef4b5cc..03ae5228cb5a4 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -87,8 +87,15 @@ static struct sockaddr_in6 cfg_addr;
 static unsigned int cfg_rx_buf_len;
 static bool cfg_dry_run;
 static int cfg_num_threads = 1;
+static int cfg_napi_ids[64];
+static int cfg_num_napi_ids;
 
 static char *payload;
+static pthread_barrier_t barrier;
+
+#define MAX_CONNS_PER_WORKER	64
+#define FIRST_ACCEPT_TIMEOUT_MS	4000
+#define ACCEPT_TIMEOUT_MS	200
 
 struct thread_ctx {
 	struct io_uring		ring;
@@ -97,9 +104,11 @@ struct thread_ctx {
 	size_t			ring_size;
 	struct io_uring_zcrx_rq	rq_ring;
 	unsigned long		area_token;
-	int			connfd;
-	bool			stop;
-	size_t			received;
+	int			queue_id;
+
+	int			connfds[MAX_CONNS_PER_WORKER];
+	size_t			received[MAX_CONNS_PER_WORKER];
+	int			nr_conns;
 };
 
 static unsigned long gettimeofday_ms(void)
@@ -199,7 +208,7 @@ static void setup_zcrx(struct thread_ctx *ctx)
 
 	struct t_io_uring_zcrx_ifq_reg reg = {
 		.if_idx = ifindex,
-		.if_rxq = cfg_queue_id,
+		.if_rxq = ctx->queue_id,
 		.rq_entries = rq_entries,
 		.area_ptr = (__u64)(unsigned long)&area_reg,
 		.region_ptr = (__u64)(unsigned long)&region_reg,
@@ -224,53 +233,32 @@ static void setup_zcrx(struct thread_ctx *ctx)
 	ctx->area_token = area_reg.rq_area_token;
 }
 
-static void add_accept(struct thread_ctx *ctx, int sockfd)
+static void add_recvzc(struct thread_ctx *ctx, int conn_idx)
 {
 	struct io_uring_sqe *sqe;
 
 	sqe = io_uring_get_sqe(&ctx->ring);
 
-	io_uring_prep_accept(sqe, sockfd, NULL, NULL, 0);
-	sqe->user_data = 1;
-}
-
-static void add_recvzc(struct thread_ctx *ctx, int sockfd)
-{
-	struct io_uring_sqe *sqe;
-
-	sqe = io_uring_get_sqe(&ctx->ring);
-
-	io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
+	io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, ctx->connfds[conn_idx],
+			 NULL, 0, 0);
 	sqe->ioprio |= IORING_RECV_MULTISHOT;
-	sqe->user_data = 2;
+	sqe->user_data = conn_idx;
 }
 
-static void add_recvzc_oneshot(struct thread_ctx *ctx, int sockfd, size_t len)
+static void add_recvzc_oneshot(struct thread_ctx *ctx, int conn_idx, size_t len)
 {
 	struct io_uring_sqe *sqe;
 
 	sqe = io_uring_get_sqe(&ctx->ring);
 
-	io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, len, 0);
+	io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, ctx->connfds[conn_idx],
+			 NULL, len, 0);
 	sqe->ioprio |= IORING_RECV_MULTISHOT;
-	sqe->user_data = 2;
+	sqe->user_data = conn_idx;
 }
 
-static void process_accept(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
-{
-	if (cqe->res < 0)
-		error(1, 0, "accept()");
-	if (ctx->connfd)
-		error(1, 0, "Unexpected second connection");
-
-	ctx->connfd = cqe->res;
-	if (cfg_oneshot)
-		add_recvzc_oneshot(ctx, ctx->connfd, page_size);
-	else
-		add_recvzc(ctx, ctx->connfd);
-}
-
-static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
+static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe,
+			   int conn_idx)
 {
 	unsigned rq_mask = ctx->rq_ring.ring_entries - 1;
 	struct io_uring_zcrx_cqe *rcqe;
@@ -281,7 +269,7 @@ static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
 	int i;
 
 	if (cqe->res == 0 && cqe->flags == 0 && cfg_oneshot_recvs == 0) {
-		ctx->stop = true;
+		ctx->nr_conns--;
 		return;
 	}
 
@@ -290,11 +278,11 @@ static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
 
 	if (cfg_oneshot) {
 		if (cqe->res == 0 && cqe->flags == 0 && cfg_oneshot_recvs) {
-			add_recvzc_oneshot(ctx, ctx->connfd, page_size);
+			add_recvzc_oneshot(ctx, conn_idx, page_size);
 			cfg_oneshot_recvs--;
 		}
 	} else if (!(cqe->flags & IORING_CQE_F_MORE)) {
-		add_recvzc(ctx, ctx->connfd);
+		add_recvzc(ctx, conn_idx);
 	}
 
 	rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
@@ -304,10 +292,10 @@ static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
 	data = (char *)ctx->area_ptr + (rcqe->off & mask);
 
 	for (i = 0; i < n; i++) {
-		if (*(data + i) != payload[(ctx->received + i)])
+		if (*(data + i) != payload[(ctx->received[conn_idx] + i)])
 			error(1, 0, "payload mismatch at %d", i);
 	}
-	ctx->received += n;
+	ctx->received[conn_idx] += n;
 
 	rqe = &ctx->rq_ring.rqes[(ctx->rq_ring.rq_tail & rq_mask)];
 	rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | ctx->area_token;
@@ -320,28 +308,80 @@ static void server_loop(struct thread_ctx *ctx)
 	struct io_uring_cqe *cqe;
 	unsigned int count = 0;
 	unsigned int head;
-	int i, ret;
 
 	io_uring_submit_and_wait(&ctx->ring, 1);
 
 	io_uring_for_each_cqe(&ctx->ring, head, cqe) {
-		if (cqe->user_data == 1)
-			process_accept(ctx, cqe);
-		else if (cqe->user_data == 2)
-			process_recvzc(ctx, cqe);
-		else
-			error(1, 0, "unknown cqe");
+		process_recvzc(ctx, cqe, cqe->user_data);
 		count++;
 	}
 	io_uring_cq_advance(&ctx->ring, count);
 }
 
-static void run_server(void)
+static void *server_worker(void *arg)
 {
-	struct thread_ctx ctx = {};
+	struct thread_ctx *ctx = arg;
 	unsigned int flags = 0;
-	int fd, enable, ret;
 	uint64_t tstop;
+	int i;
+
+	flags |= IORING_SETUP_COOP_TASKRUN;
+	flags |= IORING_SETUP_SINGLE_ISSUER;
+	flags |= IORING_SETUP_DEFER_TASKRUN;
+	flags |= IORING_SETUP_SUBMIT_ALL;
+	flags |= IORING_SETUP_CQE32;
+
+	io_uring_queue_init(512, &ctx->ring, flags);
+	setup_zcrx(ctx);
+
+	pthread_barrier_wait(&barrier);
+
+	if (cfg_dry_run)
+		return NULL;
+
+	pthread_barrier_wait(&barrier);
+
+	for (i = 0; i < ctx->nr_conns; i++) {
+		if (cfg_oneshot)
+			add_recvzc_oneshot(ctx, i, page_size);
+		else
+			add_recvzc(ctx, i);
+	}
+
+	tstop = gettimeofday_ms() + 5000;
+	while (ctx->nr_conns > 0 && gettimeofday_ms() < tstop)
+		server_loop(ctx);
+
+	if (ctx->nr_conns != 0)
+		error(1, 0, "test failed: %d connections incomplete",
+		      ctx->nr_conns);
+
+	return NULL;
+}
+
+static int find_thread_by_napi(int napi_id)
+{
+	int i;
+
+	for (i = 0; i < cfg_num_napi_ids; i++) {
+		if (cfg_napi_ids[i] == napi_id)
+			return i;
+	}
+	return -1;
+}
+
+static void run_server(void)
+{
+	struct epoll_event ev = { .events = EPOLLIN };
+	int timeout_ms = FIRST_ACCEPT_TIMEOUT_MS;
+	struct thread_ctx *ctxs;
+	pthread_t *threads;
+	int fd, epfd, ret, enable, i;
+
+	ctxs = calloc(cfg_num_threads, sizeof(*ctxs));
+	threads = calloc(cfg_num_threads, sizeof(*threads));
+	if (!ctxs || !threads)
+		error(1, 0, "calloc()");
 
 	fd = socket(AF_INET6, SOCK_STREAM, 0);
 	if (fd == -1)
@@ -359,26 +399,78 @@ static void run_server(void)
 	if (listen(fd, 1024) < 0)
 		error(1, 0, "listen()");
 
-	flags |= IORING_SETUP_COOP_TASKRUN;
-	flags |= IORING_SETUP_SINGLE_ISSUER;
-	flags |= IORING_SETUP_DEFER_TASKRUN;
-	flags |= IORING_SETUP_SUBMIT_ALL;
-	flags |= IORING_SETUP_CQE32;
+	pthread_barrier_init(&barrier, NULL, cfg_num_threads + 1);
 
-	io_uring_queue_init(512, &ctx.ring, flags);
+	for (i = 0; i < cfg_num_threads; i++)
+		ctxs[i].queue_id = cfg_queue_id + i;
+
+	for (i = 0; i < cfg_num_threads; i++) {
+		ret = pthread_create(&threads[i], NULL, server_worker, &ctxs[i]);
+		if (ret)
+			error(1, ret, "pthread_create()");
+	}
+
+	pthread_barrier_wait(&barrier);
 
-	setup_zcrx(&ctx);
 	if (cfg_dry_run)
-		return;
+		goto join;
 
-	add_accept(&ctx, fd);
+	epfd = epoll_create1(0);
+	if (epfd < 0)
+		error(1, errno, "epoll_create1()");
 
-	tstop = gettimeofday_ms() + 5000;
-	while (!ctx.stop && gettimeofday_ms() < tstop)
-		server_loop(&ctx);
+	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
+		error(1, errno, "epoll_ctl()");
+
+	while (1) {
+		struct epoll_event out_ev;
+		int nfds, idx, connfd;
+
+		nfds = epoll_wait(epfd, &out_ev, 1, timeout_ms);
+		if (nfds < 0)
+			error(1, errno, "epoll_wait()");
+		if (nfds == 0)
+			break;
+		timeout_ms = ACCEPT_TIMEOUT_MS;
+
+		connfd = accept(fd, NULL, NULL);
+		if (connfd < 0)
+			error(1, errno, "accept()");
+
+		if (cfg_num_napi_ids > 0) {
+			int napi_id;
+			socklen_t len = sizeof(napi_id);
+
+			ret = getsockopt(connfd, SOL_SOCKET,
+					 SO_INCOMING_NAPI_ID,
+					 &napi_id, &len);
+			if (ret < 0)
+				error(1, errno, "getsockopt(SO_INCOMING_NAPI_ID)");
+
+			idx = find_thread_by_napi(napi_id);
+			if (idx < 0)
+				error(1, 0, "unknown NAPI ID: %d", napi_id);
+		} else {
+			idx = 0;
+		}
+
+		if (ctxs[idx].nr_conns >= MAX_CONNS_PER_WORKER)
+			error(1, 0, "worker %d connection overflow", idx);
+		ctxs[idx].connfds[ctxs[idx].nr_conns++] = connfd;
+	}
 
-	if (!ctx.stop)
-		error(1, 0, "test failed\n");
+	close(epfd);
+
+	pthread_barrier_wait(&barrier);
+
+join:
+	for (i = 0; i < cfg_num_threads; i++)
+		pthread_join(threads[i], NULL);
+
+	pthread_barrier_destroy(&barrier);
+	close(fd);
+	free(threads);
+	free(ctxs);
 }
 
 static void *client_worker(void *arg)
@@ -438,8 +530,8 @@ static void run_client(void)
 static void usage(const char *filepath)
 {
 	error(1, 0, "Usage: %s (-4|-6) (-s|-c) -h<server_ip> -p<port> "
-		    "-l<payload_size> -i<ifname> -q<rxq_id> -t<num_threads>",
-		    filepath);
+		    "-l<payload_size> -i<ifname> -q<rxq_id> -t<num_threads> "
+		    "-n<napi_id_csv>", filepath);
 }
 
 static void parse_opts(int argc, char **argv)
@@ -457,7 +549,7 @@ static void parse_opts(int argc, char **argv)
 		usage(argv[0]);
 	cfg_payload_len = max_payload_len;
 
-	while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:dt:")) != -1) {
+	while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:dt:n:")) != -1) {
 		switch (c) {
 		case 's':
 			if (cfg_client)
@@ -501,6 +593,18 @@ static void parse_opts(int argc, char **argv)
 		case 't':
 			cfg_num_threads = strtoul(optarg, NULL, 0);
 			break;
+		case 'n': {
+			char *tok, *str = optarg;
+
+			cfg_num_napi_ids = 0;
+			while ((tok = strsep(&str, ",")) != NULL) {
+				if (cfg_num_napi_ids >= 64)
+					error(1, 0, "too many NAPI IDs");
+				cfg_napi_ids[cfg_num_napi_ids++] =
+					strtoul(tok, NULL, 0);
+			}
+			break;
+		}
 		}
 	}
 
-- 
2.52.0



* [PATCH v2 6/6] selftests: net: add rss_multiqueue test variant to iou-zcrx
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
  To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>

Add a new rss_multiqueue Python test variant that exercises multi-queue
zero-copy receive on a single listening socket, where the server
dispatches accepted connections to worker threads by SO_INCOMING_NAPI_ID.

The setup creates an RSS context spanning N receive queues and a single
flow rule that uses that context, then queries the NAPI ID for each
queue at runtime via netlink queue_get(). The NAPI IDs are passed to
the C binary via a new -n option so it can map each accepted connection
to the worker handling that NAPI's queue. The client spawns more
connections than worker threads to exercise multiple connections per
worker.
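
On the C side, the mapping from a comma-separated NAPI ID list to a worker
index could look roughly like the sketch below. It is an illustration under
assumptions: the demo_* names are hypothetical, and the parsing uses
strtoul() end pointers rather than whatever the binary actually does. The
position of an ID in the list doubles as the worker index.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_NAPI_IDS	64

struct demo_napi_map {
	int ids[DEMO_MAX_NAPI_IDS];
	int nr;
};

/* Parse "a,b,c" into map; return the number of IDs, or -1 on bad input. */
static int demo_parse_napi_csv(struct demo_napi_map *map, const char *csv)
{
	const char *p = csv;
	char *end;

	assert(map && csv);

	map->nr = 0;
	while (*p) {
		unsigned long val = strtoul(p, &end, 0);

		/* Reject non-numeric tokens and overlong lists. */
		if (end == p || map->nr >= DEMO_MAX_NAPI_IDS)
			return -1;
		map->ids[map->nr++] = (int)val;

		if (*end == ',')
			p = end + 1;
		else if (*end == '\0')
			break;
		else
			return -1;
	}
	return map->nr;
}

/* Return the worker index for napi_id, or -1 if it is not in the list. */
static int demo_worker_for_napi(const struct demo_napi_map *map, int napi_id)
{
	int i;

	for (i = 0; i < map->nr; i++) {
		if (map->ids[i] == napi_id)
			return i;
	}
	return -1;
}
```

A caller would also want to check that the returned index is below the
number of worker threads before using it to index per-worker state.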

Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
 .../selftests/drivers/net/hw/iou-zcrx.py      | 59 ++++++++++++++++++-
 1 file changed, 57 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
index e81724cb5542a..896376b26e01a 100755
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
@@ -29,6 +29,12 @@ def create_rss_ctx(cfg):
     return int(values)
 
 
+def create_rss_ctx_multi(cfg, start, count):
+    output = ethtool(f"-X {cfg.ifname} context new start {start} equal {count}").stdout
+    values = re.search(r'New RSS context is (\d+)', output).group(1)
+    return int(values)
+
+
 def set_flow_rule(cfg):
     output = ethtool(f"-N {cfg.ifname} flow-type tcp6 dst-port {cfg.port} action {cfg.target}").stdout
     values = re.search(r'ID (\d+)', output).group(1)
@@ -100,16 +106,65 @@ def rss(cfg):
     defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}")
 
 
+def rss_multiqueue(cfg):
+    channels = cfg.ethnl.channels_get({'header': {'dev-index': cfg.ifindex}})
+    channels = channels['combined-count']
+    if channels < 3:
+        raise KsftSkipEx('Test requires NETIF with at least 3 combined channels')
+
+    rings = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
+    rx_rings = rings['rx']
+    hds_thresh = rings.get('hds-thresh', 0)
+
+    cfg.ethnl.rings_set({'header': {'dev-index': cfg.ifindex},
+                         'tcp-data-split': 'enabled',
+                         'hds-thresh': 0,
+                         'rx': 64})
+    defer(cfg.ethnl.rings_set, {'header': {'dev-index': cfg.ifindex},
+                                'tcp-data-split': 'unknown',
+                                'hds-thresh': hds_thresh,
+                                'rx': rx_rings})
+    defer(mp_clear_wait, cfg)
+
+    cfg.num_threads = 2
+    cfg.target = channels - cfg.num_threads
+    ethtool(f"-X {cfg.ifname} equal {cfg.target}")
+    defer(ethtool, f"-X {cfg.ifname} default")
+
+    rss_ctx_id = create_rss_ctx_multi(cfg, cfg.target, cfg.num_threads)
+    defer(ethtool, f"-X {cfg.ifname} delete context {rss_ctx_id}")
+
+    flow_rule_id = set_flow_rule_rss(cfg, rss_ctx_id)
+    defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}")
+
+    napi_ids = []
+    for i in range(cfg.num_threads):
+        queue = cfg.netnl.queue_get({'ifindex': cfg.ifindex,
+                                     'id': cfg.target + i,
+                                     'type': 'rx'})
+        napi_ids.append(str(queue['napi-id']))
+    cfg.napi_ids = ','.join(napi_ids)
+
+
 @ksft_variants([
     KsftNamedVariant("single", single),
     KsftNamedVariant("rss", rss),
+    KsftNamedVariant("rss_multiqueue", rss_multiqueue),
 ])
 def test_zcrx(cfg, setup) -> None:
     cfg.require_ipver('6')
 
+    cfg.num_threads = 1
+    cfg.napi_ids = None
+
     setup(cfg)
-    rx_cmd = f"{cfg.bin_local} -s -p {cfg.port} -i {cfg.ifname} -q {cfg.target}"
-    tx_cmd = f"{cfg.bin_remote} -c -h {cfg.addr_v['6']} -p {cfg.port} -l 12840"
+
+    rx_cmd = (f"{cfg.bin_local} -s -p {cfg.port} -i {cfg.ifname} "
+              f"-q {cfg.target} -t {cfg.num_threads}")
+    if cfg.napi_ids:
+        rx_cmd += f" -n {cfg.napi_ids}"
+    tx_cmd = (f"{cfg.bin_remote} -c -h {cfg.addr_v['6']} -p {cfg.port} "
+              f"-l 12840 -t {cfg.num_threads}")
     with bkg(rx_cmd, exit_wait=True):
         wait_port_listen(cfg.port, proto="tcp")
         cmd(tx_cmd, host=cfg.remote)
-- 
2.52.0



* Re: [PATCH] rds: zero per-item info buffer before handing it to visitors
From: Sharath Srinivasan @ 2026-04-17 16:53 UTC (permalink / raw)
  To: Michael Bommarito, Allison Henderson, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, netdev, linux-rdma, rds-devel, linux-kernel
In-Reply-To: <20260417141916.494761-1-michael.bommarito@gmail.com>


On 2026-04-17 7:19 a.m., Michael Bommarito wrote:
> Yet another from my "clanker."  This only applies to people who
> don't use CONFIG_INIT_STACK_ALL_ZERO, but I presume that's
> still enough people that it's worth backporting since it can
> be chained through leaked addresses to defeat KASLR.
> 
> rds_for_each_conn_info() and rds_walk_conn_path_info() both hand a
> caller-allocated on-stack u64 buffer to a per-connection visitor and
> then copy the full item_len bytes back to user space via
> rds_info_copy() regardless of how much of the buffer the visitor
> actually wrote.
> 
> rds_ib_conn_info_visitor() and rds6_ib_conn_info_visitor() only
> write a subset of their output struct when the underlying
> rds_connection is not in state RDS_CONN_UP (src/dst addr, tos, sl
> and the two GIDs via explicit memsets).  Several u32 fields
> (max_send_wr, max_recv_wr, max_send_sge, rdma_mr_max, rdma_mr_size,
> cache_allocs) and the 2-byte alignment hole between sl and
> cache_allocs remain as whatever stack contents preceded the visitor
> call and are then memcpy_to_user()'d out to user space.
> 
> struct rds_info_rdma_connection and struct rds6_info_rdma_connection
> are the only rds_info_* structs in include/uapi/linux/rds.h that are
> not marked __attribute__((packed)), so they have a real alignment
> hole.  The other info visitors (rds_conn_info_visitor,
> rds6_conn_info_visitor, rds_tcp_tc_info, ...) write all fields of
> their packed output struct today and are not known to be vulnerable,
> but a future visitor that adds a conditional write-path would have
> the same bug.
> 
> Reproduction on a kernel built without CONFIG_INIT_STACK_ALL_ZERO=y:
> a local unprivileged user opens AF_RDS, sets SO_RDS_TRANSPORT=IB,
> binds to a local address on an RDMA-capable netdev (rxe soft-RoCE on
> any netdev is sufficient), sendto()'s any peer on the same subnet
> (fails cleanly but installs an rds_connection in the global hash in
> RDS_CONN_CONNECTING), then calls getsockopt(SOL_RDS,
> RDS_INFO_IB_CONNECTIONS).  The returned 68-byte item contains 26
> bytes of stack garbage including kernel text/data pointers:
> 
>     0..7   0a 63 00 01 0a 63 00 02     src=10.99.0.1 dst=10.99.0.2
>     8..39  00 ...                      gids (memset-zeroed)
>     40..47 e0 92 a3 81 ff ff ff ff     kernel pointer (max_send_wr)
>     48..55 7f 37 b5 81 ff ff ff ff     kernel pointer (rdma_mr_max)
>     56..59 01 00 08 00                 rdma_mr_size (garbage)
>     60..61 00 00                       tos, sl
>     62..63 00 00                       alignment padding
>     64..67 18 00 00 00                 cache_allocs (garbage)
> 
> Fix by zeroing the per-item buffer in both rds_for_each_conn_info()
> and rds_walk_conn_path_info() before invoking the visitor.  This
> covers the IPv4/IPv6 IB visitors and hardens all current and future
> visitors against the same class of bug.
> 
> No functional change for visitors that fully populate their output.
> 
> Fixes: ec16227e1414 ("RDS/IB: Infiniband transport")

LGTM. Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>

Thanks,
Sharath

> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Assisted-by: Claude:claude-opus-4-7
> ---
>  net/rds/connection.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/net/rds/connection.c b/net/rds/connection.c
> index 412441aaa298..c10b7ed06c49 100644
> --- a/net/rds/connection.c
> +++ b/net/rds/connection.c
> @@ -701,6 +701,13 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
>  	     i++, head++) {
>  		hlist_for_each_entry_rcu(conn, head, c_hash_node) {
>  
> +			/* Zero the per-item buffer before handing it to the
> +			 * visitor so any field the visitor does not write -
> +			 * including implicit alignment padding - cannot leak
> +			 * stack contents to user space via rds_info_copy().
> +			 */
> +			memset(buffer, 0, item_len);
> +
>  			/* XXX no c_lock usage.. */
>  			if (!visitor(conn, buffer))
>  				continue;
> @@ -750,6 +757,13 @@ static void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
>  			 */
>  			cp = conn->c_path;
>  
> +			/* Zero the per-item buffer for the same reason as
> +			 * rds_for_each_conn_info(): any byte the visitor
> +			 * does not write (including alignment padding) must
> +			 * not leak stack contents via rds_info_copy().
> +			 */
> +			memset(buffer, 0, item_len);
> +
>  			/* XXX no cp_lock usage.. */
>  			if (!visitor(cp, buffer))
>  				continue;
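
For illustration only (not part of the patch or the review): a userspace
sketch of the class of bug described in the quoted commit message. A
visitor that writes only some fields of an unpacked record leaves the
remaining fields and the alignment hole holding whatever was in the buffer
before, unless the caller zeroes it first. struct demo_info and the demo_*
functions are hypothetical and much smaller than the real
rds_info_rdma_connection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical unpacked record: two u8 fields before a u32 leave a hole. */
struct demo_info {
	uint32_t addr;
	uint8_t tos;
	uint8_t sl;
	/* compiler-inserted padding typically sits here */
	uint32_t cache_allocs;
};

/*
 * Visitor for a connection that is not up: fills addr, tos and sl only.
 * Fields are stored with memcpy() at their offsets so the demo stays
 * well-defined C; cache_allocs and the padding are never written.
 */
static void demo_partial_visitor(unsigned char *buffer)
{
	const uint32_t addr = 0x0a630001;	/* arbitrary example value */
	const uint8_t tos = 0, sl = 0;

	assert(buffer != NULL);
	memcpy(buffer + offsetof(struct demo_info, addr), &addr, sizeof(addr));
	memcpy(buffer + offsetof(struct demo_info, tos), &tos, sizeof(tos));
	memcpy(buffer + offsetof(struct demo_info, sl), &sl, sizeof(sl));
}

/* Return how many bytes of the item still hold the 0xAA poison pattern. */
static size_t demo_count_stale_bytes(int zero_first)
{
	unsigned char buffer[sizeof(struct demo_info)];
	size_t i, stale = 0;

	/* Simulate leftover stack contents. */
	memset(buffer, 0xAA, sizeof(buffer));

	/* The fix: clear the whole item before the visitor runs. */
	if (zero_first)
		memset(buffer, 0, sizeof(buffer));

	demo_partial_visitor(buffer);

	for (i = 0; i < sizeof(buffer); i++) {
		if (buffer[i] == 0xAA)
			stale++;
	}
	return stale;
}
```

Without the memset, at least the four cache_allocs bytes keep the poison
value; with it, every byte the visitor skips reads back as zero.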



* [PATCH net] ipv6: Implement limits on extension header parsing
From: Daniel Borkmann @ 2026-04-17 17:18 UTC (permalink / raw)
  To: kuba; +Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
	netdev

ipv6_{skip_exthdr,find_hdr}() and ip6_tnl_parse_tlv_enc_lim() iterate
over IPv6 extension headers until they find a non-extension-header
protocol or run out of packet data. The loops have no iteration counter,
relying solely on the packet length to bound them. For a crafted packet
with 8-byte extension headers filling a 64KB jumbogram, this means a
worst case of up to ~8k iterations with a skb_header_pointer call each.
ipv6_skip_exthdr(), for example, parses the inner quoted packet inside
an incoming ICMPv6 error:

  - icmpv6_rcv
    - checksum validation
    - case ICMPV6_DEST_UNREACH
      - icmpv6_notify
        - pskb_may_pull()       <- pull inner IPv6 header
        - ipv6_skip_exthdr()    <- iterates here
        - pskb_may_pull()
        - ipprot->err_handler() <- sk lookup (matching sk not required)

The per-iteration cost of ipv6_skip_exthdr itself is generally light,
but skb_header_pointer becomes more costly on reassembled packets: the
first ~1KB of the inner packet is in the skb's linear area, but the
remaining ~63KB is in the frag_list, where skb_copy_bits is needed to
read data.

Add a configurable limit via a new sysctl net.ipv6.max_ext_hdrs_number
(default 32, minimum 1). All three extension header walking functions
are bounded by this limit. The sysctl is in line with commit 47d3d7ac656a
("ipv6: Implement limits on Hop-by-Hop and Destination options"). The
value is read from init_net, since plumbing a struct net * through all
helpers would touch a lot of call sites.

The in-progress IETF draft draft-ietf-6man-eh-limits-18 states that
8 extension headers before the transport header is the baseline which
routers MUST handle; section 7 also details why limits are needed.
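
The counting pattern can be sketched in userspace as below. This is an
illustration under assumptions, not the kernel code: it walks a flat byte
buffer instead of an skb, only treats Hop-by-Hop (0), Routing (43) and
Destination Options (60) as extension headers, and all demo_* names are
hypothetical. Each header is assumed to use the generic layout where byte 0
is the next-header value and byte 1 is the length in 8-byte units, not
counting the first 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int demo_is_ext_hdr(uint8_t nexthdr)
{
	return nexthdr == 0 || nexthdr == 43 || nexthdr == 60;
}

/*
 * Skip chained extension headers in buf, starting at offset start with
 * *nexthdrp naming the first header type. Returns the offset of the first
 * non-extension header and updates *nexthdrp, or returns -1 if the chain
 * is truncated or longer than max_hdrs.
 */
static int demo_skip_exthdr(const uint8_t *buf, size_t len, size_t start,
			    uint8_t *nexthdrp, int max_hdrs)
{
	uint8_t nexthdr = *nexthdrp;
	int cnt = 0;

	while (demo_is_ext_hdr(nexthdr)) {
		size_t hdrlen;

		/* Bound the walk by header count, not only by length. */
		if (cnt++ >= max_hdrs)
			return -1;
		if (start > len || len - start < 2)
			return -1;

		hdrlen = ((size_t)buf[start + 1] + 1) * 8;
		if (hdrlen > len - start)
			return -1;

		nexthdr = buf[start];
		start += hdrlen;
	}

	*nexthdrp = nexthdr;
	return (int)start;
}
```

With a limit of N, a chain of exactly N extension headers is still
accepted and the walk gives up on the (N+1)th, matching the
`exthdr_cnt++ >= exthdr_max` test used in the patch.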

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 Documentation/networking/ip-sysctl.rst |  7 +++++++
 include/net/ipv6.h                     |  2 ++
 include/net/netns/ipv6.h               |  1 +
 net/ipv6/af_inet6.c                    |  1 +
 net/ipv6/exthdrs_core.c                | 11 +++++++++++
 net/ipv6/ip6_tunnel.c                  |  5 +++++
 net/ipv6/sysctl_net_ipv6.c             |  8 ++++++++
 7 files changed, 35 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 6921d8594b84..4559a956bbd9 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -2503,6 +2503,13 @@ max_hbh_length - INTEGER
 
 	Default: INT_MAX (unlimited)
 
+max_ext_hdrs_number - INTEGER
+	Maximum number of IPv6 extension headers allowed in a packet.
+	Limits how many extension headers will be traversed. The value
+	is read from the initial netns.
+
+	Default: 32
+
 skip_notify_on_dev_down - BOOLEAN
 	Controls whether an RTM_DELROUTE message is generated for routes
 	removed when a device is taken down or deleted. IPv4 does not
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 53c5056508be..d7f0d55e6918 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -90,6 +90,8 @@ struct ip_tunnel_info;
 #define IP6_DEFAULT_MAX_DST_OPTS_LEN	 INT_MAX /* No limit */
 #define IP6_DEFAULT_MAX_HBH_OPTS_LEN	 INT_MAX /* No limit */
 
+#define IP6_DEFAULT_MAX_EXT_HDRS_CNT	 32
+
 /*
  *	Addr type
  *	
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 34bdb1308e8f..5be4dd1c9ae8 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -54,6 +54,7 @@ struct netns_sysctl_ipv6 {
 	int max_hbh_opts_cnt;
 	int max_dst_opts_len;
 	int max_hbh_opts_len;
+	int max_ext_hdrs_cnt;
 	int seg6_flowlabel;
 	u32 ioam6_id;
 	u64 ioam6_id_wide;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 4cbd45b68088..ed7fe6e4a6bd 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -965,6 +965,7 @@ static int __net_init inet6_net_init(struct net *net)
 	net->ipv6.sysctl.flowlabel_state_ranges = 0;
 	net->ipv6.sysctl.max_dst_opts_cnt = IP6_DEFAULT_MAX_DST_OPTS_CNT;
 	net->ipv6.sysctl.max_hbh_opts_cnt = IP6_DEFAULT_MAX_HBH_OPTS_CNT;
+	net->ipv6.sysctl.max_ext_hdrs_cnt = IP6_DEFAULT_MAX_EXT_HDRS_CNT;
 	net->ipv6.sysctl.max_dst_opts_len = IP6_DEFAULT_MAX_DST_OPTS_LEN;
 	net->ipv6.sysctl.max_hbh_opts_len = IP6_DEFAULT_MAX_HBH_OPTS_LEN;
 	net->ipv6.sysctl.fib_notify_on_flag_change = 0;
diff --git a/net/ipv6/exthdrs_core.c b/net/ipv6/exthdrs_core.c
index 49e31e4ae7b7..917307877cbb 100644
--- a/net/ipv6/exthdrs_core.c
+++ b/net/ipv6/exthdrs_core.c
@@ -4,6 +4,8 @@
  * not configured or static.
  */
 #include <linux/export.h>
+
+#include <net/net_namespace.h>
 #include <net/ipv6.h>
 
 /*
@@ -72,7 +74,9 @@ EXPORT_SYMBOL(ipv6_ext_hdr);
 int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
 		     __be16 *frag_offp)
 {
+	int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
 	u8 nexthdr = *nexthdrp;
+	int exthdr_cnt = 0;
 
 	*frag_offp = 0;
 
@@ -80,6 +84,8 @@ int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
 		struct ipv6_opt_hdr _hdr, *hp;
 		int hdrlen;
 
+		if (unlikely(exthdr_cnt++ >= exthdr_max))
+			return -1;
 		if (nexthdr == NEXTHDR_NONE)
 			return -1;
 		hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
@@ -188,8 +194,10 @@ EXPORT_SYMBOL_GPL(ipv6_find_tlv);
 int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
 		  int target, unsigned short *fragoff, int *flags)
 {
+	int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
 	unsigned int start = skb_network_offset(skb) + sizeof(struct ipv6hdr);
 	u8 nexthdr = ipv6_hdr(skb)->nexthdr;
+	int exthdr_cnt = 0;
 	bool found;
 
 	if (fragoff)
@@ -216,6 +224,9 @@ int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
 			return -ENOENT;
 		}
 
+		if (unlikely(exthdr_cnt++ >= exthdr_max))
+			return -EBADMSG;
+
 		hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
 		if (!hp)
 			return -EBADMSG;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0b53488a9229..78e849e167ca 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -396,15 +396,20 @@ ip6_tnl_dev_uninit(struct net_device *dev)
 
 __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
 {
+	int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
 	const struct ipv6hdr *ipv6h = (const struct ipv6hdr *)raw;
 	unsigned int nhoff = raw - skb->data;
 	unsigned int off = nhoff + sizeof(*ipv6h);
 	u8 nexthdr = ipv6h->nexthdr;
+	int exthdr_cnt = 0;
 
 	while (ipv6_ext_hdr(nexthdr) && nexthdr != NEXTHDR_NONE) {
 		struct ipv6_opt_hdr *hdr;
 		u16 optlen;
 
+		if (unlikely(exthdr_cnt++ >= exthdr_max))
+			break;
+
 		if (!pskb_may_pull(skb, off + sizeof(*hdr)))
 			break;
 
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index d2cd33e2698d..93f865545a7c 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &flowlabel_reflect_max,
 	},
+	{
+		.procname	= "max_ext_hdrs_number",
+		.data		= &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ONE,
+	},
 	{
 		.procname	= "max_dst_opts_number",
 		.data		= &init_net.ipv6.sysctl.max_dst_opts_cnt,
-- 
2.43.0


^ permalink raw reply related

* [PATCH net v2] ibmveth: Disable GSO for packets with small MSS
From: Mingming Cao @ 2026-04-17 17:29 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, edumazet, pabeni, horms, bjking1, haren, ricklind,
	maddy, mpe, linuxppc-dev, stable, Mingming Cao, Shaik Abdulla,
	Naveed Ahmed

Some physical adapters on Power systems do not support segmentation
offload when the MSS is less than 224 bytes. Attempting to send such
packets causes the adapter to freeze, stopping all traffic until
manually reset.

Implement ndo_features_check to disable GSO for packets with small MSS
values. The network stack will perform software segmentation instead.
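
As an illustration only, the decision made by the new hook can be modelled
in plain userspace C. Everything named sketch_* or SKETCH_* below is
hypothetical and stands in for the kernel types; only the 224-byte
threshold and the "clear the GSO feature bits" behaviour come from this
patch. A gso_size of zero models a non-GSO packet, which is what
skb_is_gso() tests.

```c
#include <stdint.h>

/* Mirrors IBMVETH_MIN_LSO_MSS from the patch. */
#define SKETCH_MIN_LSO_MSS	224u

/* Hypothetical feature bits; the real NETIF_F_* values differ. */
#define SKETCH_F_SG		(1u << 0)
#define SKETCH_F_TSO		(1u << 1)
#define SKETCH_F_TSO6		(1u << 2)
#define SKETCH_F_GSO_MASK	(SKETCH_F_TSO | SKETCH_F_TSO6)

/*
 * Return the feature set usable for one packet. GSO packets whose MSS
 * is below the adapter minimum lose every GSO bit, so the stack would
 * fall back to software segmentation. Other packets are untouched.
 */
static uint32_t sketch_features_check(uint32_t gso_size, uint32_t features)
{
	if (gso_size != 0 && gso_size < SKETCH_MIN_LSO_MSS)
		features &= ~SKETCH_F_GSO_MASK;

	return features;
}
```

Non-GSO bits such as the hypothetical SKETCH_F_SG survive the check, so
only segmentation offload is affected for the small-MSS packet.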

The 224-byte minimum matches ibmvnic commit f10b09ef687f ("ibmvnic:
Enforce stronger sanity checks on GSO packets"), which uses the same
physical adapters in SEA configurations.

Validated using iptables to force small MSS values. Without the fix,
the adapter freezes. With the fix, packets are segmented in software
and transmission succeeds.

Fixes: 8641dd85799f ("ibmveth: Add support for TSO")
Cc: stable@vger.kernel.org
Reviewed-by: Brian King <bjking1@linux.ibm.com>
Tested-by: Shaik Abdulla <shaik.abdulla1@ibm.com>
Tested-by: Naveed Ahmed <naveedaus@in.ibm.com>
Signed-off-by: Mingming Cao <mmc@linux.ibm.com>
---
v2: Add Fixes tag as requested by automated checks 

 drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
 drivers/net/ethernet/ibm/ibmveth.h |  1 +
 2 files changed, 21 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 58cc3147afe2..7935c9384ef4 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1756,6 +1756,25 @@ static int ibmveth_set_mac_addr(struct net_device *dev, void *p)
 	return 0;
 }
 
+static netdev_features_t ibmveth_features_check(struct sk_buff *skb,
+						struct net_device *dev,
+						netdev_features_t features)
+{
+	/* Some physical adapters do not support segmentation offload with
+	 * MSS < 224. Disable GSO for such packets to avoid adapter freeze.
+	 */
+	if (skb_is_gso(skb)) {
+		if (skb_shinfo(skb)->gso_size < IBMVETH_MIN_LSO_MSS) {
+			netdev_warn_once(dev,
+					 "MSS %u too small for LSO, disabling GSO\n",
+					 skb_shinfo(skb)->gso_size);
+			features &= ~NETIF_F_GSO_MASK;
+		}
+	}
+
+	return features;
+}
+
 static const struct net_device_ops ibmveth_netdev_ops = {
 	.ndo_open		= ibmveth_open,
 	.ndo_stop		= ibmveth_close,
@@ -1767,6 +1786,7 @@ static const struct net_device_ops ibmveth_netdev_ops = {
 	.ndo_set_features	= ibmveth_set_features,
 	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_set_mac_address    = ibmveth_set_mac_addr,
+	.ndo_features_check	= ibmveth_features_check,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= ibmveth_poll_controller,
 #endif
diff --git a/drivers/net/ethernet/ibm/ibmveth.h b/drivers/net/ethernet/ibm/ibmveth.h
index 068f99df133e..d87713668ed3 100644
--- a/drivers/net/ethernet/ibm/ibmveth.h
+++ b/drivers/net/ethernet/ibm/ibmveth.h
@@ -37,6 +37,7 @@
 #define IBMVETH_ILLAN_IPV4_TCP_CSUM		0x0000000000000002UL
 #define IBMVETH_ILLAN_ACTIVE_TRUNK		0x0000000000000001UL
 
+#define IBMVETH_MIN_LSO_MSS		224	/* Minimum MSS for LSO */
 /* hcall macros */
 #define h_register_logical_lan(ua, buflst, rxq, fltlst, mac) \
   plpar_hcall_norets(H_REGISTER_LOGICAL_LAN, ua, buflst, rxq, fltlst, mac)
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related

* Re: [PATCH net-next 5/6] net: stmmac: move PHY handling out of __stmmac_open()/release()
From: Andrew Lunn @ 2026-04-17 17:30 UTC (permalink / raw)
  To: Alexander Stein
  Cc: Russell King (Oracle), Maxime Chevallier, Heiner Kallweit,
	Alexandre Torgue, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, linux-arm-kernel, linux-stm32, Maxime Coquelin,
	netdev, Paolo Abeni
In-Reply-To: <13939918.O9o76ZdvQC@steina-w>

> Thanks for confirming this for another PHY. What I'm wondering right now:
> Why is the PHY stopped in the first place? We are just changing the MTU, no?

It is not too uncommon to see an MTU change destroying everything and
rebuilding it, especially when it was retrofitted into an older driver
which had fixed MTU.

> I have a proof of concept running, but it needs more cleanup due
> to code duplication.

You probably also want to take a look at the ethtool code for
configuring rings. Oh, it is even worse:

int stmmac_reinit_ringparam(struct net_device *dev, u32 rx_size, u32 tx_size)
{
        struct stmmac_priv *priv = netdev_priv(dev);
        int ret = 0;

        if (netif_running(dev))
                stmmac_release(dev);

        priv->dma_conf.dma_rx_size = rx_size;
        priv->dma_conf.dma_tx_size = tx_size;

        if (netif_running(dev))
                ret = stmmac_open(dev);

        return ret;
}

So not even using __stmmac_release() or __stmmac_open(), and leaving
you with a dead interface if there is not enough memory to allocate
the rings.

These paths should really share the same code.
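
One way to avoid the dead interface, sketched here purely as an
assumption and not as what stmmac does or will do, is to allocate the new
rings before tearing down the old ones and to swap only on success. The
types and helpers below (sketch_dev, sketch_rings, sketch_alloc_rings and
so on) are hypothetical userspace stand-ins, and fail_next_alloc is just a
fault-injection hook for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct sketch_rings {
	unsigned int rx_size;
	unsigned int tx_size;
	void *rx;
	void *tx;
};

struct sketch_dev {
	struct sketch_rings rings;
	bool fail_next_alloc;	/* fault injection for the example */
};

static int sketch_alloc_rings(struct sketch_dev *dev, struct sketch_rings *r,
			      unsigned int rx_size, unsigned int tx_size)
{
	if (dev->fail_next_alloc) {
		dev->fail_next_alloc = false;
		return -ENOMEM;
	}

	r->rx = calloc(rx_size, 16);	/* 16 bytes per descriptor, arbitrary */
	r->tx = calloc(tx_size, 16);
	if (!r->rx || !r->tx) {
		free(r->rx);
		free(r->tx);
		return -ENOMEM;
	}
	r->rx_size = rx_size;
	r->tx_size = tx_size;
	return 0;
}

static void sketch_free_rings(struct sketch_rings *r)
{
	free(r->rx);
	free(r->tx);
	r->rx = NULL;
	r->tx = NULL;
}

/* Allocate first, swap second: a failure leaves the old rings in place. */
static int sketch_reinit_ringparam(struct sketch_dev *dev,
				   unsigned int rx_size, unsigned int tx_size)
{
	struct sketch_rings fresh = { 0 };
	int ret;

	assert(rx_size > 0 && tx_size > 0);

	ret = sketch_alloc_rings(dev, &fresh, rx_size, tx_size);
	if (ret)
		return ret;	/* old rings untouched */

	sketch_free_rings(&dev->rings);
	dev->rings = fresh;
	return 0;
}
```

The cost of this ordering is that both sets of rings exist briefly at the
same time, which may matter on memory-constrained systems.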

      Andrew

^ permalink raw reply

* [PATCH 0/9] bitfield: add FIELD_GET_SIGNED()
From: Yury Norov @ 2026-04-17 17:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra, Jonathan Cameron,
	David Lechner, Nuno Sá, Andy Shevchenko, Ping-Ke Shih,
	Richard Cochran, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexandre Belloni, Yury Norov,
	Rasmus Villemoes, Hans de Goede, Linus Walleij, Sakari Ailus,
	Salah Triki, Achim Gratz, Ben Collins, linux-kernel, linux-iio,
	linux-wireless, netdev, linux-rtc
  Cc: Yury Norov

The bitfield helpers are designed on the assumption that fields contain
unsigned integer values, so extracting a value from a field implies
zero-extension.

Some drivers need to sign-extend their fields, and currently do it like:

	dc_re += sign_extend32(FIELD_GET(0xfff000, tmp), 11);
	dc_im += sign_extend32(FIELD_GET(0xfff, tmp), 11);

This is error-prone because it relies on the user to provide the correct
index of the most significant bit.

This series adds a signed version of FIELD_GET(), which is more
convenient and compiles (on x86_64) to just a couple of instructions:
shl and sar.
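
For readers who want to see the shl/sar idea in isolation, here is a
userspace sketch. The sketch_* helpers are hypothetical stand-ins, not the
kernel macros, and they assume a non-zero contiguous mask plus GCC/Clang
behaviour (the __builtin_clzll()/__builtin_ctzll() builtins, wrapping
unsigned-to-signed conversion and arithmetic right shift of negative
values).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for sign_extend32(): 'index' is the sign bit position. */
static inline int32_t sketch_sign_extend32(uint32_t value, int index)
{
	int shift = 31 - index;

	return (int32_t)(value << shift) >> shift;
}

/* Zero-extending extract, in the spirit of FIELD_GET(). */
static inline uint64_t sketch_field_get(uint64_t mask, uint64_t reg)
{
	assert(mask != 0);
	return (reg & mask) >> __builtin_ctzll(mask);
}

/*
 * Sign-extending extract: shift the field's top bit up to bit 63 (shl),
 * then shift arithmetically back down past the field's low bit (sar).
 * No sign-bit index is passed in; it is derived from the mask.
 */
static inline int64_t sketch_field_get_signed(uint64_t mask, uint64_t reg)
{
	int lead, trail;

	assert(mask != 0);
	lead = __builtin_clzll(mask);
	trail = __builtin_ctzll(mask);

	return (int64_t)(reg << lead) >> (lead + trail);
}
```

With reg = 0xfff001, the 12-bit field under mask 0xfff000 is all ones and
comes back as -1, while the field under mask 0xfff comes back as 1.
Passing an off-by-one index to the sign_extend32() stand-in, for example
12 instead of 11, silently yields 4095 for the same all-ones field, which
is the class of mistake the series is trying to rule out.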

Patch #1 adds FIELD_GET_SIGNED(), and the rest of the series applies it
tree-wide.

Yury Norov (9):
  bitfield: add FIELD_GET_SIGNED()
  x86/extable: switch to using FIELD_GET_SIGNED()
  iio: intel_dc_ti_adc: switch to using
  iio: magnetometer: yas530: switch to using FIELD_GET_SIGNED()
  iio: pressure: bmp280: switch to using
  iio: mcp9600: switch to using FIELD_GET_SIGNED()
  wifi: rtw89: switch to using FIELD_GET_SIGNED()
  rtc: rv3032: switch to using FIELD_GET_SIGNED()
  ptp: switch to using FIELD_GET_SIGNED()

 arch/x86/include/asm/extable_fixup_types.h       | 13 ++++---------
 arch/x86/mm/extable.c                            |  2 +-
 drivers/iio/adc/intel_dc_ti_adc.c                |  4 ++--
 drivers/iio/magnetometer/yamaha-yas530.c         | 12 ++++++------
 drivers/iio/pressure/bmp280-core.c               |  2 +-
 drivers/iio/temperature/mcp9600.c                |  2 +-
 .../net/wireless/realtek/rtw89/rtw8852a_rfk.c    |  4 ++--
 .../net/wireless/realtek/rtw89/rtw8852b_common.c |  4 ++--
 .../net/wireless/realtek/rtw89/rtw8852b_rfk.c    |  4 ++--
 drivers/net/wireless/realtek/rtw89/rtw8852c.c    |  4 ++--
 drivers/ptp/ptp_fc3.c                            |  4 ++--
 drivers/rtc/rtc-rv3032.c                         |  2 +-
 include/linux/bitfield.h                         | 16 ++++++++++++++++
 13 files changed, 42 insertions(+), 31 deletions(-)

-- 
2.51.0


^ permalink raw reply

* [PATCH 1/9] bitfield: add FIELD_GET_SIGNED()
From: Yury Norov @ 2026-04-17 17:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra, Jonathan Cameron,
	David Lechner, Nuno Sá, Andy Shevchenko, Ping-Ke Shih,
	Richard Cochran, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexandre Belloni, Yury Norov,
	Rasmus Villemoes, Hans de Goede, Linus Walleij, Sakari Ailus,
	Salah Triki, Achim Gratz, Ben Collins, linux-kernel, linux-iio,
	linux-wireless, netdev, linux-rtc
  Cc: Yury Norov
In-Reply-To: <20260417173621.368914-1-ynorov@nvidia.com>

The bitfield helpers are designed on the assumption that fields contain
unsigned integer values, so extracting a value from a field implies
zero-extension.

Some drivers need to sign-extend their fields, and currently do it like:

	dc_re += sign_extend32(FIELD_GET(0xfff000, tmp), 11);
	dc_im += sign_extend32(FIELD_GET(0xfff, tmp), 11);

This is error-prone because it relies on the user to provide the correct
index of the most significant bit and to pick the proper 32-bit vs 64-bit
function flavor.

Thus, introduce a FIELD_GET_SIGNED() macro, which is more convenient and
compiles (on x86_64) to just a couple of instructions: shl and sar.

Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
 include/linux/bitfield.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/bitfield.h b/include/linux/bitfield.h
index 54aeeef1f0ec..35ef63972810 100644
--- a/include/linux/bitfield.h
+++ b/include/linux/bitfield.h
@@ -178,6 +178,22 @@
 		__FIELD_GET(_mask, _reg, "FIELD_GET: ");		\
 	})
 
+/**
+ * FIELD_GET_SIGNED() - extract a signed bitfield element
+ * @mask: shifted mask defining the field's length and position
+ * @reg:  value of entire bitfield
+ *
+ * Returns the sign-extended field specified by @mask from the
+ * bitfield passed in as @reg by masking and shifting it down.
+ */
+#define FIELD_GET_SIGNED(mask, reg)					\
+	({								\
+		__BF_FIELD_CHECK(mask, reg, 0U, "FIELD_GET_SIGNED: ");	\
+		 ((__signed_scalar_typeof(mask))((long long)(reg) <<	\
+		 __builtin_clzll(mask) >> (__builtin_clzll(mask) +	\
+						__builtin_ctzll(mask))));\
+	})
+
 /**
  * FIELD_MODIFY() - modify a bitfield element
  * @_mask: shifted mask defining the field's length and position
-- 
2.51.0


^ permalink raw reply related

* [PATCH 2/9] x86/extable: switch to using FIELD_GET_SIGNED()
From: Yury Norov @ 2026-04-17 17:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra, Jonathan Cameron,
	David Lechner, Nuno Sá, Andy Shevchenko, Ping-Ke Shih,
	Richard Cochran, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexandre Belloni, Yury Norov,
	Rasmus Villemoes, Hans de Goede, Linus Walleij, Sakari Ailus,
	Salah Triki, Achim Gratz, Ben Collins, linux-kernel, linux-iio,
	linux-wireless, netdev, linux-rtc
  Cc: Yury Norov
In-Reply-To: <20260417173621.368914-1-ynorov@nvidia.com>

The EX_DATA register is laid out such that EX_DATA_IMM occupies the most
significant bits. This is done to make sure that FIELD_GET() sign-extends
the IMM field during extraction.

To enforce that, all EX_DATA masks are made signed integers. This
works, but relies on the particular implementation of FIELD_GET(),
i.e. masking then shifting, not vice versa; and the particular
placement of the fields in the register.

Switch to using the dedicated FIELD_GET_SIGNED(), and relax those
limitations.
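
To make the two limitations concrete, the following userspace sketch
compares "mask then shift with a signed mask" against "shift left then
arithmetic shift right". The sketch_* helpers are hypothetical and only
model the arithmetic; they assume GCC/Clang semantics for
__builtin_clz()/__builtin_ctz(), signed conversion and right shift of
negative values.

```c
#include <stdint.h>

/*
 * Old approach: AND with a signed mask, then shift right. This
 * sign-extends only when the field includes bit 31, because otherwise
 * the masked value is never negative.
 */
static int32_t sketch_mask_then_shift(int32_t mask, int32_t data)
{
	return (data & mask) >> __builtin_ctz((uint32_t)mask);
}

/*
 * New approach: move the field's top bit to bit 31, then shift
 * arithmetically down. Works wherever the field sits in the word.
 */
static int32_t sketch_shl_then_sar(uint32_t mask, uint32_t data)
{
	int lead = __builtin_clz(mask);
	int trail = __builtin_ctz(mask);

	return (int32_t)(data << lead) >> (lead + trail);
}
```

For a 16-bit field holding 0xFFFE, both helpers return -2 when the mask is
0xFFFF0000. If the same field is moved to mask 0x00FFFF00, the
mask-then-shift variant returns 65534, while the shl/sar variant still
returns -2.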

Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
 arch/x86/include/asm/extable_fixup_types.h | 13 ++++---------
 arch/x86/mm/extable.c                      |  2 +-
 2 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
index 906b0d5541e8..fd0cfb472103 100644
--- a/arch/x86/include/asm/extable_fixup_types.h
+++ b/arch/x86/include/asm/extable_fixup_types.h
@@ -2,15 +2,10 @@
 #ifndef _ASM_X86_EXTABLE_FIXUP_TYPES_H
 #define _ASM_X86_EXTABLE_FIXUP_TYPES_H
 
-/*
- * Our IMM is signed, as such it must live at the top end of the word. Also,
- * since C99 hex constants are of ambiguous type, force cast the mask to 'int'
- * so that FIELD_GET() will DTRT and sign extend the value when it extracts it.
- */
-#define EX_DATA_TYPE_MASK		((int)0x000000FF)
-#define EX_DATA_REG_MASK		((int)0x00000F00)
-#define EX_DATA_FLAG_MASK		((int)0x0000F000)
-#define EX_DATA_IMM_MASK		((int)0xFFFF0000)
+#define EX_DATA_TYPE_MASK		(0x000000FF)
+#define EX_DATA_REG_MASK		(0x00000F00)
+#define EX_DATA_FLAG_MASK		(0x0000F000)
+#define EX_DATA_IMM_MASK		(0xFFFF0000)
 
 #define EX_DATA_REG_SHIFT		8
 #define EX_DATA_FLAG_SHIFT		12
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 6b9ff1c6cafa..ae663cf88a3c 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -322,7 +322,7 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 
 	type = FIELD_GET(EX_DATA_TYPE_MASK, e->data);
 	reg  = FIELD_GET(EX_DATA_REG_MASK,  e->data);
-	imm  = FIELD_GET(EX_DATA_IMM_MASK,  e->data);
+	imm  = FIELD_GET_SIGNED(EX_DATA_IMM_MASK, e->data);
 
 	switch (type) {
 	case EX_TYPE_DEFAULT:
-- 
2.51.0


^ permalink raw reply related

* [PATCH 5/9] iio: pressure: bmp280: switch to using FIELD_GET_SIGNED()
From: Yury Norov @ 2026-04-17 17:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra, Jonathan Cameron,
	David Lechner, Nuno Sá, Andy Shevchenko, Ping-Ke Shih,
	Richard Cochran, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexandre Belloni, Yury Norov,
	Rasmus Villemoes, Hans de Goede, Linus Walleij, Sakari Ailus,
	Salah Triki, Achim Gratz, Ben Collins, linux-kernel, linux-iio,
	linux-wireless, netdev, linux-rtc
  Cc: Yury Norov
In-Reply-To: <20260417173621.368914-1-ynorov@nvidia.com>

Switch from sign_extend32(FIELD_GET()) to the dedicated
FIELD_GET_SIGNED() and don't calculate the field's length explicitly.

Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
 drivers/iio/pressure/bmp280-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iio/pressure/bmp280-core.c b/drivers/iio/pressure/bmp280-core.c
index d983ce9c0b99..f722aea16e0e 100644
--- a/drivers/iio/pressure/bmp280-core.c
+++ b/drivers/iio/pressure/bmp280-core.c
@@ -392,7 +392,7 @@ static int bme280_read_calib(struct bmp280_data *data)
 	h4_lower = FIELD_GET(BME280_COMP_H4_MASK_LOW, tmp_1);
 	calib->H4 = sign_extend32(h4_upper | h4_lower, 11);
 	tmp_3 = get_unaligned_le16(&data->bme280_humid_cal_buf[H5]);
-	calib->H5 = sign_extend32(FIELD_GET(BME280_COMP_H5_MASK, tmp_3), 11);
+	calib->H5 = FIELD_GET_SIGNED(BME280_COMP_H5_MASK, tmp_3);
 	calib->H6 = data->bme280_humid_cal_buf[H6];
 
 	return 0;
-- 
2.51.0


^ permalink raw reply related

