* [PATCH net v2] ibmveth: Disable GSO for packets with small MSS
From: Mingming Cao @ 2026-04-17 17:29 UTC
To: netdev
Cc: davem, kuba, edumazet, pabeni, horms, bjking1, haren, ricklind,
maddy, mpe, linuxppc-dev, stable, Mingming Cao, Shaik Abdulla,
Naveed Ahmed
Some physical adapters on Power systems do not support segmentation
offload when the MSS is less than 224 bytes. Attempting to send such
packets causes the adapter to freeze, stopping all traffic until
manually reset.
Implement ndo_features_check to disable GSO for packets with small MSS
values. The network stack will perform software segmentation instead.
The 224-byte minimum matches ibmvnic commit f10b09ef687f ("ibmvnic:
Enforce stronger sanity checks on GSO packets"), which uses the same
physical adapters in SEA configurations.
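As a rough userspace illustration of the check described above (not the driver code itself), the sketch below models the decision on a reduced packet type. `struct pkt` and the `FEAT_*` bit values are hypothetical stand-ins for `struct sk_buff` and the kernel's `NETIF_F_*` flags; only the 224-byte threshold is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as IBMVETH_MIN_LSO_MSS in the patch. */
#define MIN_LSO_MSS 224u

/* Hypothetical feature bits for illustration; not the kernel's values. */
#define FEAT_CSUM     (1u << 0)
#define FEAT_TSO      (1u << 1)
#define FEAT_TSO6     (1u << 2)
#define FEAT_GSO_MASK (FEAT_TSO | FEAT_TSO6)

/* Hypothetical reduced packet: gso_size == 0 means "not a GSO packet". */
struct pkt {
	uint16_t gso_size;
};

/*
 * Return the feature set usable for this packet. GSO packets whose MSS is
 * below the threshold lose all segmentation-offload bits, so the caller
 * would have to segment them in software; other bits are left untouched.
 */
uint32_t features_check(const struct pkt *p, uint32_t features)
{
	assert(p != NULL);

	if (p->gso_size != 0 && p->gso_size < MIN_LSO_MSS)
		features &= ~FEAT_GSO_MASK;

	return features;
}
```

Note that the comparison is strict: an MSS of exactly 224 keeps offload enabled, and non-GSO packets are never affected.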
Validated using iptables to force small MSS values. Without the fix,
the adapter freezes. With the fix, packets are segmented in software
and transmission succeeds.
Fixes: 8641dd85799f ("ibmveth: Add support for TSO")
Cc: stable@vger.kernel.org
Reviewed-by: Brian King <bjking1@linux.ibm.com>
Tested-by: Shaik Abdulla <shaik.abdulla1@ibm.com>
Tested-by: Naveed Ahmed <naveedaus@in.ibm.com>
Signed-off-by: Mingming Cao <mmc@linux.ibm.com>
---
v2: Add Fixes tag as requested by automated checks
drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
drivers/net/ethernet/ibm/ibmveth.h | 1 +
2 files changed, 21 insertions(+)
diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 58cc3147afe2..7935c9384ef4 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1756,6 +1756,25 @@ static int ibmveth_set_mac_addr(struct net_device *dev, void *p)
return 0;
}
+static netdev_features_t ibmveth_features_check(struct sk_buff *skb,
+ struct net_device *dev,
+ netdev_features_t features)
+{
+ /* Some physical adapters do not support segmentation offload with
+ * MSS < 224. Disable GSO for such packets to avoid adapter freeze.
+ */
+ if (skb_is_gso(skb)) {
+ if (skb_shinfo(skb)->gso_size < IBMVETH_MIN_LSO_MSS) {
+ netdev_warn_once(dev,
+ "MSS %u too small for LSO, disabling GSO\n",
+ skb_shinfo(skb)->gso_size);
+ features &= ~NETIF_F_GSO_MASK;
+ }
+ }
+
+ return features;
+}
+
static const struct net_device_ops ibmveth_netdev_ops = {
.ndo_open = ibmveth_open,
.ndo_stop = ibmveth_close,
@@ -1767,6 +1786,7 @@ static const struct net_device_ops ibmveth_netdev_ops = {
.ndo_set_features = ibmveth_set_features,
.ndo_validate_addr = eth_validate_addr,
.ndo_set_mac_address = ibmveth_set_mac_addr,
+ .ndo_features_check = ibmveth_features_check,
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = ibmveth_poll_controller,
#endif
diff --git a/drivers/net/ethernet/ibm/ibmveth.h b/drivers/net/ethernet/ibm/ibmveth.h
index 068f99df133e..d87713668ed3 100644
--- a/drivers/net/ethernet/ibm/ibmveth.h
+++ b/drivers/net/ethernet/ibm/ibmveth.h
@@ -37,6 +37,7 @@
#define IBMVETH_ILLAN_IPV4_TCP_CSUM 0x0000000000000002UL
#define IBMVETH_ILLAN_ACTIVE_TRUNK 0x0000000000000001UL
+#define IBMVETH_MIN_LSO_MSS 224 /* Minimum MSS for LSO */
/* hcall macros */
#define h_register_logical_lan(ua, buflst, rxq, fltlst, mac) \
plpar_hcall_norets(H_REGISTER_LOGICAL_LAN, ua, buflst, rxq, fltlst, mac)
--
2.39.3 (Apple Git-146)
* [PATCH net] ipv6: Implement limits on extension header parsing
From: Daniel Borkmann @ 2026-04-17 17:18 UTC
To: kuba; +Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
netdev
ipv6_{skip_exthdr,find_hdr}() and ip6_tnl_parse_tlv_enc_lim() iterate
over IPv6 extension headers until they find a non-extension-header
protocol or run out of packet data. The loops have no iteration counter,
relying solely on the packet length to bound them. For a crafted packet
with 8-byte extension headers filling a 64KB jumbogram, this means a
worst case of up to ~8k iterations with a skb_header_pointer call each.
ipv6_skip_exthdr(), for example, is used where it parses the inner
quoted packet inside an incoming ICMPv6 error:
- icmpv6_rcv
- checksum validation
- case ICMPV6_DEST_UNREACH
- icmpv6_notify
- pskb_may_pull() <- pull inner IPv6 header
- ipv6_skip_exthdr() <- iterates here
- pskb_may_pull()
- ipprot->err_handler() <- sk lookup (matching sk not required)
The per-iteration cost of ipv6_skip_exthdr itself is generally light,
but skb_header_pointer becomes more costly on reassembled packets: the
first ~1KB of the inner packet are in the skb's linear area, but the
remaining ~63KB are in the frag_list where skb_copy_bits is needed to
read data.
Add a configurable limit via a new sysctl net.ipv6.max_ext_hdrs_number
(default 32, minimum 1). All three extension header walking functions
are bounded by this limit. The sysctl is in line with commit 47d3d7ac656a
("ipv6: Implement limits on Hop-by-Hop and Destination options"). The
value is read from init_net, since plumbing a struct net * through all
helpers would touch a lot of call sites.
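The effect of the counter can be sketched in userspace over a flat byte buffer. This is a simplified model, not the kernel helper: `is_ext_hdr()` and `skip_exthdrs()` are hypothetical names, only three header types are recognised, and every header is assumed to use the generic (hdrlen + 1) * 8 length encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NEXTHDR_HOP      0
#define NEXTHDR_TCP      6
#define NEXTHDR_ROUTING 43
#define NEXTHDR_DEST    60

/* Simplified: only these three types count as extension headers here. */
int is_ext_hdr(uint8_t nexthdr)
{
	return nexthdr == NEXTHDR_HOP || nexthdr == NEXTHDR_ROUTING ||
	       nexthdr == NEXTHDR_DEST;
}

/*
 * Walk the extension header chain in buf starting at offset start.
 * Returns the offset of the first non-extension header and stores its
 * type in *nexthdrp, or returns -1 if the buffer is too short or more
 * than max_hdrs extension headers are present.
 */
long skip_exthdrs(const uint8_t *buf, size_t len, size_t start,
		  uint8_t *nexthdrp, int max_hdrs)
{
	uint8_t nexthdr;
	int cnt = 0;

	assert(buf != NULL && nexthdrp != NULL);
	nexthdr = *nexthdrp;

	while (is_ext_hdr(nexthdr)) {
		size_t hdrlen;

		if (cnt++ >= max_hdrs)
			return -1;	/* iteration limit reached */
		if (start > len || len - start < 2)
			return -1;	/* header does not fit */

		hdrlen = ((size_t)buf[start + 1] + 1) * 8;
		if (hdrlen > len - start)
			return -1;

		nexthdr = buf[start];
		start += hdrlen;
	}

	*nexthdrp = nexthdr;
	return (long)start;
}
```

With a limit of 32, a chain of exactly 32 headers is still accepted; the walk gives up when it reaches a 33rd, so the worst case is bounded by the limit rather than by the packet length.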
The in-progress IETF draft draft-ietf-6man-eh-limits-18 states that
8 extension headers before the transport header is the baseline that
routers MUST handle; section 7 also details why limits are needed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
Documentation/networking/ip-sysctl.rst | 7 +++++++
include/net/ipv6.h | 2 ++
include/net/netns/ipv6.h | 1 +
net/ipv6/af_inet6.c | 1 +
net/ipv6/exthdrs_core.c | 11 +++++++++++
net/ipv6/ip6_tunnel.c | 5 +++++
net/ipv6/sysctl_net_ipv6.c | 8 ++++++++
7 files changed, 35 insertions(+)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 6921d8594b84..4559a956bbd9 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -2503,6 +2503,13 @@ max_hbh_length - INTEGER
Default: INT_MAX (unlimited)
+max_ext_hdrs_number - INTEGER
+ Maximum number of IPv6 extension headers allowed in a packet.
+ Limits how many extension headers will be traversed. The value
+ is read from the initial netns.
+
+ Default: 32
+
skip_notify_on_dev_down - BOOLEAN
Controls whether an RTM_DELROUTE message is generated for routes
removed when a device is taken down or deleted. IPv4 does not
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 53c5056508be..d7f0d55e6918 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -90,6 +90,8 @@ struct ip_tunnel_info;
#define IP6_DEFAULT_MAX_DST_OPTS_LEN INT_MAX /* No limit */
#define IP6_DEFAULT_MAX_HBH_OPTS_LEN INT_MAX /* No limit */
+#define IP6_DEFAULT_MAX_EXT_HDRS_CNT 32
+
/*
* Addr type
*
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 34bdb1308e8f..5be4dd1c9ae8 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -54,6 +54,7 @@ struct netns_sysctl_ipv6 {
int max_hbh_opts_cnt;
int max_dst_opts_len;
int max_hbh_opts_len;
+ int max_ext_hdrs_cnt;
int seg6_flowlabel;
u32 ioam6_id;
u64 ioam6_id_wide;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 4cbd45b68088..ed7fe6e4a6bd 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -965,6 +965,7 @@ static int __net_init inet6_net_init(struct net *net)
net->ipv6.sysctl.flowlabel_state_ranges = 0;
net->ipv6.sysctl.max_dst_opts_cnt = IP6_DEFAULT_MAX_DST_OPTS_CNT;
net->ipv6.sysctl.max_hbh_opts_cnt = IP6_DEFAULT_MAX_HBH_OPTS_CNT;
+ net->ipv6.sysctl.max_ext_hdrs_cnt = IP6_DEFAULT_MAX_EXT_HDRS_CNT;
net->ipv6.sysctl.max_dst_opts_len = IP6_DEFAULT_MAX_DST_OPTS_LEN;
net->ipv6.sysctl.max_hbh_opts_len = IP6_DEFAULT_MAX_HBH_OPTS_LEN;
net->ipv6.sysctl.fib_notify_on_flag_change = 0;
diff --git a/net/ipv6/exthdrs_core.c b/net/ipv6/exthdrs_core.c
index 49e31e4ae7b7..917307877cbb 100644
--- a/net/ipv6/exthdrs_core.c
+++ b/net/ipv6/exthdrs_core.c
@@ -4,6 +4,8 @@
* not configured or static.
*/
#include <linux/export.h>
+
+#include <net/net_namespace.h>
#include <net/ipv6.h>
/*
@@ -72,7 +74,9 @@ EXPORT_SYMBOL(ipv6_ext_hdr);
int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
__be16 *frag_offp)
{
+ int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
u8 nexthdr = *nexthdrp;
+ int exthdr_cnt = 0;
*frag_offp = 0;
@@ -80,6 +84,8 @@ int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
struct ipv6_opt_hdr _hdr, *hp;
int hdrlen;
+ if (unlikely(exthdr_cnt++ >= exthdr_max))
+ return -1;
if (nexthdr == NEXTHDR_NONE)
return -1;
hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
@@ -188,8 +194,10 @@ EXPORT_SYMBOL_GPL(ipv6_find_tlv);
int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
int target, unsigned short *fragoff, int *flags)
{
+ int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
unsigned int start = skb_network_offset(skb) + sizeof(struct ipv6hdr);
u8 nexthdr = ipv6_hdr(skb)->nexthdr;
+ int exthdr_cnt = 0;
bool found;
if (fragoff)
@@ -216,6 +224,9 @@ int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
return -ENOENT;
}
+ if (unlikely(exthdr_cnt++ >= exthdr_max))
+ return -EBADMSG;
+
hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
if (!hp)
return -EBADMSG;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0b53488a9229..78e849e167ca 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -396,15 +396,20 @@ ip6_tnl_dev_uninit(struct net_device *dev)
__u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
{
+ int exthdr_max = READ_ONCE(init_net.ipv6.sysctl.max_ext_hdrs_cnt);
const struct ipv6hdr *ipv6h = (const struct ipv6hdr *)raw;
unsigned int nhoff = raw - skb->data;
unsigned int off = nhoff + sizeof(*ipv6h);
u8 nexthdr = ipv6h->nexthdr;
+ int exthdr_cnt = 0;
while (ipv6_ext_hdr(nexthdr) && nexthdr != NEXTHDR_NONE) {
struct ipv6_opt_hdr *hdr;
u16 optlen;
+ if (unlikely(exthdr_cnt++ >= exthdr_max))
+ break;
+
if (!pskb_may_pull(skb, off + sizeof(*hdr)))
break;
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index d2cd33e2698d..93f865545a7c 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -135,6 +135,14 @@ static struct ctl_table ipv6_table_template[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = &flowlabel_reflect_max,
},
+ {
+ .procname = "max_ext_hdrs_number",
+ .data = &init_net.ipv6.sysctl.max_ext_hdrs_cnt,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ONE,
+ },
{
.procname = "max_dst_opts_number",
.data = &init_net.ipv6.sysctl.max_dst_opts_cnt,
--
2.43.0
* Re: [PATCH] rds: zero per-item info buffer before handing it to visitors
From: Sharath Srinivasan @ 2026-04-17 16:53 UTC
To: Michael Bommarito, Allison Henderson, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, netdev, linux-rdma, rds-devel, linux-kernel
In-Reply-To: <20260417141916.494761-1-michael.bommarito@gmail.com>
On 2026-04-17 7:19 a.m., Michael Bommarito wrote:
> Yet another from my "clanker." This only applies to people who
> don't use CONFIG_INIT_STACK_ALL_ZERO, but I presume that's
> still enough people that it's worth backporting since it can
> be chained through leaked addresses to defeat KASLR.
>
> rds_for_each_conn_info() and rds_walk_conn_path_info() both hand a
> caller-allocated on-stack u64 buffer to a per-connection visitor and
> then copy the full item_len bytes back to user space via
> rds_info_copy() regardless of how much of the buffer the visitor
> actually wrote.
>
> rds_ib_conn_info_visitor() and rds6_ib_conn_info_visitor() only
> write a subset of their output struct when the underlying
> rds_connection is not in state RDS_CONN_UP (src/dst addr, tos, sl
> and the two GIDs via explicit memsets). Several u32 fields
> (max_send_wr, max_recv_wr, max_send_sge, rdma_mr_max, rdma_mr_size,
> cache_allocs) and the 2-byte alignment hole between sl and
> cache_allocs remain as whatever stack contents preceded the visitor
> call and are then memcpy_to_user()'d out to user space.
>
> struct rds_info_rdma_connection and struct rds6_info_rdma_connection
> are the only rds_info_* structs in include/uapi/linux/rds.h that are
> not marked __attribute__((packed)), so they have a real alignment
> hole. The other info visitors (rds_conn_info_visitor,
> rds6_conn_info_visitor, rds_tcp_tc_info, ...) write all fields of
> their packed output struct today and are not known to be vulnerable,
> but a future visitor that adds a conditional write-path would have
> the same bug.
>
> Reproduction on a kernel built without CONFIG_INIT_STACK_ALL_ZERO=y:
> a local unprivileged user opens AF_RDS, sets SO_RDS_TRANSPORT=IB,
> binds to a local address on an RDMA-capable netdev (rxe soft-RoCE on
> any netdev is sufficient), sendto()'s any peer on the same subnet
> (fails cleanly but installs an rds_connection in the global hash in
> RDS_CONN_CONNECTING), then calls getsockopt(SOL_RDS,
> RDS_INFO_IB_CONNECTIONS). The returned 68-byte item contains 26
> bytes of stack garbage including kernel text/data pointers:
>
> 0..7 0a 63 00 01 0a 63 00 02 src=10.99.0.1 dst=10.99.0.2
> 8..39 00 ... gids (memset-zeroed)
> 40..47 e0 92 a3 81 ff ff ff ff kernel pointer (max_send_wr)
> 48..55 7f 37 b5 81 ff ff ff ff kernel pointer (rdma_mr_max)
> 56..59 01 00 08 00 rdma_mr_size (garbage)
> 60..61 00 00 tos, sl
> 62..63 00 00 alignment padding
> 64..67 18 00 00 00 cache_allocs (garbage)
>
> Fix by zeroing the per-item buffer in both rds_for_each_conn_info()
> and rds_walk_conn_path_info() before invoking the visitor. This
> covers the IPv4/IPv6 IB visitors and hardens all current and future
> visitors against the same class of bug.
>
> No functional change for visitors that fully populate their output.
>
> Fixes: ec16227e1414 ("RDS/IB: Infiniband transport")
LGTM. Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Thanks,
Sharath
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Assisted-by: Claude:claude-opus-4-7
> ---
> net/rds/connection.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/net/rds/connection.c b/net/rds/connection.c
> index 412441aaa298..c10b7ed06c49 100644
> --- a/net/rds/connection.c
> +++ b/net/rds/connection.c
> @@ -701,6 +701,13 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
> i++, head++) {
> hlist_for_each_entry_rcu(conn, head, c_hash_node) {
>
> + /* Zero the per-item buffer before handing it to the
> + * visitor so any field the visitor does not write -
> + * including implicit alignment padding - cannot leak
> + * stack contents to user space via rds_info_copy().
> + */
> + memset(buffer, 0, item_len);
> +
> /* XXX no c_lock usage.. */
> if (!visitor(conn, buffer))
> continue;
> @@ -750,6 +757,13 @@ static void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
> */
> cp = conn->c_path;
>
> + /* Zero the per-item buffer for the same reason as
> + * rds_for_each_conn_info(): any byte the visitor
> + * does not write (including alignment padding) must
> + * not leak stack contents via rds_info_copy().
> + */
> + memset(buffer, 0, item_len);
> +
> /* XXX no cp_lock usage.. */
> if (!visitor(cp, buffer))
> continue;
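The class of bug addressed by the quoted patch can be reproduced in miniature in userspace. The sketch below is an illustration under stated assumptions, not RDS code: `struct info_item`, `partial_visitor()` and `visit_and_copy()` are hypothetical names, and the struct only loosely imitates an unpacked record with fields the visitor may skip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical unpacked record. On common ABIs two bytes of implicit
 * padding sit between sl and cache_allocs.
 */
struct info_item {
	uint32_t addr;
	uint8_t  tos;
	uint8_t  sl;
	uint32_t cache_allocs;
};

/* Visitor that fills only part of the record, like a connection not yet up. */
int partial_visitor(void *buffer)
{
	struct info_item *item = buffer;

	item->addr = 0x0a630001;
	item->tos = 0;
	item->sl = 0;
	/* cache_allocs deliberately left unwritten */
	return 1;
}

/*
 * Zero the scratch buffer before each visit, then copy the full item out.
 * Without the memset, unwritten fields and padding would carry whatever
 * the scratch buffer held before the call.
 */
void visit_and_copy(int (*visitor)(void *), void *buffer, size_t item_len,
		    uint8_t *out)
{
	assert(visitor != NULL && buffer != NULL && out != NULL);

	memset(buffer, 0, item_len);
	if (visitor(buffer))
		memcpy(out, buffer, item_len);
}
```

Zeroing in the walker rather than in each visitor is the design choice the patch makes: it costs one memset per item and does not depend on every visitor writing every field.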
* [PATCH v2 6/6] selftests: net: add rss_multiqueue test variant to iou-zcrx
From: Juanlu Herrero @ 2026-04-17 16:49 UTC
To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>
Add a new rss_multiqueue Python test variant that exercises multi-queue
zero-copy receive on a single listening socket, where the server
dispatches accepted connections to worker threads by SO_INCOMING_NAPI_ID.
The setup creates an RSS context spanning N receive queues and a single
flow rule that uses that context, then queries the NAPI ID for each
queue at runtime via netlink queue_get(). The NAPI IDs are passed to
the C binary via a new -n option so it can map each accepted connection
to the worker handling that NAPI's queue. The client spawns more
connections than worker threads to exercise multiple connections per
worker.
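The NAPI-ID-to-worker mapping described above can be sketched as a small lookup table built from the comma-separated argument. This is a minimal sketch, not the test binary's code: `struct napi_map`, `napi_map_parse()` and `napi_map_find()` are hypothetical names, and the parser walks `strtoul()` end pointers as one possible way to split the list. The position of an ID in the list is taken to be the worker index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NAPI_IDS 64

/* Position in ids[] doubles as the worker index. */
struct napi_map {
	int ids[MAX_NAPI_IDS];
	int count;
};

/* Parse "id0,id1,..." into map. Returns 0 on success, -1 on bad input. */
int napi_map_parse(struct napi_map *map, const char *csv)
{
	const char *p = csv;

	assert(map != NULL && csv != NULL);
	map->count = 0;

	while (*p != '\0') {
		unsigned long v;
		char *end;

		if (map->count >= MAX_NAPI_IDS)
			return -1;	/* too many IDs */

		v = strtoul(p, &end, 0);
		if (end == p)
			return -1;	/* no digits where an ID was expected */
		map->ids[map->count++] = (int)v;

		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -1;	/* unexpected separator */
		p = end;
	}
	return 0;
}

/* Return the worker index for napi_id, or -1 if it is not in the map. */
int napi_map_find(const struct napi_map *map, int napi_id)
{
	int i;

	for (i = 0; i < map->count; i++) {
		if (map->ids[i] == napi_id)
			return i;
	}
	return -1;
}
```

A linear scan is adequate here because the table holds one entry per worker thread; a connection whose NAPI ID is absent from the table maps to -1 and can be treated as an error by the caller.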
Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
.../selftests/drivers/net/hw/iou-zcrx.py | 59 ++++++++++++++++++-
1 file changed, 57 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
index e81724cb5542a..896376b26e01a 100755
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
@@ -29,6 +29,12 @@ def create_rss_ctx(cfg):
return int(values)
+def create_rss_ctx_multi(cfg, start, count):
+ output = ethtool(f"-X {cfg.ifname} context new start {start} equal {count}").stdout
+ values = re.search(r'New RSS context is (\d+)', output).group(1)
+ return int(values)
+
+
def set_flow_rule(cfg):
output = ethtool(f"-N {cfg.ifname} flow-type tcp6 dst-port {cfg.port} action {cfg.target}").stdout
values = re.search(r'ID (\d+)', output).group(1)
@@ -100,16 +106,65 @@ def rss(cfg):
defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}")
+def rss_multiqueue(cfg):
+ channels = cfg.ethnl.channels_get({'header': {'dev-index': cfg.ifindex}})
+ channels = channels['combined-count']
+ if channels < 3:
+ raise KsftSkipEx('Test requires NETIF with at least 3 combined channels')
+
+ rings = cfg.ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
+ rx_rings = rings['rx']
+ hds_thresh = rings.get('hds-thresh', 0)
+
+ cfg.ethnl.rings_set({'header': {'dev-index': cfg.ifindex},
+ 'tcp-data-split': 'enabled',
+ 'hds-thresh': 0,
+ 'rx': 64})
+ defer(cfg.ethnl.rings_set, {'header': {'dev-index': cfg.ifindex},
+ 'tcp-data-split': 'unknown',
+ 'hds-thresh': hds_thresh,
+ 'rx': rx_rings})
+ defer(mp_clear_wait, cfg)
+
+ cfg.num_threads = 2
+ cfg.target = channels - cfg.num_threads
+ ethtool(f"-X {cfg.ifname} equal {cfg.target}")
+ defer(ethtool, f"-X {cfg.ifname} default")
+
+ rss_ctx_id = create_rss_ctx_multi(cfg, cfg.target, cfg.num_threads)
+ defer(ethtool, f"-X {cfg.ifname} delete context {rss_ctx_id}")
+
+ flow_rule_id = set_flow_rule_rss(cfg, rss_ctx_id)
+ defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}")
+
+ napi_ids = []
+ for i in range(cfg.num_threads):
+ queue = cfg.netnl.queue_get({'ifindex': cfg.ifindex,
+ 'id': cfg.target + i,
+ 'type': 'rx'})
+ napi_ids.append(str(queue['napi-id']))
+ cfg.napi_ids = ','.join(napi_ids)
+
+
@ksft_variants([
KsftNamedVariant("single", single),
KsftNamedVariant("rss", rss),
+ KsftNamedVariant("rss_multiqueue", rss_multiqueue),
])
def test_zcrx(cfg, setup) -> None:
cfg.require_ipver('6')
+ cfg.num_threads = 1
+ cfg.napi_ids = None
+
setup(cfg)
- rx_cmd = f"{cfg.bin_local} -s -p {cfg.port} -i {cfg.ifname} -q {cfg.target}"
- tx_cmd = f"{cfg.bin_remote} -c -h {cfg.addr_v['6']} -p {cfg.port} -l 12840"
+
+ rx_cmd = (f"{cfg.bin_local} -s -p {cfg.port} -i {cfg.ifname} "
+ f"-q {cfg.target} -t {cfg.num_threads}")
+ if cfg.napi_ids:
+ rx_cmd += f" -n {cfg.napi_ids}"
+ tx_cmd = (f"{cfg.bin_remote} -c -h {cfg.addr_v['6']} -p {cfg.port} "
+ f"-l 12840 -t {cfg.num_threads}")
with bkg(rx_cmd, exit_wait=True):
wait_port_listen(cfg.port, proto="tcp")
cmd(tx_cmd, host=cfg.remote)
--
2.52.0
* [PATCH v2 5/6] selftests: net: add multithread server support to iou-zcrx
From: Juanlu Herrero @ 2026-04-17 16:49 UTC
To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>
Add a multithreaded server with a two-phase architecture: a main thread
runs an epoll loop on the listening socket and dispatches each accepted
connfd to a worker thread by direct array assignment. After the accept
loop ends, a barrier release lets each worker submit one
IORING_OP_RECV_ZC SQE per assigned connfd (tagged with a connection
index in user_data) and process completions in its own io_uring CQE
loop. Each per-worker connfd array has a single writer (main, before
barrier) and a single reader (the worker, after barrier), so no
eventfd, mutex, or queue is required.
With multiple queues, connections are dispatched to the correct worker
by SO_INCOMING_NAPI_ID using a NAPI-ID-to-thread lookup table populated
via a new -n option.
Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
.../selftests/drivers/net/hw/iou-zcrx.c | 238 +++++++++++++-----
1 file changed, 171 insertions(+), 67 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 6eb738ef4b5cc..03ae5228cb5a4 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -87,8 +87,15 @@ static struct sockaddr_in6 cfg_addr;
static unsigned int cfg_rx_buf_len;
static bool cfg_dry_run;
static int cfg_num_threads = 1;
+static int cfg_napi_ids[64];
+static int cfg_num_napi_ids;
static char *payload;
+static pthread_barrier_t barrier;
+
+#define MAX_CONNS_PER_WORKER 64
+#define FIRST_ACCEPT_TIMEOUT_MS 4000
+#define ACCEPT_TIMEOUT_MS 200
struct thread_ctx {
struct io_uring ring;
@@ -97,9 +104,11 @@ struct thread_ctx {
size_t ring_size;
struct io_uring_zcrx_rq rq_ring;
unsigned long area_token;
- int connfd;
- bool stop;
- size_t received;
+ int queue_id;
+
+ int connfds[MAX_CONNS_PER_WORKER];
+ size_t received[MAX_CONNS_PER_WORKER];
+ int nr_conns;
};
static unsigned long gettimeofday_ms(void)
@@ -199,7 +208,7 @@ static void setup_zcrx(struct thread_ctx *ctx)
struct t_io_uring_zcrx_ifq_reg reg = {
.if_idx = ifindex,
- .if_rxq = cfg_queue_id,
+ .if_rxq = ctx->queue_id,
.rq_entries = rq_entries,
.area_ptr = (__u64)(unsigned long)&area_reg,
.region_ptr = (__u64)(unsigned long)&region_reg,
@@ -224,53 +233,32 @@ static void setup_zcrx(struct thread_ctx *ctx)
ctx->area_token = area_reg.rq_area_token;
}
-static void add_accept(struct thread_ctx *ctx, int sockfd)
+static void add_recvzc(struct thread_ctx *ctx, int conn_idx)
{
struct io_uring_sqe *sqe;
sqe = io_uring_get_sqe(&ctx->ring);
- io_uring_prep_accept(sqe, sockfd, NULL, NULL, 0);
- sqe->user_data = 1;
-}
-
-static void add_recvzc(struct thread_ctx *ctx, int sockfd)
-{
- struct io_uring_sqe *sqe;
-
- sqe = io_uring_get_sqe(&ctx->ring);
-
- io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
+ io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, ctx->connfds[conn_idx],
+ NULL, 0, 0);
sqe->ioprio |= IORING_RECV_MULTISHOT;
- sqe->user_data = 2;
+ sqe->user_data = conn_idx;
}
-static void add_recvzc_oneshot(struct thread_ctx *ctx, int sockfd, size_t len)
+static void add_recvzc_oneshot(struct thread_ctx *ctx, int conn_idx, size_t len)
{
struct io_uring_sqe *sqe;
sqe = io_uring_get_sqe(&ctx->ring);
- io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, len, 0);
+ io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, ctx->connfds[conn_idx],
+ NULL, len, 0);
sqe->ioprio |= IORING_RECV_MULTISHOT;
- sqe->user_data = 2;
+ sqe->user_data = conn_idx;
}
-static void process_accept(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
-{
- if (cqe->res < 0)
- error(1, 0, "accept()");
- if (ctx->connfd)
- error(1, 0, "Unexpected second connection");
-
- ctx->connfd = cqe->res;
- if (cfg_oneshot)
- add_recvzc_oneshot(ctx, ctx->connfd, page_size);
- else
- add_recvzc(ctx, ctx->connfd);
-}
-
-static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
+static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe,
+ int conn_idx)
{
unsigned rq_mask = ctx->rq_ring.ring_entries - 1;
struct io_uring_zcrx_cqe *rcqe;
@@ -281,7 +269,7 @@ static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
int i;
if (cqe->res == 0 && cqe->flags == 0 && cfg_oneshot_recvs == 0) {
- ctx->stop = true;
+ ctx->nr_conns--;
return;
}
@@ -290,11 +278,11 @@ static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
if (cfg_oneshot) {
if (cqe->res == 0 && cqe->flags == 0 && cfg_oneshot_recvs) {
- add_recvzc_oneshot(ctx, ctx->connfd, page_size);
+ add_recvzc_oneshot(ctx, conn_idx, page_size);
cfg_oneshot_recvs--;
}
} else if (!(cqe->flags & IORING_CQE_F_MORE)) {
- add_recvzc(ctx, ctx->connfd);
+ add_recvzc(ctx, conn_idx);
}
rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
@@ -304,10 +292,10 @@ static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
data = (char *)ctx->area_ptr + (rcqe->off & mask);
for (i = 0; i < n; i++) {
- if (*(data + i) != payload[(ctx->received + i)])
+ if (*(data + i) != payload[(ctx->received[conn_idx] + i)])
error(1, 0, "payload mismatch at %d", i);
}
- ctx->received += n;
+ ctx->received[conn_idx] += n;
rqe = &ctx->rq_ring.rqes[(ctx->rq_ring.rq_tail & rq_mask)];
rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | ctx->area_token;
@@ -320,28 +308,80 @@ static void server_loop(struct thread_ctx *ctx)
struct io_uring_cqe *cqe;
unsigned int count = 0;
unsigned int head;
- int i, ret;
io_uring_submit_and_wait(&ctx->ring, 1);
io_uring_for_each_cqe(&ctx->ring, head, cqe) {
- if (cqe->user_data == 1)
- process_accept(ctx, cqe);
- else if (cqe->user_data == 2)
- process_recvzc(ctx, cqe);
- else
- error(1, 0, "unknown cqe");
+ process_recvzc(ctx, cqe, cqe->user_data);
count++;
}
io_uring_cq_advance(&ctx->ring, count);
}
-static void run_server(void)
+static void *server_worker(void *arg)
{
- struct thread_ctx ctx = {};
+ struct thread_ctx *ctx = arg;
unsigned int flags = 0;
- int fd, enable, ret;
uint64_t tstop;
+ int i;
+
+ flags |= IORING_SETUP_COOP_TASKRUN;
+ flags |= IORING_SETUP_SINGLE_ISSUER;
+ flags |= IORING_SETUP_DEFER_TASKRUN;
+ flags |= IORING_SETUP_SUBMIT_ALL;
+ flags |= IORING_SETUP_CQE32;
+
+ io_uring_queue_init(512, &ctx->ring, flags);
+ setup_zcrx(ctx);
+
+ pthread_barrier_wait(&barrier);
+
+ if (cfg_dry_run)
+ return NULL;
+
+ pthread_barrier_wait(&barrier);
+
+ for (i = 0; i < ctx->nr_conns; i++) {
+ if (cfg_oneshot)
+ add_recvzc_oneshot(ctx, i, page_size);
+ else
+ add_recvzc(ctx, i);
+ }
+
+ tstop = gettimeofday_ms() + 5000;
+ while (ctx->nr_conns > 0 && gettimeofday_ms() < tstop)
+ server_loop(ctx);
+
+ if (ctx->nr_conns != 0)
+ error(1, 0, "test failed: %d connections incomplete",
+ ctx->nr_conns);
+
+ return NULL;
+}
+
+static int find_thread_by_napi(int napi_id)
+{
+ int i;
+
+ for (i = 0; i < cfg_num_napi_ids; i++) {
+ if (cfg_napi_ids[i] == napi_id)
+ return i;
+ }
+ return -1;
+}
+
+static void run_server(void)
+{
+ struct epoll_event ev = { .events = EPOLLIN };
+ int timeout_ms = FIRST_ACCEPT_TIMEOUT_MS;
+ struct thread_ctx *ctxs;
+ pthread_t *threads;
+ int fd, epfd, ret, enable, i;
+
+ ctxs = calloc(cfg_num_threads, sizeof(*ctxs));
+ threads = calloc(cfg_num_threads, sizeof(*threads));
+ if (!ctxs || !threads)
+ error(1, 0, "calloc()");
fd = socket(AF_INET6, SOCK_STREAM, 0);
if (fd == -1)
@@ -359,26 +399,78 @@ static void run_server(void)
if (listen(fd, 1024) < 0)
error(1, 0, "listen()");
- flags |= IORING_SETUP_COOP_TASKRUN;
- flags |= IORING_SETUP_SINGLE_ISSUER;
- flags |= IORING_SETUP_DEFER_TASKRUN;
- flags |= IORING_SETUP_SUBMIT_ALL;
- flags |= IORING_SETUP_CQE32;
+ pthread_barrier_init(&barrier, NULL, cfg_num_threads + 1);
- io_uring_queue_init(512, &ctx.ring, flags);
+ for (i = 0; i < cfg_num_threads; i++)
+ ctxs[i].queue_id = cfg_queue_id + i;
+
+ for (i = 0; i < cfg_num_threads; i++) {
+ ret = pthread_create(&threads[i], NULL, server_worker, &ctxs[i]);
+ if (ret)
+ error(1, ret, "pthread_create()");
+ }
+
+ pthread_barrier_wait(&barrier);
- setup_zcrx(&ctx);
if (cfg_dry_run)
- return;
+ goto join;
- add_accept(&ctx, fd);
+ epfd = epoll_create1(0);
+ if (epfd < 0)
+ error(1, errno, "epoll_create1()");
- tstop = gettimeofday_ms() + 5000;
- while (!ctx.stop && gettimeofday_ms() < tstop)
- server_loop(&ctx);
+ if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
+ error(1, errno, "epoll_ctl()");
+
+ while (1) {
+ struct epoll_event out_ev;
+ int nfds, idx, connfd;
+
+ nfds = epoll_wait(epfd, &out_ev, 1, timeout_ms);
+ if (nfds < 0)
+ error(1, errno, "epoll_wait()");
+ if (nfds == 0)
+ break;
+ timeout_ms = ACCEPT_TIMEOUT_MS;
+
+ connfd = accept(fd, NULL, NULL);
+ if (connfd < 0)
+ error(1, errno, "accept()");
+
+ if (cfg_num_napi_ids > 0) {
+ int napi_id;
+ socklen_t len = sizeof(napi_id);
+
+ ret = getsockopt(connfd, SOL_SOCKET,
+ SO_INCOMING_NAPI_ID,
+ &napi_id, &len);
+ if (ret < 0)
+ error(1, errno, "getsockopt(SO_INCOMING_NAPI_ID)");
+
+ idx = find_thread_by_napi(napi_id);
+ if (idx < 0)
+ error(1, 0, "unknown NAPI ID: %d", napi_id);
+ } else {
+ idx = 0;
+ }
+
+ if (ctxs[idx].nr_conns >= MAX_CONNS_PER_WORKER)
+ error(1, 0, "worker %d connection overflow", idx);
+ ctxs[idx].connfds[ctxs[idx].nr_conns++] = connfd;
+ }
- if (!ctx.stop)
- error(1, 0, "test failed\n");
+ close(epfd);
+
+ pthread_barrier_wait(&barrier);
+
+join:
+ for (i = 0; i < cfg_num_threads; i++)
+ pthread_join(threads[i], NULL);
+
+ pthread_barrier_destroy(&barrier);
+ close(fd);
+ free(threads);
+ free(ctxs);
}
static void *client_worker(void *arg)
@@ -438,8 +530,8 @@ static void run_client(void)
static void usage(const char *filepath)
{
error(1, 0, "Usage: %s (-4|-6) (-s|-c) -h<server_ip> -p<port> "
- "-l<payload_size> -i<ifname> -q<rxq_id> -t<num_threads>",
- filepath);
+ "-l<payload_size> -i<ifname> -q<rxq_id> -t<num_threads> "
+ "-n<napi_id_csv>", filepath);
}
static void parse_opts(int argc, char **argv)
@@ -457,7 +549,7 @@ static void parse_opts(int argc, char **argv)
usage(argv[0]);
cfg_payload_len = max_payload_len;
- while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:dt:")) != -1) {
+ while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:dt:n:")) != -1) {
switch (c) {
case 's':
if (cfg_client)
@@ -501,6 +593,18 @@ static void parse_opts(int argc, char **argv)
case 't':
cfg_num_threads = strtoul(optarg, NULL, 0);
break;
+ case 'n': {
+ char *tok, *str = optarg;
+
+ cfg_num_napi_ids = 0;
+ while ((tok = strsep(&str, ",")) != NULL) {
+ if (cfg_num_napi_ids >= 64)
+ error(1, 0, "too many NAPI IDs");
+ cfg_napi_ids[cfg_num_napi_ids++] =
+ strtoul(tok, NULL, 0);
+ }
+ break;
+ }
}
}
--
2.52.0
* [PATCH v2 4/6] selftests: net: add multithread client support to iou-zcrx
From: Juanlu Herrero @ 2026-04-17 16:49 UTC
To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>
Add pthreads to the iou-zcrx client so that multiple connections can be
established simultaneously. Each client thread connects to the server
and sends its payload independently.
Introduce the -t option to control the number of threads (default 1),
preserving backwards compatibility with existing tests.
Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
.../testing/selftests/drivers/net/hw/Makefile | 2 +-
.../selftests/drivers/net/hw/iou-zcrx.c | 38 +++++++++++++++++--
2 files changed, 36 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 85ca4d1ecf9ec..4f8c3d0b6acdb 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -83,5 +83,5 @@ include ../../../net/ynl.mk
include ../../../net/bpf.mk
ifeq ($(HAS_IOURING_ZCRX),y)
-$(OUTPUT)/iou-zcrx: LDLIBS += -luring
+$(OUTPUT)/iou-zcrx: LDLIBS += -luring -lpthread
endif
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 8dcb2f061f00a..6eb738ef4b5cc 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -4,6 +4,7 @@
#include <error.h>
#include <fcntl.h>
#include <limits.h>
+#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
@@ -85,6 +86,7 @@ static int cfg_send_size = SEND_SIZE;
static struct sockaddr_in6 cfg_addr;
static unsigned int cfg_rx_buf_len;
static bool cfg_dry_run;
+static int cfg_num_threads = 1;
static char *payload;
@@ -379,7 +381,7 @@ static void run_server(void)
error(1, 0, "test failed\n");
}
-static void run_client(void)
+static void *client_worker(void *arg)
{
ssize_t to_send = cfg_send_size;
ssize_t sent = 0;
@@ -405,12 +407,39 @@ static void run_client(void)
}
close(fd);
+ return NULL;
+}
+
+static void run_client(void)
+{
+ struct thread_ctx *ctxs;
+ pthread_t *threads;
+ int i, ret;
+
+ ctxs = calloc(cfg_num_threads, sizeof(*ctxs));
+ threads = calloc(cfg_num_threads, sizeof(*threads));
+ if (!ctxs || !threads)
+ error(1, 0, "calloc()");
+
+ for (i = 0; i < cfg_num_threads; i++) {
+ ret = pthread_create(&threads[i], NULL, client_worker,
+ &ctxs[i]);
+ if (ret)
+ error(1, ret, "pthread_create()");
+ }
+
+ for (i = 0; i < cfg_num_threads; i++)
+ pthread_join(threads[i], NULL);
+
+ free(threads);
+ free(ctxs);
}
static void usage(const char *filepath)
{
error(1, 0, "Usage: %s (-4|-6) (-s|-c) -h<server_ip> -p<port> "
- "-l<payload_size> -i<ifname> -q<rxq_id>", filepath);
+ "-l<payload_size> -i<ifname> -q<rxq_id> -t<num_threads>",
+ filepath);
}
static void parse_opts(int argc, char **argv)
@@ -428,7 +457,7 @@ static void parse_opts(int argc, char **argv)
usage(argv[0]);
cfg_payload_len = max_payload_len;
- while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:d")) != -1) {
+ while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:dt:")) != -1) {
switch (c) {
case 's':
if (cfg_client)
@@ -469,6 +498,9 @@ static void parse_opts(int argc, char **argv)
case 'd':
cfg_dry_run = true;
break;
+ case 't':
+ cfg_num_threads = strtoul(optarg, NULL, 0);
+ break;
}
}
--
2.52.0
^ permalink raw reply related
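For readers following the client change above, the create-then-join pattern it introduces can be sketched on its own. This is only an illustration under stated assumptions: struct worker_ctx, worker() and run_workers() are hypothetical names, the worker body is a placeholder for the real connect-and-send logic, and the rejection of thread counts below 1 is an addition that is not part of the patch.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread state; the selftest's struct thread_ctx differs. */
struct worker_ctx {
	int id;
	int done;
};

static void *worker(void *arg)
{
	struct worker_ctx *ctx = arg;

	/* A real client would connect and send its payload here. */
	ctx->done = 1;
	return NULL;
}

/* Returns the number of workers that completed, or -1 on setup failure. */
static int run_workers(int num_threads)
{
	struct worker_ctx *ctxs;
	pthread_t *threads;
	int i, started = 0, completed = 0;

	/* Assumption: reject counts below 1 instead of passing them to calloc(). */
	if (num_threads < 1)
		return -1;

	ctxs = calloc(num_threads, sizeof(*ctxs));
	threads = calloc(num_threads, sizeof(*threads));
	if (!ctxs || !threads) {
		free(ctxs);
		free(threads);
		return -1;
	}

	for (i = 0; i < num_threads; i++) {
		ctxs[i].id = i;
		if (pthread_create(&threads[i], NULL, worker, &ctxs[i]))
			break;
		started++;
	}

	/* Join only the threads that were actually created. */
	for (i = 0; i < started; i++) {
		pthread_join(threads[i], NULL);
		completed += ctxs[i].done;
	}

	free(threads);
	free(ctxs);
	return started == num_threads ? completed : -1;
}
```

Each worker gets its own context slot, so no locking is needed as long as a worker only touches the slot it was handed.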
* [PATCH v2 3/6] selftests: net: refactor server state into struct thread_ctx
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>
Move server-side state (io_uring ring, zcrx area, refill ring, receive
tracking) from global variables into a local struct thread_ctx. This is
a pure refactor with no behavior change: run_server still allocates a
single context on the stack and runs single-threaded, using io_uring
accept and recvzc as before.
This prepares the ground for the multithreaded server support in the
following commits, which spawn N worker threads, each with its own
struct thread_ctx.
Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
.../selftests/drivers/net/hw/iou-zcrx.c | 156 +++++++++---------
1 file changed, 80 insertions(+), 76 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index c15916311f0dd..8dcb2f061f00a 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -87,14 +87,18 @@ static unsigned int cfg_rx_buf_len;
static bool cfg_dry_run;
static char *payload;
-static void *area_ptr;
-static void *ring_ptr;
-static size_t ring_size;
-static struct io_uring_zcrx_rq rq_ring;
-static unsigned long area_token;
-static int connfd;
-static bool stop;
-static size_t received;
+
+struct thread_ctx {
+ struct io_uring ring;
+ void *area_ptr;
+ void *ring_ptr;
+ size_t ring_size;
+ struct io_uring_zcrx_rq rq_ring;
+ unsigned long area_token;
+ int connfd;
+ bool stop;
+ size_t received;
+};
static unsigned long gettimeofday_ms(void)
{
@@ -138,7 +142,7 @@ static inline size_t get_refill_ring_size(unsigned int rq_entries)
return ALIGN_UP(size, page_size);
}
-static void setup_zcrx(struct io_uring *ring)
+static void setup_zcrx(struct thread_ctx *ctx)
{
unsigned int ifindex;
unsigned int rq_entries = 4096;
@@ -149,44 +153,44 @@ static void setup_zcrx(struct io_uring *ring)
error(1, 0, "bad interface name: %s", cfg_ifname);
if (cfg_rx_buf_len && cfg_rx_buf_len != page_size) {
- area_ptr = mmap(NULL,
- AREA_SIZE,
- PROT_READ | PROT_WRITE,
- MAP_ANONYMOUS | MAP_PRIVATE |
- MAP_HUGETLB | MAP_HUGE_2MB,
- -1,
- 0);
- if (area_ptr == MAP_FAILED) {
+ ctx->area_ptr = mmap(NULL,
+ AREA_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE |
+ MAP_HUGETLB | MAP_HUGE_2MB,
+ -1,
+ 0);
+ if (ctx->area_ptr == MAP_FAILED) {
printf("Can't allocate huge pages\n");
exit(SKIP_CODE);
}
} else {
- area_ptr = mmap(NULL,
- AREA_SIZE,
- PROT_READ | PROT_WRITE,
- MAP_ANONYMOUS | MAP_PRIVATE,
- 0,
- 0);
- if (area_ptr == MAP_FAILED)
+ ctx->area_ptr = mmap(NULL,
+ AREA_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ 0,
+ 0);
+ if (ctx->area_ptr == MAP_FAILED)
error(1, 0, "mmap(): zero copy area");
}
- ring_size = get_refill_ring_size(rq_entries);
- ring_ptr = mmap(NULL,
- ring_size,
- PROT_READ | PROT_WRITE,
- MAP_ANONYMOUS | MAP_PRIVATE,
- 0,
- 0);
+ ctx->ring_size = get_refill_ring_size(rq_entries);
+ ctx->ring_ptr = mmap(NULL,
+ ctx->ring_size,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ 0,
+ 0);
struct io_uring_region_desc region_reg = {
- .size = ring_size,
- .user_addr = (__u64)(unsigned long)ring_ptr,
+ .size = ctx->ring_size,
+ .user_addr = (__u64)(unsigned long)ctx->ring_ptr,
.flags = IORING_MEM_REGION_TYPE_USER,
};
struct io_uring_zcrx_area_reg area_reg = {
- .addr = (__u64)(unsigned long)area_ptr,
+ .addr = (__u64)(unsigned long)ctx->area_ptr,
.len = AREA_SIZE,
.flags = 0,
};
@@ -200,7 +204,7 @@ static void setup_zcrx(struct io_uring *ring)
.rx_buf_len = cfg_rx_buf_len,
};
- ret = io_uring_register_ifq(ring, (void *)&reg);
+ ret = io_uring_register_ifq(&ctx->ring, (void *)&reg);
if (cfg_rx_buf_len && (ret == -EINVAL || ret == -EOPNOTSUPP ||
ret == -ERANGE)) {
printf("Large chunks are not supported %i\n", ret);
@@ -209,64 +213,64 @@ static void setup_zcrx(struct io_uring *ring)
error(1, 0, "io_uring_register_ifq(): %d", ret);
}
- rq_ring.khead = (unsigned int *)((char *)ring_ptr + reg.offsets.head);
- rq_ring.ktail = (unsigned int *)((char *)ring_ptr + reg.offsets.tail);
- rq_ring.rqes = (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
- rq_ring.rq_tail = 0;
- rq_ring.ring_entries = reg.rq_entries;
+ ctx->rq_ring.khead = (unsigned int *)((char *)ctx->ring_ptr + reg.offsets.head);
+ ctx->rq_ring.ktail = (unsigned int *)((char *)ctx->ring_ptr + reg.offsets.tail);
+ ctx->rq_ring.rqes = (struct io_uring_zcrx_rqe *)((char *)ctx->ring_ptr + reg.offsets.rqes);
+ ctx->rq_ring.rq_tail = 0;
+ ctx->rq_ring.ring_entries = reg.rq_entries;
- area_token = area_reg.rq_area_token;
+ ctx->area_token = area_reg.rq_area_token;
}
-static void add_accept(struct io_uring *ring, int sockfd)
+static void add_accept(struct thread_ctx *ctx, int sockfd)
{
struct io_uring_sqe *sqe;
- sqe = io_uring_get_sqe(ring);
+ sqe = io_uring_get_sqe(&ctx->ring);
io_uring_prep_accept(sqe, sockfd, NULL, NULL, 0);
sqe->user_data = 1;
}
-static void add_recvzc(struct io_uring *ring, int sockfd)
+static void add_recvzc(struct thread_ctx *ctx, int sockfd)
{
struct io_uring_sqe *sqe;
- sqe = io_uring_get_sqe(ring);
+ sqe = io_uring_get_sqe(&ctx->ring);
io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
sqe->ioprio |= IORING_RECV_MULTISHOT;
sqe->user_data = 2;
}
-static void add_recvzc_oneshot(struct io_uring *ring, int sockfd, size_t len)
+static void add_recvzc_oneshot(struct thread_ctx *ctx, int sockfd, size_t len)
{
struct io_uring_sqe *sqe;
- sqe = io_uring_get_sqe(ring);
+ sqe = io_uring_get_sqe(&ctx->ring);
io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, len, 0);
sqe->ioprio |= IORING_RECV_MULTISHOT;
sqe->user_data = 2;
}
-static void process_accept(struct io_uring *ring, struct io_uring_cqe *cqe)
+static void process_accept(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
{
if (cqe->res < 0)
error(1, 0, "accept()");
- if (connfd)
+ if (ctx->connfd)
error(1, 0, "Unexpected second connection");
- connfd = cqe->res;
+ ctx->connfd = cqe->res;
if (cfg_oneshot)
- add_recvzc_oneshot(ring, connfd, page_size);
+ add_recvzc_oneshot(ctx, ctx->connfd, page_size);
else
- add_recvzc(ring, connfd);
+ add_recvzc(ctx, ctx->connfd);
}
-static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
+static void process_recvzc(struct thread_ctx *ctx, struct io_uring_cqe *cqe)
{
- unsigned rq_mask = rq_ring.ring_entries - 1;
+ unsigned rq_mask = ctx->rq_ring.ring_entries - 1;
struct io_uring_zcrx_cqe *rcqe;
struct io_uring_zcrx_rqe *rqe;
uint64_t mask;
@@ -275,7 +279,7 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
int i;
if (cqe->res == 0 && cqe->flags == 0 && cfg_oneshot_recvs == 0) {
- stop = true;
+ ctx->stop = true;
return;
}
@@ -284,56 +288,56 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
if (cfg_oneshot) {
if (cqe->res == 0 && cqe->flags == 0 && cfg_oneshot_recvs) {
- add_recvzc_oneshot(ring, connfd, page_size);
+ add_recvzc_oneshot(ctx, ctx->connfd, page_size);
cfg_oneshot_recvs--;
}
} else if (!(cqe->flags & IORING_CQE_F_MORE)) {
- add_recvzc(ring, connfd);
+ add_recvzc(ctx, ctx->connfd);
}
rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
n = cqe->res;
mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
- data = (char *)area_ptr + (rcqe->off & mask);
+ data = (char *)ctx->area_ptr + (rcqe->off & mask);
for (i = 0; i < n; i++) {
- if (*(data + i) != payload[(received + i)])
+ if (*(data + i) != payload[(ctx->received + i)])
error(1, 0, "payload mismatch at %d", i);
}
- received += n;
+ ctx->received += n;
- rqe = &rq_ring.rqes[(rq_ring.rq_tail & rq_mask)];
- rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | area_token;
+ rqe = &ctx->rq_ring.rqes[(ctx->rq_ring.rq_tail & rq_mask)];
+ rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | ctx->area_token;
rqe->len = cqe->res;
- io_uring_smp_store_release(rq_ring.ktail, ++rq_ring.rq_tail);
+ io_uring_smp_store_release(ctx->rq_ring.ktail, ++ctx->rq_ring.rq_tail);
}
-static void server_loop(struct io_uring *ring)
+static void server_loop(struct thread_ctx *ctx)
{
struct io_uring_cqe *cqe;
unsigned int count = 0;
unsigned int head;
int i, ret;
- io_uring_submit_and_wait(ring, 1);
+ io_uring_submit_and_wait(&ctx->ring, 1);
- io_uring_for_each_cqe(ring, head, cqe) {
+ io_uring_for_each_cqe(&ctx->ring, head, cqe) {
if (cqe->user_data == 1)
- process_accept(ring, cqe);
+ process_accept(ctx, cqe);
else if (cqe->user_data == 2)
- process_recvzc(ring, cqe);
+ process_recvzc(ctx, cqe);
else
error(1, 0, "unknown cqe");
count++;
}
- io_uring_cq_advance(ring, count);
+ io_uring_cq_advance(&ctx->ring, count);
}
static void run_server(void)
{
+ struct thread_ctx ctx = {};
unsigned int flags = 0;
- struct io_uring ring;
int fd, enable, ret;
uint64_t tstop;
@@ -359,19 +363,19 @@ static void run_server(void)
flags |= IORING_SETUP_SUBMIT_ALL;
flags |= IORING_SETUP_CQE32;
- io_uring_queue_init(512, &ring, flags);
+ io_uring_queue_init(512, &ctx.ring, flags);
- setup_zcrx(&ring);
+ setup_zcrx(&ctx);
if (cfg_dry_run)
return;
- add_accept(&ring, fd);
+ add_accept(&ctx, fd);
tstop = gettimeofday_ms() + 5000;
- while (!stop && gettimeofday_ms() < tstop)
- server_loop(&ring);
+ while (!ctx.stop && gettimeofday_ms() < tstop)
+ server_loop(&ctx);
- if (!stop)
+ if (!ctx.stop)
error(1, 0, "test failed\n");
}
--
2.52.0
^ permalink raw reply related
* [PATCH v2 2/6] selftests: net: remove unused variable in process_recvzc()
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>
Remove unused `sqe` variable in preparation for multiqueue
rss selftest changes to process_recvzc() in the following
commit.
Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
tools/testing/selftests/drivers/net/hw/iou-zcrx.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 334985083f611..c15916311f0dd 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -269,7 +269,6 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
unsigned rq_mask = rq_ring.ring_entries - 1;
struct io_uring_zcrx_cqe *rcqe;
struct io_uring_zcrx_rqe *rqe;
- struct io_uring_sqe *sqe;
uint64_t mask;
char *data;
ssize_t n;
--
2.52.0
^ permalink raw reply related
* [PATCH v2 1/6] selftests: net: fix get_refill_ring_size() to use its local variable
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <cover.1776444379.git.juanlu@fastmail.com>
In preparation for multi-threaded rss selftests, fix
get_refill_ring_size() to use its local `size` variable
instead of the global `ring_size`.
Signed-off-by: Juanlu Herrero <juanlu@fastmail.com>
---
tools/testing/selftests/drivers/net/hw/iou-zcrx.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 240d13dbc54e7..334985083f611 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -132,10 +132,10 @@ static inline size_t get_refill_ring_size(unsigned int rq_entries)
{
size_t size;
- ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe);
+ size = rq_entries * sizeof(struct io_uring_zcrx_rqe);
/* add space for the header (head/tail/etc.) */
- ring_size += page_size;
- return ALIGN_UP(ring_size, page_size);
+ size += page_size;
+ return ALIGN_UP(size, page_size);
}
static void setup_zcrx(struct io_uring *ring)
--
2.52.0
^ permalink raw reply related
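The size computation in the patch above can be tried in isolation. The following is a rough sketch, not the selftest code: struct rqe_sketch is a stand-in whose layout is an assumption, and page_size is passed as a parameter here rather than read from a global.

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN_UP(v, align) (((v) + (align) - 1) & ~((size_t)(align) - 1))

/* Stand-in for struct io_uring_zcrx_rqe; the layout here is an assumption. */
struct rqe_sketch {
	uint64_t off;
	uint32_t len;
	uint32_t pad;
};

/* Depends only on its arguments, so concurrent callers cannot interfere. */
static size_t refill_ring_size(unsigned int rq_entries, size_t page_size)
{
	size_t size;

	size = rq_entries * sizeof(struct rqe_sketch);
	/* add space for the header (head/tail/etc.) */
	size += page_size;
	return ALIGN_UP(size, page_size);
}
```

Keeping the intermediate value in a local is what makes the helper safe to call from several threads, which is the stated motivation for the fix.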
* [PATCH v2 0/6] selftests: net: multithread + rss_multiqueue support for iou-zcrx
From: Juanlu Herrero @ 2026-04-17 16:49 UTC (permalink / raw)
To: dw, netdev; +Cc: kuba, Juanlu Herrero
In-Reply-To: <20260408163816.2760-1-juanlu@fastmail.com>
Add multithread support to the iou-zcrx selftest, plus a new
rss_multiqueue Python variant that exercises multi-queue zero-copy
receive on a single listening socket with NAPI-ID-based dispatch.
v2:
- merge iou-zcrx.c server changes, leaving iou-zcrx.py changes in the
last patch (David)
- Refactor server state into struct thread_ctx as a separate
patch for a cleaner impl of the server side.
- Rework server: main-thread epoll accepts an arbitrary number
of connections; SO_INCOMING_NAPI_ID dispatches each to its worker.
(David)
- Drop unused thread_id field (David)
- rss_multiqueue: use a single listening port with an RSS context
spanning N queues; query NAPI IDs at runtime via netlink
queue_get(); pass them to the binary via a new -n option
Link: https://lore.kernel.org/netdev/20260408163816.2760-1-juanlu@fastmail.com/
Juanlu Herrero (6):
selftests: net: fix get_refill_ring_size() to use its local variable
selftests: net: remove unused variable in process_recvzc()
selftests: net: refactor server state into struct thread_ctx
selftests: net: add multithread client support to iou-zcrx
selftests: net: add multithread server support to iou-zcrx
selftests: net: add rss_multiqueue test variant to iou-zcrx
.../testing/selftests/drivers/net/hw/Makefile | 2 +-
.../selftests/drivers/net/hw/iou-zcrx.c | 379 ++++++++++++------
.../selftests/drivers/net/hw/iou-zcrx.py | 59 ++-
3 files changed, 317 insertions(+), 123 deletions(-)
--
2.52.0
^ permalink raw reply
* [PATCH] fixup! net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Fidelio Lawson @ 2026-04-17 16:39 UTC (permalink / raw)
To: netdev; +Cc: Marek Vasut, Andrew Lunn, Woojung Huh, Fidelio Lawson
In-Reply-To: <20260417-ksz87xx_errata_low_loss_connections-v4-1-6c7044ec4363@exotec.com>
Fixes: e66f840c08a2 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
---
drivers/net/dsa/microchip/ksz8.c | 6 ++++++
drivers/net/dsa/microchip/ksz8_reg.h | 3 +++
2 files changed, 9 insertions(+)
diff --git a/drivers/net/dsa/microchip/ksz8.c b/drivers/net/dsa/microchip/ksz8.c
index 0f2b8acee80f..62fc59c3da7e 100644
--- a/drivers/net/dsa/microchip/ksz8.c
+++ b/drivers/net/dsa/microchip/ksz8.c
@@ -1297,6 +1297,9 @@ int ksz8_w_phy(struct ksz_device *dev, u16 phy, u16 reg, u16 val)
case PHY_REG_KSZ87XX_LPF_BW:
if (!ksz_is_ksz87xx(dev))
return -EOPNOTSUPP;
+ /* Only accept LPF bandwidth bits [7:6] */
+ if (val & ~KSZ87XX_LPF_VALID_MASK)
+ return -EINVAL;
ret = ksz8_ind_write8(dev, TABLE_LINK_MD, KSZ87XX_REG_PHY_LPF, (u8)val);
if (ret)
return ret;
@@ -1305,6 +1308,9 @@ int ksz8_w_phy(struct ksz_device *dev, u16 phy, u16 reg, u16 val)
case PHY_REG_KSZ87XX_EQ_INIT:
if (!ksz_is_ksz87xx(dev))
return -EOPNOTSUPP;
+ /* Only accept DSP EQ initial value bits [5:0] */
+ if (val & ~KSZ87XX_DSP_EQ_VALID_MASK)
+ return -EINVAL;
ret = ksz8_ind_write8(dev, TABLE_LINK_MD, KSZ87XX_REG_DSP_EQ, (u8)val);
if (ret)
return ret;
diff --git a/drivers/net/dsa/microchip/ksz8_reg.h b/drivers/net/dsa/microchip/ksz8_reg.h
index 5df17c463f7c..cd41214f874e 100644
--- a/drivers/net/dsa/microchip/ksz8_reg.h
+++ b/drivers/net/dsa/microchip/ksz8_reg.h
@@ -206,6 +206,9 @@
#define KSZ87XX_REG_DSP_EQ 0x08 /* DSP EQ initial value */
#define KSZ87XX_REG_PHY_LPF 0x4C /* RX LPF bandwidth */
+#define KSZ87XX_DSP_EQ_VALID_MASK GENMASK(5, 0)
+#define KSZ87XX_LPF_VALID_MASK GENMASK(7, 6)
+
/* For KSZ8765. */
#define PORT_REMOTE_ASYM_PAUSE BIT(5)
#define PORT_REMOTE_SYM_PAUSE BIT(4)
--
2.53.0
^ permalink raw reply related
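The validation added by the fixup above rejects any value with bits set outside the documented field. A small userspace sketch of the same check follows; GENMASK_U16() and check_reg_value() are hypothetical helpers written for this illustration, not kernel APIs.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK(h, l), limited to 16 bits. */
#define GENMASK_U16(h, l) \
	((uint16_t)(((1u << ((h) - (l) + 1)) - 1u) << (l)))

#define LPF_VALID_MASK    GENMASK_U16(7, 6)  /* bits [7:6] */
#define DSP_EQ_VALID_MASK GENMASK_U16(5, 0)  /* bits [5:0] */

/* Returns 0 if val only uses bits inside mask, -EINVAL otherwise. */
static int check_reg_value(uint16_t val, uint16_t mask)
{
	if (val & ~mask)
		return -EINVAL;
	return 0;
}
```

Because the check runs on the full 16-bit value before the (u8) cast, a value such as 0x1C0 is rejected rather than silently truncated to 0xC0.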
* Re: [PATCH] fixup! net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Fidelio LAWSON @ 2026-04-17 16:30 UTC (permalink / raw)
To: Sai Krishna Gajula, netdev@vger.kernel.org
Cc: Marek Vasut, Andrew Lunn, Woojung Huh, Fidelio Lawson
In-Reply-To: <BYAPR18MB3735885B13017500E153FEA2A0202@BYAPR18MB3735.namprd18.prod.outlook.com>
On 4/17/26 18:10, Sai Krishna Gajula wrote:
>> -----Original Message-----
>> From: Fidelio Lawson <lawson.fidelio@gmail.com>
>> Sent: Friday, April 17, 2026 9:20 PM
>> To: netdev@vger.kernel.org
>> Cc: Marek Vasut <marex@nabladev.com>; Andrew Lunn <andrew@lunn.ch>;
>> Woojung Huh <woojung.huh@microchip.com>; Fidelio Lawson
>> <fidelio.lawson@exotec.com>
>> Subject: [PATCH] fixup! net: dsa: microchip: implement KSZ87xx
>> Module 3 low-loss cable errata
>
> Since this erratum fix is being pushed to "net", a Fixes tag may be required.
>
Good point, thanks for spotting this.
I’ll add an appropriate Fixes tag referencing the commit that introduced
the KSZ87xx support, and follow up with an updated fixup.
Thanks
^ permalink raw reply
* Re: [PATCH bpf v3 2/2] selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks
From: Martin KaFai Lau @ 2026-04-17 16:25 UTC (permalink / raw)
To: KaFai Wan
Cc: daniel, john.fastabend, sdf, ast, andrii, eddyz87, memxor, song,
yonghong.song, jolsa, davem, edumazet, kuba, pabeni, horms, shuah,
jiayuan.chen, bpf, netdev, linux-kernel, linux-kselftest
In-Reply-To: <20260417092035.2299913-3-kafai.wan@linux.dev>
On Fri, Apr 17, 2026 at 05:20:35PM +0800, KaFai Wan wrote:
> diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
> index 56685fc03c7e..7b9dbbb84316 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
> @@ -461,7 +461,7 @@ static void misc(void)
> const unsigned int nr_data = 2;
> struct bpf_link *link;
> struct sk_fds sk_fds;
> - int i, ret;
> + int i, ret, true_val = 1;
>
> lport_linum_map_fd = bpf_map__fd(misc_skel->maps.lport_linum_map);
>
> @@ -477,6 +477,10 @@ static void misc(void)
> return;
> }
>
> + ret = setsockopt(sk_fds.active_fd, SOL_TCP, TCP_NODELAY, &true_val, sizeof(true_val));
Same comment as in v2. Why is this setsockopt needed?
The setsockopt in userspace is unnecessary. In the future,
we may need to understand why it is needed here in the first place.
^ permalink raw reply
* Re: [PATCH for-7.1-fixes 1/2] rhashtable: add no_sync_grow option
From: Tejun Heo @ 2026-04-17 16:25 UTC (permalink / raw)
To: Herbert Xu
Cc: Thomas Graf, David Vernet, Andrea Righi, Changwoo Min,
Emil Tsalapatis, linux-crypto, sched-ext, linux-kernel,
Florian Westphal, netdev
In-Reply-To: <aeHmeAz-Z-Rx2MqX@gondor.apana.org.au>
Hello,
On Fri, Apr 17, 2026 at 03:51:20PM +0800, Herbert Xu wrote:
> rhashtable originated in networking where it tries very hard to
> stop the hash table from ever degenerating into a linked list.
I see.
> If your use-case is not as adversarial as that, and you're happy
> for the hash table to degenerate into a linked-list in the worst
> case, then yes it's absolutely fine to not grow the table (or
> try to grow it and fail with kmalloc_nolock).
My use case is a bit different. I want a resizable hashtable which can be
used under raw spinlock and doesn't fail unnecessarily. My only adversary is
memory pressure and operation failures can be harmful, i.e. if the system is
under severe memory pressure, hashtable becoming temporarily slower is not a
big problem as long as it restores reasonable operation once the system
recovers. However, if the insertion operation fails under e.g. sudden
network rx burst that drains atomic reserve, that can lead to fatal failure
- e.g. forks failing out of the blue on a busy but mostly okay system. I think
this pretty much requires all hashtable growths to be asynchronous.
> It's just that we haven't had any users like this until now and
> the feature that you want got removed because of that.
>
> I'm more than happy to bring it back (commit 5f8ddeab10ce).
That'd be great but looking at the commit, I'm not sure it reliably avoids
allocation in the synchronous path.
Thanks.
--
tejun
^ permalink raw reply
* [PATCH 2/2 nf] netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check
From: Fernando Fernandez Mancera @ 2026-04-17 16:20 UTC (permalink / raw)
To: netfilter-devel
Cc: netdev, coreteam, pablo, fw, phil, Fernando Fernandez Mancera,
Kito Xu (veritas501)
In-Reply-To: <20260417162057.3732-1-fmancera@suse.de>
The nf_osf_ttl() function accessed skb->dev to perform a local interface
address lookup without verifying that the device pointer was valid.
Additionally, the implementation utilized an in_dev_for_each_ifa_rcu
loop to match the packet source address against local interface
addresses. It assumed that packets from the same subnet should not see a
decrement on the initial TTL. However, a packet might appear to come
from the same subnet when it actually does not, especially in modern
environments with containers and virtual switching.
Remove the device dereference and interface loop. Replace the logic with
a switch statement that evaluates the TTL according to the ttl_check.
Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Kito Xu (veritas501) <hxzene@gmail.com>
Closes: https://lore.kernel.org/netfilter-devel/20260414074556.2512750-1-hxzene@gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
Note: if some help is needed during the backport I can assist.
---
net/netfilter/nfnetlink_osf.c | 22 +++++++---------------
1 file changed, 7 insertions(+), 15 deletions(-)
diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c
index f58267986453..f0d1e596e146 100644
--- a/net/netfilter/nfnetlink_osf.c
+++ b/net/netfilter/nfnetlink_osf.c
@@ -31,26 +31,18 @@ EXPORT_SYMBOL_GPL(nf_osf_fingers);
static inline int nf_osf_ttl(const struct sk_buff *skb,
int ttl_check, unsigned char f_ttl)
{
- struct in_device *in_dev = __in_dev_get_rcu(skb->dev);
const struct iphdr *ip = ip_hdr(skb);
- const struct in_ifaddr *ifa;
- int ret = 0;
- if (ttl_check == NF_OSF_TTL_TRUE)
+ switch (ttl_check) {
+ case NF_OSF_TTL_TRUE:
return ip->ttl == f_ttl;
- if (ttl_check == NF_OSF_TTL_NOCHECK)
- return 1;
- else if (ip->ttl <= f_ttl)
+ break;
+ case NF_OSF_TTL_NOCHECK:
return 1;
-
- in_dev_for_each_ifa_rcu(ifa, in_dev) {
- if (inet_ifa_match(ip->saddr, ifa)) {
- ret = (ip->ttl == f_ttl);
- break;
- }
+ case NF_OSF_TTL_LESS:
+ default:
+ return ip->ttl <= f_ttl;
}
-
- return ret;
}
struct nf_osf_hdr_ctx {
--
2.53.0
^ permalink raw reply related
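The simplified TTL logic in the patch above reduces to a pure function of the check mode and the two TTL values. A minimal sketch, assuming local stand-ins for the NF_OSF_TTL_* modes (the enum below and its numeric values are not taken from the kernel headers):

```c
#include <stdbool.h>

/* Local stand-ins for the NF_OSF_TTL_* modes; numeric values are assumptions. */
enum ttl_mode {
	TTL_TRUE,
	TTL_LESS,
	TTL_NOCHECK,
};

static bool ttl_matches(enum ttl_mode mode, unsigned char pkt_ttl,
			unsigned char f_ttl)
{
	switch (mode) {
	case TTL_TRUE:
		/* Exact comparison of packet and fingerprint TTL. */
		return pkt_ttl == f_ttl;
	case TTL_NOCHECK:
		return true;
	case TTL_LESS:
	default:
		/* Packet TTL may have been decremented along the path. */
		return pkt_ttl <= f_ttl;
	}
}
```

With no device or address lookup involved, the result no longer depends on skb->dev at all, which is what removes the NULL dereference.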
* Re: [Intel-wired-lan] [PATCH iwl-net v2] igc: fix potential skb leak in igc_fpe_xmit_smd_frame()
From: Kohei Enju @ 2026-04-17 16:20 UTC (permalink / raw)
To: Simon Horman
Cc: intel-wired-lan, netdev, Tony Nguyen, Przemek Kitszel,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Faizal Rahim, kohei.enju, stable
In-Reply-To: <20260417115122.GA31784@horms.kernel.org>
On 04/17 12:51, Simon Horman wrote:
> On Wed, Apr 15, 2026 at 02:52:18AM +0000, Kohei Enju wrote:
> > When igc_fpe_init_tx_descriptor() fails, no one takes care of an
> > allocated skb, leaking it. [1]
> > Use dev_kfree_skb_any() on failure.
> >
> > Tested on an I226 adapter with the following command, while injecting
> > faults in igc_fpe_init_tx_descriptor() to trigger the error path.
> > # ethtool --set-mm $DEV verify-enabled on tx-enabled on pmac-enabled on
> >
> > [1]
> > unreferenced object 0xffff888113c6cdc0 (size 224):
> > ...
> > backtrace (crc be3d3fda):
> > kmem_cache_alloc_node_noprof+0x3b1/0x410
> > __alloc_skb+0xde/0x830
> > igc_fpe_xmit_smd_frame.isra.0+0xad/0x1b0
> > igc_fpe_send_mpacket+0x37/0x90
> > ethtool_mmsv_verify_timer+0x15e/0x300
> >
> > Cc: stable@vger.kernel.org
> > Fixes: 5422570c0010 ("igc: add support for frame preemption verification")
> > Signed-off-by: Kohei Enju <kohei@enjuk.jp>
> > ---
> > Changes:
> > v2:
> > - change to idiomatic style with goto (Simon)
> > - add Cc to stable (Alex)
> > - add reproduction steps (Alex)
> > v1: https://lore.kernel.org/all/20260329145122.126040-1-kohei@enjuk.jp/
>
> Thanks for the update.
>
> Reviewed-by: Simon Horman <horms@kernel.org>
>
> Sashiko has comments about a potential existing bug in the same code path.
> I'd appreciate it if, as a follow-up, you could look over that.
Thanks for the heads-up. I'll look into it.
>
> Thanks!
^ permalink raw reply
* [PATCH 1/2 nf] netfilter: nfnetlink_osf: fix out-of-bounds read on option matching
From: Fernando Fernandez Mancera @ 2026-04-17 16:20 UTC (permalink / raw)
To: netfilter-devel
Cc: netdev, coreteam, pablo, fw, phil, Fernando Fernandez Mancera
In nf_osf_match(), the nf_osf_hdr_ctx structure is initialized once
and passed by reference to nf_osf_match_one() for each fingerprint
checked. During TCP option parsing, nf_osf_match_one() advances the
shared ctx->optp pointer.
If a fingerprint perfectly matches, the function returns early without
restoring ctx->optp to its initial state. If the user has configured
NF_OSF_LOGLEVEL_ALL, the loop continues to the next fingerprint.
However, because ctx->optp was not restored, the next call to
nf_osf_match_one() starts parsing from the end of the options buffer.
This causes subsequent matches to read garbage data and fail
immediately, making it impossible to log more than one match, or
causing incorrect matches to be logged.
Instead of using a shared ctx->optp pointer, pass the context as a
constant pointer and use a local pointer (optp) for TCP option
traversal. This makes nf_osf_match_one() strictly stateless from the
caller's perspective, ensuring every fingerprint check starts at the
correct option offset.
Fixes: 1a6a0951fc00 ("netfilter: nfnetlink_osf: add missing fmatch check")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
net/netfilter/nfnetlink_osf.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c
index 45d9ad231a92..f58267986453 100644
--- a/net/netfilter/nfnetlink_osf.c
+++ b/net/netfilter/nfnetlink_osf.c
@@ -64,9 +64,9 @@ struct nf_osf_hdr_ctx {
static bool nf_osf_match_one(const struct sk_buff *skb,
const struct nf_osf_user_finger *f,
int ttl_check,
- struct nf_osf_hdr_ctx *ctx)
+ const struct nf_osf_hdr_ctx *ctx)
{
- const __u8 *optpinit = ctx->optp;
+ const __u8 *optp = ctx->optp;
unsigned int check_WSS = 0;
int fmatch = FMATCH_WRONG;
int foptsize, optnum;
@@ -95,17 +95,17 @@ static bool nf_osf_match_one(const struct sk_buff *skb,
check_WSS = f->wss.wc;
for (optnum = 0; optnum < f->opt_num; ++optnum) {
- if (f->opt[optnum].kind == *ctx->optp) {
+ if (f->opt[optnum].kind == *optp) {
__u32 len = f->opt[optnum].length;
- const __u8 *optend = ctx->optp + len;
+ const __u8 *optend = optp + len;
fmatch = FMATCH_OK;
- switch (*ctx->optp) {
+ switch (*optp) {
case OSFOPT_MSS:
- mss = ctx->optp[3];
+ mss = optp[3];
mss <<= 8;
- mss |= ctx->optp[2];
+ mss |= optp[2];
mss = ntohs((__force __be16)mss);
break;
@@ -113,7 +113,7 @@ static bool nf_osf_match_one(const struct sk_buff *skb,
break;
}
- ctx->optp = optend;
+ optp = optend;
} else
fmatch = FMATCH_OPT_WRONG;
@@ -156,9 +156,6 @@ static bool nf_osf_match_one(const struct sk_buff *skb,
}
}
- if (fmatch != FMATCH_OK)
- ctx->optp = optpinit;
-
return fmatch == FMATCH_OK;
}
--
2.53.0
^ permalink raw reply related
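The local-cursor approach described in the patch above can be illustrated outside the kernel. This is a simplified sketch with hypothetical types (struct opt_ctx, struct opt_pat, match_one()); it only compares option kinds and lengths, adds its own bounds checks, and does not reproduce the real fingerprint matching.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read-only context shared across all fingerprint checks. */
struct opt_ctx {
	const uint8_t *optp;	/* start of the option bytes */
	size_t optsize;		/* number of option bytes */
};

/* Hypothetical expected option: kind plus total encoded length. */
struct opt_pat {
	uint8_t kind;
	uint8_t length;
};

static bool match_one(const struct opt_ctx *ctx,
		      const struct opt_pat *pat, size_t npat)
{
	const uint8_t *optp = ctx->optp;	/* local cursor, ctx untouched */
	const uint8_t *end = ctx->optp + ctx->optsize;
	size_t i;

	for (i = 0; i < npat; i++) {
		/* Refuse to read or advance past the end of the options. */
		if (pat[i].length == 0 || (size_t)(end - optp) < pat[i].length)
			return false;
		if (*optp != pat[i].kind)
			return false;
		optp += pat[i].length;
	}
	return optp == end;
}
```

Because the context is const and the cursor is local, calling match_one() repeatedly with different patterns always starts from the first option byte, regardless of how earlier calls ended.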
* RE: [PATCH] fixup! net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Sai Krishna Gajula @ 2026-04-17 16:10 UTC (permalink / raw)
To: Fidelio Lawson, netdev@vger.kernel.org
Cc: Marek Vasut, Andrew Lunn, Woojung Huh, Fidelio Lawson
In-Reply-To: <20260417155025.488290-1-fidelio.lawson@exotec.com>
> -----Original Message-----
> From: Fidelio Lawson <lawson.fidelio@gmail.com>
> Sent: Friday, April 17, 2026 9:20 PM
> To: netdev@vger.kernel.org
> Cc: Marek Vasut <marex@nabladev.com>; Andrew Lunn <andrew@lunn.ch>;
> Woojung Huh <woojung.huh@microchip.com>; Fidelio Lawson
> <fidelio.lawson@exotec.com>
> Subject: [PATCH] fixup! net: dsa: microchip: implement KSZ87xx
> Module 3 low-loss cable errata
Since this erratum fix is being pushed to "net", a Fixes tag may be required.
>
> ---
> drivers/net/dsa/microchip/ksz8.c | 6 ++++++
> drivers/net/dsa/microchip/ksz8_reg.h | 3 +++
> 2 files changed, 9 insertions(+)
>
> diff --git a/drivers/net/dsa/microchip/ksz8.c
> b/drivers/net/dsa/microchip/ksz8.c
> index 0f2b8acee80f..62fc59c3da7e 100644
> --- a/drivers/net/dsa/microchip/ksz8.c
> +++ b/drivers/net/dsa/microchip/ksz8.c
> @@ -1297,6 +1297,9 @@ int ksz8_w_phy(struct ksz_device *dev, u16 phy, u16
> reg, u16 val)
> case PHY_REG_KSZ87XX_LPF_BW:
> if (!ksz_is_ksz87xx(dev))
> return -EOPNOTSUPP;
> + /* Only accept LPF bandwidth bits [7:6] */
> + if (val & ~KSZ87XX_LPF_VALID_MASK)
> + return -EINVAL;
> ret = ksz8_ind_write8(dev, TABLE_LINK_MD,
> KSZ87XX_REG_PHY_LPF, (u8)val);
> if (ret)
> return ret;
> @@ -1305,6 +1308,9 @@ int ksz8_w_phy(struct ksz_device *dev, u16 phy, u16
> reg, u16 val)
> case PHY_REG_KSZ87XX_EQ_INIT:
> if (!ksz_is_ksz87xx(dev))
> return -EOPNOTSUPP;
> + /* Only accept DSP EQ initial value bits [5:0] */
> + if (val & ~KSZ87XX_DSP_EQ_VALID_MASK)
> + return -EINVAL;
> ret = ksz8_ind_write8(dev, TABLE_LINK_MD,
> KSZ87XX_REG_DSP_EQ, (u8)val);
> if (ret)
> return ret;
> diff --git a/drivers/net/dsa/microchip/ksz8_reg.h
> b/drivers/net/dsa/microchip/ksz8_reg.h
> index 5df17c463f7c..cd41214f874e 100644
> --- a/drivers/net/dsa/microchip/ksz8_reg.h
> +++ b/drivers/net/dsa/microchip/ksz8_reg.h
> @@ -206,6 +206,9 @@
> #define KSZ87XX_REG_DSP_EQ 0x08 /* DSP EQ initial value
> */
> #define KSZ87XX_REG_PHY_LPF 0x4C /* RX LPF
> bandwidth */
>
> +#define KSZ87XX_DSP_EQ_VALID_MASK GENMASK(5, 0)
> +#define KSZ87XX_LPF_VALID_MASK GENMASK(7, 6)
> +
> /* For KSZ8765. */
> #define PORT_REMOTE_ASYM_PAUSE BIT(5)
> #define PORT_REMOTE_SYM_PAUSE BIT(4)
> --
> 2.53.0
>
^ permalink raw reply
* Re: [PATCH net v3 2/4] nfc: llcp: fix TLV parsing in parse_gb_tlv and parse_connection_tlv
From: Simon Horman @ 2026-04-17 16:04 UTC (permalink / raw)
To: Lekë Hapçiu
Cc: netdev, davem, edumazet, kuba, pabeni, linux-kernel, stable,
Lekë Hapçiu
In-Reply-To: <20260414233534.55973-3-snowwlake@icloud.com>
On Wed, Apr 15, 2026 at 01:35:31AM +0200, Lekë Hapçiu wrote:
> From: Lekë Hapçiu <framemain@outlook.com>
>
> nfc_llcp_parse_gb_tlv() and nfc_llcp_parse_connection_tlv() walk TLV
> arrays whose length and content come from a peer-supplied frame. The
> parsing loop has three weaknesses:
>
> 1. `offset` is declared u8 while `tlv_array_len` is u16. In
> parse_connection_tlv() the TLV array can reach ~2173 bytes (MIUX
> up to 0x7FF), so 128 zero-length TLVs wrap `offset` back to 0 and
> the loop never terminates while `tlv` advances past the buffer.
>
> 2. The guard `offset < tlv_array_len` only proves one byte is
> available, but the body reads tlv[0] (type) and tlv[1] (length).
> When one byte remains, tlv[1] is out of bounds.
>
> 3. `length` is read from peer data and used to advance `tlv` without
> being checked against the remaining array space. A crafted length
> walks `tlv` past the buffer; the next iteration reads tlv[0]/tlv[1]
> from adjacent memory.
>
> The llcp_tlv8() and llcp_tlv16() accessors additionally read tlv[2]
> and tlv[2..3]; a zero-length TLV makes those reads out of bounds.
>
> Fix: promote `offset` to u16; add two per-iteration guards, one for
> the TLV header and one for the TLV value; require length >= 1 for all
> TLVs before the type dispatch and length >= 2 for the llcp_tlv16()
> accessors (MIUX, WKS). Return -EINVAL on malformed input.
>
> Reached on ATR_RES (parse_gb_tlv) and on CONNECT/CC PDUs before a
> connection is established (parse_connection_tlv). Both are
> triggerable from any NFC peer within ~4 cm, without authentication.
As per my comment on patch 1/4, I don't understand the relationship
between the last sentence above and this patch.
>
> Reported-by: Simon Horman <horms@kernel.org>
> Fixes: d646960f7986 ("NFC: Add LLCP sockets")
I think the hash but not the subject is correct in the fixes line.
IOW, I think this should be:
Fixes: d646960f7986 ("NFC: Initial LLCP support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Lekë Hapçiu <framemain@outlook.com>
Otherwise, looks good to me.
While looking over this I noticed that nfc_llcp_connect_sn() seems
to have the same kind of problem. You may wish to address that as
a follow-up.
...
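For illustration only, the two per-iteration guards described in the
commit message can be sketched in userspace C as below. This is not the
kernel code: tlv_walk_checked() is a hypothetical name, the type dispatch
is left out, and the stricter length >= 2 rule for the llcp_tlv16()
accessors is not modelled.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a kernel function): walk a TLV array and
 * return 0 if every TLV is well-formed, -EINVAL otherwise.
 */
static int tlv_walk_checked(const uint8_t *tlv_array, uint16_t tlv_array_len)
{
	uint16_t offset = 0;	/* 16 bits wide, like tlv_array_len */

	while (offset < tlv_array_len) {
		const uint8_t *tlv = tlv_array + offset;
		uint8_t length;

		/* Guard 1: both the type and the length byte must be in bounds. */
		if (tlv_array_len - offset < 2)
			return -EINVAL;

		length = tlv[1];

		/* Guard 2: reject zero-length TLVs and values that overrun
		 * the remaining array space.
		 */
		if (length < 1 || length > tlv_array_len - offset - 2)
			return -EINVAL;

		/* A real parser would dispatch on tlv[0] here. */
		offset += 2 + length;
	}

	return 0;
}
```

Because length is checked against the remaining space before offset is
advanced, offset can never exceed tlv_array_len, so the addition cannot
wrap.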
^ permalink raw reply
* Re: [PATCH v7 0/5] netem: bug fixes
From: Simon Horman @ 2026-04-17 16:02 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20260415142822.133241-1-stephen@networkplumber.org>
On Wed, Apr 15, 2026 at 07:27:03AM -0700, Stephen Hemminger wrote:
> These bugs were found when doing AI assisted review of sch_netem.c
> during investigation of the packet duplication recursion problem
> addressed in Jamal's series.
>
> The fixes cover:
>
> - probability gaps in the 4-state Markov loss model
> - queue limit not accounting for reordered packets
> - PRNG reseeded on every tc change, breaking reproducibility
> - slot delay configuration not validated for inverted ranges
> - slot delay arithmetic overflow for ranges above ~2.1 seconds
>
> v7 - queue limit check Fixes: goes back further to earlier change
> - use NL_SET_ERR_MSG_ATTR
>
> Stephen Hemminger (5):
> net/sched: netem: fix probability gaps in 4-state loss model
> net/sched: netem: fix queue limit check to include reordered packets
> net/sched: netem: only reseed PRNG when seed is explicitly provided
> net/sched: netem: check for invalid slot range
> net/sched: netem: fix slot delay calculation overflow
To the maintainers: I'd like to ask for more time to complete review of this.
^ permalink raw reply
* [PATCH] fixup! net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Fidelio Lawson @ 2026-04-17 15:50 UTC (permalink / raw)
To: netdev; +Cc: Marek Vasut, Andrew Lunn, Woojung Huh, Fidelio Lawson
In-Reply-To: <20260417-ksz87xx_errata_low_loss_connections-v4-1-6c7044ec4363@exotec.com>
---
drivers/net/dsa/microchip/ksz8.c | 6 ++++++
drivers/net/dsa/microchip/ksz8_reg.h | 3 +++
2 files changed, 9 insertions(+)
diff --git a/drivers/net/dsa/microchip/ksz8.c b/drivers/net/dsa/microchip/ksz8.c
index 0f2b8acee80f..62fc59c3da7e 100644
--- a/drivers/net/dsa/microchip/ksz8.c
+++ b/drivers/net/dsa/microchip/ksz8.c
@@ -1297,6 +1297,9 @@ int ksz8_w_phy(struct ksz_device *dev, u16 phy, u16 reg, u16 val)
case PHY_REG_KSZ87XX_LPF_BW:
if (!ksz_is_ksz87xx(dev))
return -EOPNOTSUPP;
+ /* Only accept LPF bandwidth bits [7:6] */
+ if (val & ~KSZ87XX_LPF_VALID_MASK)
+ return -EINVAL;
ret = ksz8_ind_write8(dev, TABLE_LINK_MD, KSZ87XX_REG_PHY_LPF, (u8)val);
if (ret)
return ret;
@@ -1305,6 +1308,9 @@ int ksz8_w_phy(struct ksz_device *dev, u16 phy, u16 reg, u16 val)
case PHY_REG_KSZ87XX_EQ_INIT:
if (!ksz_is_ksz87xx(dev))
return -EOPNOTSUPP;
+ /* Only accept DSP EQ initial value bits [5:0] */
+ if (val & ~KSZ87XX_DSP_EQ_VALID_MASK)
+ return -EINVAL;
ret = ksz8_ind_write8(dev, TABLE_LINK_MD, KSZ87XX_REG_DSP_EQ, (u8)val);
if (ret)
return ret;
diff --git a/drivers/net/dsa/microchip/ksz8_reg.h b/drivers/net/dsa/microchip/ksz8_reg.h
index 5df17c463f7c..cd41214f874e 100644
--- a/drivers/net/dsa/microchip/ksz8_reg.h
+++ b/drivers/net/dsa/microchip/ksz8_reg.h
@@ -206,6 +206,9 @@
#define KSZ87XX_REG_DSP_EQ 0x08 /* DSP EQ initial value */
#define KSZ87XX_REG_PHY_LPF 0x4C /* RX LPF bandwidth */
+#define KSZ87XX_DSP_EQ_VALID_MASK GENMASK(5, 0)
+#define KSZ87XX_LPF_VALID_MASK GENMASK(7, 6)
+
/* For KSZ8765. */
#define PORT_REMOTE_ASYM_PAUSE BIT(5)
#define PORT_REMOTE_SYM_PAUSE BIT(4)
--
2.53.0
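For illustration only, the effect of the two masks can be checked in
userspace with the sketch below. SKETCH_GENMASK() is a stand-in for the
kernel's GENMASK() restricted to 32-bit values, and sketch_check_val() is
a hypothetical helper that mirrors the added test; neither exists in the
driver.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK(h, l): bits h..l set. */
#define SKETCH_GENMASK(h, l)	((~0u >> (31 - (h))) & (~0u << (l)))

#define SKETCH_DSP_EQ_VALID_MASK	SKETCH_GENMASK(5, 0)	/* 0x3f */
#define SKETCH_LPF_VALID_MASK		SKETCH_GENMASK(7, 6)	/* 0xc0 */

/* Hypothetical helper: reject any value with bits outside valid_mask. */
static int sketch_check_val(uint16_t val, unsigned int valid_mask)
{
	if (val & ~valid_mask)
		return -EINVAL;

	return 0;
}
```

Since val is a u16 that is later cast to u8, the check also rejects
values such as 0x100 that the cast would otherwise silently truncate.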
^ permalink raw reply related
* Re: [PATCH net v2 1/2] bnge: fix initial HWRM sequence
From: Vikas Gupta @ 2026-04-17 15:47 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, edumazet, pabeni, andrew+netdev, horms, netdev,
linux-kernel, vsrama-krishna.nemani, bhargava.marreddy,
rajashekar.hudumula, ajit.khaparde, dharmender.garg,
rahul-rg.gupta
In-Reply-To: <20260417074254.42f01fa7@kernel.org>
On Fri, Apr 17, 2026 at 8:12 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 17 Apr 2026 11:46:08 +0530 Vikas Gupta wrote:
> > > > -err_func_unrgtr:
> > > > - bnge_fw_unregister_dev(bd);
> > > > +err_free_ctx_mem:
> > > > + bnge_free_ctx_mem(bd);
> > > > return rc;
> > > > }
> > >
> > > This error path appears to have the same regression. If
> > > bnge_hwrm_func_drv_rgtr() fails after bnge_func_qcaps() has already
> > > configured the backing store, freeing the context memory directly without
> > > unregistering might allow the hardware to access freed memory.
> >
> > Even if bnge_hwrm_func_drv_rgtr() fails, it is still safe to free the context
> > memory at the host because the driver unloads from this point.
>
> Looking closer, indeed, the way bnge_hwrm_func_drv_unrgtr() is written
> the AI suggestion is pointless. Hopefully you're right cause debugging
> FW corrupting host memory after reboot on bnxt is not fun.
>
> > AI reviews appear to ignore logic related to handling context memory
> > in the patch.
> > I see no valid comments on the patch.
>
> Why is bnge_func_qcaps() allocating context mem? It may be the case
> that context mem has to be allocated but bnge_func_qcaps() doesn't
> sound like a function that'd perform such key part of init.
> Why not just move the alloc earlier in bnge_fw_register_dev() ?
I agree that bnge_func_qcaps(), which appears to be a query function,
should not allocate memory. I can refactor bnge_func_qcaps() and
move bnge_alloc_ctx_mem() to bnge_fw_register_dev() so that
bnge_func_qcaps() remains solely a query function.
I'll make changes in v3.
Thanks,
Vikas
^ permalink raw reply
* [PATCH bpf-next] selftests/bpf: drop xdping tool
From: Alexis Lothoré (eBPF Foundation) @ 2026-04-17 15:33 UTC (permalink / raw)
To: Andrii Nakryiko, Eduard Zingerman, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Kumar Kartikeya Dwivedi,
Song Liu, Yonghong Song, Jiri Olsa, Shuah Khan, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev
Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, linux-kernel, bpf,
linux-kselftest, netdev, Alan Maguire,
Alexis Lothoré (eBPF Foundation)
As part of a larger cleanup effort in the bpf selftests directory,
tests and scripts are either being converted to the test_progs framework
(so they are executed automatically in bpf CI), or removed if not
relevant for such integration.
The test_xdping.sh script (with the associated xdping.c) acts as an RTT
measurement tool, by attaching two small xdp programs to two interfaces.
Converting this test to test_progs may not make much sense:
- RTT measurement does not really fit in the scope of a functional test,
this is rather about measuring some performance level.
- there are other existing tests in test_progs that actively validate
XDP features like program attachment, return value processing, packet
modification, etc
Drop test_xdping.sh and the corresponding xdping.c userspace part. Keep
the ebpf part (xdping_kern.c), as it is used by another test integrated
in test_progs (btf_dump).
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
tools/testing/selftests/bpf/.gitignore | 1 -
tools/testing/selftests/bpf/Makefile | 3 -
tools/testing/selftests/bpf/test_xdping.sh | 103 ------------
tools/testing/selftests/bpf/xdping.c | 254 -----------------------------
4 files changed, 361 deletions(-)
diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index bfdc5518ecc8..986a6389186b 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -21,7 +21,6 @@ test_lirc_mode2_user
flow_dissector_load
test_tcpnotify_user
test_libbpf
-xdping
test_cpp
*.d
*.subskel.h
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 78e60040811e..00a986a7d088 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -111,7 +111,6 @@ TEST_FILES = xsk_prereqs.sh $(wildcard progs/btf_dump_test_case_*.c)
# Order correspond to 'make run_tests' order
TEST_PROGS := test_kmod.sh \
test_lirc_mode2.sh \
- test_xdping.sh \
test_bpftool_build.sh \
test_doc_build.sh \
test_xsk.sh \
@@ -134,7 +133,6 @@ TEST_GEN_PROGS_EXTENDED = \
xdp_features \
xdp_hw_metadata \
xdp_synproxy \
- xdping \
xskxceiver
TEST_GEN_FILES += $(TEST_KMODS) liburandom_read.so urandom_read sign-file uprobe_multi
@@ -320,7 +318,6 @@ $(OUTPUT)/test_tcpnotify_user: $(CGROUP_HELPERS) $(TESTING_HELPERS) $(TRACE_HELP
$(OUTPUT)/test_sock_fields: $(CGROUP_HELPERS) $(TESTING_HELPERS)
$(OUTPUT)/test_tag: $(TESTING_HELPERS)
$(OUTPUT)/test_lirc_mode2_user: $(TESTING_HELPERS)
-$(OUTPUT)/xdping: $(TESTING_HELPERS)
$(OUTPUT)/flow_dissector_load: $(TESTING_HELPERS)
$(OUTPUT)/test_maps: $(TESTING_HELPERS)
$(OUTPUT)/test_verifier: $(TESTING_HELPERS) $(CAP_HELPERS) $(UNPRIV_HELPERS)
diff --git a/tools/testing/selftests/bpf/test_xdping.sh b/tools/testing/selftests/bpf/test_xdping.sh
deleted file mode 100755
index c3d82e0a7378..000000000000
--- a/tools/testing/selftests/bpf/test_xdping.sh
+++ /dev/null
@@ -1,103 +0,0 @@
-#!/bin/bash
-# SPDX-License-Identifier: GPL-2.0
-
-# xdping tests
-# Here we setup and teardown configuration required to run
-# xdping, exercising its options.
-#
-# Setup is similar to test_tunnel tests but without the tunnel.
-#
-# Topology:
-# ---------
-# root namespace | tc_ns0 namespace
-# |
-# ---------- | ----------
-# | veth1 | --------- | veth0 |
-# ---------- peer ----------
-#
-# Device Configuration
-# --------------------
-# Root namespace with BPF
-# Device names and addresses:
-# veth1 IP: 10.1.1.200
-# xdp added to veth1, xdpings originate from here.
-#
-# Namespace tc_ns0 with BPF
-# Device names and addresses:
-# veth0 IPv4: 10.1.1.100
-# For some tests xdping run in server mode here.
-#
-
-readonly TARGET_IP="10.1.1.100"
-readonly TARGET_NS="xdp_ns0"
-
-readonly LOCAL_IP="10.1.1.200"
-
-setup()
-{
- ip netns add $TARGET_NS
- ip link add veth0 type veth peer name veth1
- ip link set veth0 netns $TARGET_NS
- ip netns exec $TARGET_NS ip addr add ${TARGET_IP}/24 dev veth0
- ip addr add ${LOCAL_IP}/24 dev veth1
- ip netns exec $TARGET_NS ip link set veth0 up
- ip link set veth1 up
-}
-
-cleanup()
-{
- set +e
- ip netns delete $TARGET_NS 2>/dev/null
- ip link del veth1 2>/dev/null
- if [[ $server_pid -ne 0 ]]; then
- kill -TERM $server_pid
- fi
-}
-
-test()
-{
- client_args="$1"
- server_args="$2"
-
- echo "Test client args '$client_args'; server args '$server_args'"
-
- server_pid=0
- if [[ -n "$server_args" ]]; then
- ip netns exec $TARGET_NS ./xdping $server_args &
- server_pid=$!
- sleep 10
- fi
- ./xdping $client_args $TARGET_IP
-
- if [[ $server_pid -ne 0 ]]; then
- kill -TERM $server_pid
- server_pid=0
- fi
-
- echo "Test client args '$client_args'; server args '$server_args': PASS"
-}
-
-set -e
-
-server_pid=0
-
-trap cleanup EXIT
-
-setup
-
-for server_args in "" "-I veth0 -s -S" ; do
- # client in skb mode
- client_args="-I veth1 -S"
- test "$client_args" "$server_args"
-
- # client with count of 10 RTT measurements.
- client_args="-I veth1 -S -c 10"
- test "$client_args" "$server_args"
-done
-
-# Test drv mode
-test "-I veth1 -N" "-I veth0 -s -N"
-test "-I veth1 -N -c 10" "-I veth0 -s -N"
-
-echo "OK. All tests passed"
-exit 0
diff --git a/tools/testing/selftests/bpf/xdping.c b/tools/testing/selftests/bpf/xdping.c
deleted file mode 100644
index 9ed8c796645d..000000000000
--- a/tools/testing/selftests/bpf/xdping.c
+++ /dev/null
@@ -1,254 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. */
-
-#include <linux/bpf.h>
-#include <linux/if_link.h>
-#include <arpa/inet.h>
-#include <assert.h>
-#include <errno.h>
-#include <signal.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <unistd.h>
-#include <libgen.h>
-#include <net/if.h>
-#include <sys/types.h>
-#include <sys/socket.h>
-#include <netdb.h>
-
-#include "bpf/bpf.h"
-#include "bpf/libbpf.h"
-
-#include "xdping.h"
-#include "testing_helpers.h"
-
-static int ifindex;
-static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
-
-static void cleanup(int sig)
-{
- bpf_xdp_detach(ifindex, xdp_flags, NULL);
- if (sig)
- exit(1);
-}
-
-static int get_stats(int fd, __u16 count, __u32 raddr)
-{
- struct pinginfo pinginfo = { 0 };
- char inaddrbuf[INET_ADDRSTRLEN];
- struct in_addr inaddr;
- __u16 i;
-
- inaddr.s_addr = raddr;
-
- printf("\nXDP RTT data:\n");
-
- if (bpf_map_lookup_elem(fd, &raddr, &pinginfo)) {
- perror("bpf_map_lookup elem");
- return 1;
- }
-
- for (i = 0; i < count; i++) {
- if (pinginfo.times[i] == 0)
- break;
-
- printf("64 bytes from %s: icmp_seq=%d ttl=64 time=%#.5f ms\n",
- inet_ntop(AF_INET, &inaddr, inaddrbuf,
- sizeof(inaddrbuf)),
- count + i + 1,
- (double)pinginfo.times[i]/1000000);
- }
-
- if (i < count) {
- fprintf(stderr, "Expected %d samples, got %d.\n", count, i);
- return 1;
- }
-
- bpf_map_delete_elem(fd, &raddr);
-
- return 0;
-}
-
-static void show_usage(const char *prog)
-{
- fprintf(stderr,
- "usage: %s [OPTS] -I interface destination\n\n"
- "OPTS:\n"
- " -c count Stop after sending count requests\n"
- " (default %d, max %d)\n"
- " -I interface interface name\n"
- " -N Run in driver mode\n"
- " -s Server mode\n"
- " -S Run in skb mode\n",
- prog, XDPING_DEFAULT_COUNT, XDPING_MAX_COUNT);
-}
-
-int main(int argc, char **argv)
-{
- __u32 mode_flags = XDP_FLAGS_DRV_MODE | XDP_FLAGS_SKB_MODE;
- struct addrinfo *a, hints = { .ai_family = AF_INET };
- __u16 count = XDPING_DEFAULT_COUNT;
- struct pinginfo pinginfo = { 0 };
- const char *optstr = "c:I:NsS";
- struct bpf_program *main_prog;
- int prog_fd = -1, map_fd = -1;
- struct sockaddr_in rin;
- struct bpf_object *obj;
- struct bpf_map *map;
- char *ifname = NULL;
- char filename[256];
- int opt, ret = 1;
- __u32 raddr = 0;
- int server = 0;
- char cmd[256];
-
- while ((opt = getopt(argc, argv, optstr)) != -1) {
- switch (opt) {
- case 'c':
- count = atoi(optarg);
- if (count < 1 || count > XDPING_MAX_COUNT) {
- fprintf(stderr,
- "min count is 1, max count is %d\n",
- XDPING_MAX_COUNT);
- return 1;
- }
- break;
- case 'I':
- ifname = optarg;
- ifindex = if_nametoindex(ifname);
- if (!ifindex) {
- fprintf(stderr, "Could not get interface %s\n",
- ifname);
- return 1;
- }
- break;
- case 'N':
- xdp_flags |= XDP_FLAGS_DRV_MODE;
- break;
- case 's':
- /* use server program */
- server = 1;
- break;
- case 'S':
- xdp_flags |= XDP_FLAGS_SKB_MODE;
- break;
- default:
- show_usage(basename(argv[0]));
- return 1;
- }
- }
-
- if (!ifname) {
- show_usage(basename(argv[0]));
- return 1;
- }
- if (!server && optind == argc) {
- show_usage(basename(argv[0]));
- return 1;
- }
-
- if ((xdp_flags & mode_flags) == mode_flags) {
- fprintf(stderr, "-N or -S can be specified, not both.\n");
- show_usage(basename(argv[0]));
- return 1;
- }
-
- if (!server) {
- /* Only supports IPv4; see hints initialization above. */
- if (getaddrinfo(argv[optind], NULL, &hints, &a) || !a) {
- fprintf(stderr, "Could not resolve %s\n", argv[optind]);
- return 1;
- }
- memcpy(&rin, a->ai_addr, sizeof(rin));
- raddr = rin.sin_addr.s_addr;
- freeaddrinfo(a);
- }
-
- /* Use libbpf 1.0 API mode */
- libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
-
- snprintf(filename, sizeof(filename), "%s_kern.bpf.o", argv[0]);
-
- if (bpf_prog_test_load(filename, BPF_PROG_TYPE_XDP, &obj, &prog_fd)) {
- fprintf(stderr, "load of %s failed\n", filename);
- return 1;
- }
-
- main_prog = bpf_object__find_program_by_name(obj,
- server ? "xdping_server" : "xdping_client");
- if (main_prog)
- prog_fd = bpf_program__fd(main_prog);
- if (!main_prog || prog_fd < 0) {
- fprintf(stderr, "could not find xdping program");
- return 1;
- }
-
- map = bpf_object__next_map(obj, NULL);
- if (map)
- map_fd = bpf_map__fd(map);
- if (!map || map_fd < 0) {
- fprintf(stderr, "Could not find ping map");
- goto done;
- }
-
- signal(SIGINT, cleanup);
- signal(SIGTERM, cleanup);
-
- printf("Setting up XDP for %s, please wait...\n", ifname);
-
- printf("XDP setup disrupts network connectivity, hit Ctrl+C to quit\n");
-
- if (bpf_xdp_attach(ifindex, prog_fd, xdp_flags, NULL) < 0) {
- fprintf(stderr, "Link set xdp fd failed for %s\n", ifname);
- goto done;
- }
-
- if (server) {
- close(prog_fd);
- close(map_fd);
- printf("Running server on %s; press Ctrl+C to exit...\n",
- ifname);
- do { } while (1);
- }
-
- /* Start xdping-ing from last regular ping reply, e.g. for a count
- * of 10 ICMP requests, we start xdping-ing using reply with seq number
- * 10. The reason the last "real" ping RTT is much higher is that
- * the ping program sees the ICMP reply associated with the last
- * XDP-generated packet, so ping doesn't get a reply until XDP is done.
- */
- pinginfo.seq = htons(count);
- pinginfo.count = count;
-
- if (bpf_map_update_elem(map_fd, &raddr, &pinginfo, BPF_ANY)) {
- fprintf(stderr, "could not communicate with BPF map: %s\n",
- strerror(errno));
- cleanup(0);
- goto done;
- }
-
- /* We need to wait for XDP setup to complete. */
- sleep(10);
-
- snprintf(cmd, sizeof(cmd), "ping -c %d -I %s %s",
- count, ifname, argv[optind]);
-
- printf("\nNormal ping RTT data\n");
- printf("[Ignore final RTT; it is distorted by XDP using the reply]\n");
-
- ret = system(cmd);
-
- if (!ret)
- ret = get_stats(map_fd, count, raddr);
-
- cleanup(0);
-
-done:
- if (prog_fd > 0)
- close(prog_fd);
- if (map_fd > 0)
- close(map_fd);
-
- return ret;
-}
---
base-commit: b7fb68124aa80db90394236a9a4a6add12f4425d
change-id: 20260417-xdping-5c2ef5a63899
Best regards,
--
Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
^ permalink raw reply related
* [PATCH v4 net 2/3] net: mlx5e: fix CWR handling in drivers to preserve ACE signal
From: chia-yu.chang @ 2026-04-17 15:26 UTC (permalink / raw)
To: linyunsheng, andrew+netdev, parav, jasowang, mst, shenjian15,
salil.mehta, shaojijie, saeedm, tariqt, mbloch, leonro,
linux-rdma, netdev, davem, edumazet, kuba, pabeni, horms, ij,
ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
In-Reply-To: <20260417152642.71674-1-chia-yu.chang@nokia-bell-labs.com>
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Currently, mlx5 Rx paths set the SKB_GSO_TCP_ECN flag when a TCP segment
has the CWR flag set. This is wrong because SKB_GSO_TCP_ECN is only
valid for RFC3168 ECN on Tx, and using it on Rx allows RFC3168 ECN
offload to clear the CWR flag. As a result, incoming TCP segments
may lose their ACE signal integrity required for AccECN (RFC9768),
especially when the packet is forwarded and later re-segmented by GSO.
Fix this by setting SKB_GSO_TCP_ACCECN for any Rx segment with the CWR
flag set. SKB_GSO_TCP_ACCECN ensures that RFC3168 ECN offload will
not clear the CWR flag, therefore preserving the ACE signal.
Fixes: 92552d3abd329 ("net/mlx5e: HW_GRO cqe handler implementation")
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5b60aa47c75b..9b1c80079532 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1180,7 +1180,7 @@ static void mlx5e_shampo_update_ipv4_tcp_hdr(struct mlx5e_rq *rq, struct iphdr *
skb->csum_offset = offsetof(struct tcphdr, check);
if (tcp->cwr)
- skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
+ skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN;
}
static void mlx5e_shampo_update_ipv6_tcp_hdr(struct mlx5e_rq *rq, struct ipv6hdr *ipv6,
@@ -1201,7 +1201,7 @@ static void mlx5e_shampo_update_ipv6_tcp_hdr(struct mlx5e_rq *rq, struct ipv6hdr
skb->csum_offset = offsetof(struct tcphdr, check);
if (tcp->cwr)
- skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
+ skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN;
}
static void mlx5e_shampo_update_hdr(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe, bool match)
--
2.34.1
^ permalink raw reply related
* [PATCH v4 net 3/3] net: hns3: fix CWR handling in drivers to preserve ACE signal
From: chia-yu.chang @ 2026-04-17 15:26 UTC (permalink / raw)
To: linyunsheng, andrew+netdev, parav, jasowang, mst, shenjian15,
salil.mehta, shaojijie, saeedm, tariqt, mbloch, leonro,
linux-rdma, netdev, davem, edumazet, kuba, pabeni, horms, ij,
ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
In-Reply-To: <20260417152642.71674-1-chia-yu.chang@nokia-bell-labs.com>
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Currently, hns3 Rx paths set the SKB_GSO_TCP_ECN flag when a TCP segment
has the CWR flag set. This is wrong because SKB_GSO_TCP_ECN is only
valid for RFC3168 ECN on Tx, and using it on Rx allows RFC3168 ECN
offload to clear the CWR flag. As a result, incoming TCP segments
lose their ACE signal integrity required for AccECN (RFC9768),
especially when the packet is forwarded and later re-segmented by GSO.
Fix this by setting SKB_GSO_TCP_ACCECN for any Rx segment with the CWR
flag set. SKB_GSO_TCP_ACCECN ensures that RFC3168 ECN offload will
not clear the CWR flag, therefore preserving the ACE signal.
Fixes: d474d88f88261 ("net: hns3: add hns3_gro_complete for HW GRO process")
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index a3206c97923e..e1b0dba56182 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -3904,7 +3904,7 @@ static int hns3_gro_complete(struct sk_buff *skb, u32 l234info)
skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
if (th->cwr)
- skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
+ skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN;
if (l234info & BIT(HNS3_RXD_GRO_FIXID_B))
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_FIXEDID;
--
2.34.1
^ permalink raw reply related