Netdev List
* Re: [PATCH v1] net: liquidio: resolve VF pci_dev on demand for FLR requests
From: Simon Horman @ 2026-04-21 15:33 UTC (permalink / raw)
  To: Yuho Choi
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Andrew Lunn, Eric Dumazet, Kory Maincent, Vadim Fedorenko,
	Marco Crivellari, linux-kernel, Myeonghun Pak, Ijae Kim,
	Taegyu Kim
In-Reply-To: <20260420023304.57105-1-dbgh9129@gmail.com>

On Sun, Apr 19, 2026 at 10:33:04PM -0400, Yuho Choi wrote:
> The PF SR-IOV enable path caches VF pci_dev pointers in
> dpiring_to_vfpcidev_lut[] by iterating with pci_get_device(). Those
> entries do not own a reference, because the iterator drops the previous
> device reference on each step. The cached pointer is then dereferenced
> later when handling OCTEON_VF_FLR_REQUEST.
> 
> This can leave stale VF pci_dev pointers in the lookup table and makes
> the FLR path rely on a PCI device object whose lifetime is not pinned.
> 
> Drop the long-lived lookup table and resolve the VF pci_dev only when an
> FLR request arrives. Use the PF's SR-IOV metadata to derive the VF's
> bus/devfn, get a referenced pci_dev for immediate use, issue the FLR,
> and then drop the reference.
> 
> Fixes: ca6139ffc67ee ("liquidio CN23XX: sysfs VF config support")
> Fixes: 8c978d059224 ("liquidio CN23XX: Mailbox support")
> Co-developed-by: Myeonghun Pak <mhun512@gmail.com>
> Signed-off-by: Myeonghun Pak <mhun512@gmail.com>
> Co-developed-by: Ijae Kim <ae878000@gmail.com>
> Signed-off-by: Ijae Kim <ae878000@gmail.com>
> Co-developed-by: Taegyu Kim <tmk5904@psu.edu>
> Signed-off-by: Taegyu Kim <tmk5904@psu.edu>
> Signed-off-by: Yuho Choi <dbgh9129@gmail.com>

As this fixes code present in the net tree, it should be targeted
at that tree, like this:

Subject: [PATCH net] ...

In this case the CI defaulted to the net-next tree, which might be
harmless. But please keep this in mind for next time.

...

> diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
> index ad685f5d0a136..b967c7928b4a7 100644
> --- a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
> +++ b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
> @@ -26,6 +26,29 @@
>  #include "octeon_mailbox.h"
>  #include "cn23xx_pf_device.h"
>  
> +static struct pci_dev *lio_vf_pci_dev_by_qno(struct octeon_device *oct, u32 q_no)
> +{
> +	int vfidx, bus, devfn;
> +
> +	if (!oct->sriov_info.rings_per_vf)
> +		return NULL;
> +
> +	if (q_no % oct->sriov_info.rings_per_vf)
> +		return NULL;
> +
> +	vfidx = q_no / oct->sriov_info.rings_per_vf;
> +	if (vfidx >= oct->sriov_info.num_vfs_alloced)
> +		return NULL;
> +
> +	bus = pci_iov_virtfn_bus(oct->pci_dev, vfidx);

When applied against net-next this causes a linker error with x86_64
allmodconfig (at least) because pci_iov_virtfn_bus is not defined.

> +	devfn = pci_iov_virtfn_devfn(oct->pci_dev, vfidx);
> +	if (bus < 0 || devfn < 0)
> +		return NULL;
> +
> +	return pci_get_domain_bus_and_slot(pci_domain_nr(oct->pci_dev->bus),
> +					   bus, devfn);
> +}
> +
>  /**
>   * octeon_mbox_read:
>   * @mbox: Pointer mailbox

-- 
pw-bot: changes-requested

^ permalink raw reply
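
The bus/devfn derivation that the quoted lio_vf_pci_dev_by_qno() relies on
can be sketched in user space. This is a minimal sketch, assuming the usual
SR-IOV rule that a VF's routing ID is the PF's routing ID plus the First VF
Offset plus the VF Stride times the VF index. The names pf_info, vf_bus(),
vf_devfn() and vf_index_from_queue() are hypothetical stand-ins and not the
kernel's API; the queue-to-VF checks mirror the ones in the quoted hunk.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the PF fields the computation needs. */
struct pf_info {
	uint8_t  bus;     /* PF bus number */
	uint8_t  devfn;   /* PF device/function byte */
	uint16_t offset;  /* First VF Offset (assumed read from the SR-IOV capability) */
	uint16_t stride;  /* VF Stride (assumed read from the SR-IOV capability) */
};

/* Distance of the VF's routing ID from devfn 0 on the PF's bus. */
static unsigned int vf_rid_delta(const struct pf_info *pf, unsigned int vf_id)
{
	return pf->devfn + pf->offset + pf->stride * vf_id;
}

/* Bus number of the VF: every 256 routing IDs spill onto the next bus. */
static unsigned int vf_bus(const struct pf_info *pf, unsigned int vf_id)
{
	return pf->bus + (vf_rid_delta(pf, vf_id) >> 8);
}

/* devfn of the VF: the low 8 bits of the routing-ID distance. */
static unsigned int vf_devfn(const struct pf_info *pf, unsigned int vf_id)
{
	return vf_rid_delta(pf, vf_id) & 0xff;
}

/*
 * Map a queue number to a VF index the way the quoted hunk does: only the
 * first ring of each VF maps, and the index must be below the number of
 * allocated VFs. Returns -1 when the queue does not identify a VF.
 */
static int vf_index_from_queue(unsigned int q_no, unsigned int rings_per_vf,
			       unsigned int num_vfs)
{
	if (!rings_per_vf || q_no % rings_per_vf)
		return -1;
	if (q_no / rings_per_vf >= num_vfs)
		return -1;
	return (int)(q_no / rings_per_vf);
}
```

With an illustrative offset of 0x80 and stride of 2 on bus 3, VF 0 lands on
bus 3 and VF 64 wraps to bus 4, which is why both the bus and the devfn have
to be derived per VF rather than assumed to match the PF's bus.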

* [PATCH 3/3] selftests: mptcp: cover RECVERR and MSG_ERRQUEUE
From: David Carlier @ 2026-04-21 15:22 UTC (permalink / raw)
  To: netdev, mptcp
  Cc: matttbe, martineau, geliang, davem, edumazet, kuba, pabeni, horms,
	David Carlier
In-Reply-To: <20260421152216.38127-1-devnexen@gmail.com>

Add MPTCP selftest coverage for RECVERR sockopt round-trips and
parent-socket MSG_ERRQUEUE delivery.

Enable TX software timestamping, send data over an MPTCP socket, wait
for POLLERR, and verify that recvmsg(MSG_ERRQUEUE) returns timestamping
metadata on the MPTCP parent socket.

Signed-off-by: David Carlier <devnexen@gmail.com>
Assisted-by: Codex:gpt-5
---
 .../selftests/net/mptcp/mptcp_sockopt.c       | 152 ++++++++++++++++++
 1 file changed, 152 insertions(+)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_sockopt.c b/tools/testing/selftests/net/mptcp/mptcp_sockopt.c
index b6e58d936ebe..b499e7585d38 100644
--- a/tools/testing/selftests/net/mptcp/mptcp_sockopt.c
+++ b/tools/testing/selftests/net/mptcp/mptcp_sockopt.c
@@ -26,9 +26,17 @@
 
 #include <linux/tcp.h>
 #include <linux/compiler.h>
+#include <linux/errqueue.h>
+#include <linux/net_tstamp.h>
+
+#include <poll.h>
 
 static int pf = AF_INET;
 
+#ifndef SCM_TIMESTAMPING
+#define SCM_TIMESTAMPING SO_TIMESTAMPING
+#endif
+
 #ifndef IPPROTO_MPTCP
 #define IPPROTO_MPTCP 262
 #endif
@@ -128,6 +136,9 @@ struct so_state {
 #define MIN(a, b) ((a) < (b) ? (a) : (b))
 #endif
 
+static void enable_tx_timestamping(int fd);
+static void test_msg_errqueue_timestamping(int fd);
+
 static void __noreturn die_perror(const char *msg)
 {
 	perror(msg);
@@ -598,6 +609,8 @@ static void connect_one_server(int fd, int pipefd)
 
 	assert(strncmp(buf2, "xmit", 4) == 0);
 
+	enable_tx_timestamping(fd);
+
 	ret = write(fd, buf, len);
 	if (ret < 0)
 		die_perror("write");
@@ -605,6 +618,8 @@ static void connect_one_server(int fd, int pipefd)
 	if (ret != (ssize_t)len)
 		xerror("short write");
 
+	test_msg_errqueue_timestamping(fd);
+
 	total = 0;
 	do {
 		ret = read(fd, buf2 + total, sizeof(buf2) - total);
@@ -769,6 +784,142 @@ static void test_ip_tos_sockopt(int fd)
 		xerror("expect socklen_t == -1");
 }
 
+static void test_ip_recverr_sockopt(int fd)
+{
+	struct iovec iov = {
+		.iov_base = &(char){ 0 },
+		.iov_len = 1,
+	};
+	struct msghdr msg = {
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+	};
+	int one = 1, zero = 0, val = -1;
+	socklen_t s = sizeof(val);
+	int level, optname, r;
+
+	switch (pf) {
+	case AF_INET:
+		level = SOL_IP;
+		optname = IP_RECVERR;
+		break;
+	case AF_INET6:
+		level = SOL_IPV6;
+		optname = IPV6_RECVERR;
+		break;
+	default:
+		xerror("Unknown pf %d\n", pf);
+	}
+
+	r = setsockopt(fd, level, optname, &one, sizeof(one));
+	if (r)
+		die_perror("setsockopt recverr on");
+
+	r = getsockopt(fd, level, optname, &val, &s);
+	if (r)
+		die_perror("getsockopt recverr on");
+	if (s != sizeof(val) || val != one)
+		xerror("recverr on mismatch val=%d len=%u", val, s);
+
+	r = recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT);
+	if (r != -1 || errno != EAGAIN)
+		xerror("expected empty errqueue to return EAGAIN, ret=%d errno=%d", r, errno);
+
+	r = setsockopt(fd, level, optname, &zero, sizeof(zero));
+	if (r)
+		die_perror("setsockopt recverr off");
+
+	val = -1;
+	s = sizeof(val);
+	r = getsockopt(fd, level, optname, &val, &s);
+	if (r)
+		die_perror("getsockopt recverr off");
+	if (s != sizeof(val) || val != zero)
+		xerror("recverr off mismatch val=%d len=%u", val, s);
+}
+
+static void enable_tx_timestamping(int fd)
+{
+	int val = SOF_TIMESTAMPING_SOFTWARE |
+		  SOF_TIMESTAMPING_TX_SOFTWARE |
+		  SOF_TIMESTAMPING_OPT_TSONLY;
+	int ret;
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING_OLD,
+			 &val, sizeof(val));
+	if (ret)
+		die_perror("setsockopt SO_TIMESTAMPING");
+}
+
+static void test_msg_errqueue_timestamping(int fd)
+{
+	char ctrl[512] = { 0 };
+	char data[32] = { 0 };
+	struct iovec iov = {
+		.iov_base = data,
+		.iov_len = sizeof(data),
+	};
+	struct msghdr msg = {
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+		.msg_control = ctrl,
+		.msg_controllen = sizeof(ctrl),
+	};
+	struct pollfd pfd = {
+		.fd = fd,
+		.events = POLLERR,
+	};
+	struct cmsghdr *cm;
+	struct scm_timestamping *tss = NULL;
+	struct sock_extended_err *serr = NULL;
+	int ret, i;
+
+	for (i = 0; i < 10; i++) {
+		ret = poll(&pfd, 1, 1000);
+		if (ret < 0)
+			die_perror("poll errqueue");
+		if (ret == 0)
+			continue;
+		if (!(pfd.revents & POLLERR))
+			xerror("expected POLLERR, got revents %#x", pfd.revents);
+		break;
+	}
+
+	if (i == 10)
+		xerror("timed out waiting for MSG_ERRQUEUE event");
+
+	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+	if (ret < 0)
+		die_perror("recvmsg timestamping errqueue");
+	if (!(msg.msg_flags & MSG_ERRQUEUE))
+		xerror("expected MSG_ERRQUEUE in msg_flags, got %#x",
+		       msg.msg_flags);
+
+	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if (cm->cmsg_level == SOL_SOCKET &&
+		    cm->cmsg_type == SCM_TIMESTAMPING)
+			tss = (void *)CMSG_DATA(cm);
+		if ((cm->cmsg_level == SOL_IP &&
+		     cm->cmsg_type == IP_RECVERR) ||
+		    (cm->cmsg_level == SOL_IPV6 &&
+		     cm->cmsg_type == IPV6_RECVERR))
+			serr = (void *)CMSG_DATA(cm);
+	}
+
+	if (!tss)
+		xerror("missing SCM_TIMESTAMPING cmsg");
+	if (!serr)
+		xerror("missing sock_extended_err cmsg");
+	if (serr->ee_errno != ENOMSG ||
+	    serr->ee_origin != SO_EE_ORIGIN_TIMESTAMPING)
+		xerror("unexpected timestamping err ee_errno=%u ee_origin=%u",
+		       serr->ee_errno, serr->ee_origin);
+	if (!tss->ts[0].tv_sec && !tss->ts[0].tv_nsec &&
+	    !tss->ts[1].tv_sec && !tss->ts[1].tv_nsec &&
+	    !tss->ts[2].tv_sec && !tss->ts[2].tv_nsec)
+		xerror("all timestamp slots are zero");
+}
+
 static int client(int pipefd)
 {
 	int fd = -1;
@@ -787,6 +938,7 @@ static int client(int pipefd)
 	}
 
 	test_ip_tos_sockopt(fd);
+	test_ip_recverr_sockopt(fd);
 
 	connect_one_server(fd, pipefd);
 
-- 
2.53.0


^ permalink raw reply related
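
The control-message walk that the selftest above performs after
recvmsg(MSG_ERRQUEUE) can be isolated into a small helper. This is a minimal
user-space sketch, assuming a POSIX/Linux <sys/socket.h>; find_cmsg() and
cmsg_copy_payload() are hypothetical helper names that do not appear in the
patch. Copying the payload out with memcpy() instead of casting CMSG_DATA()
is a design choice of this sketch that sidesteps alignment concerns.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Return the first control message matching level/type, or NULL if none. */
static struct cmsghdr *find_cmsg(struct msghdr *msg, int level, int type)
{
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == level && cm->cmsg_type == type)
			return cm;
	}
	return NULL;
}

/*
 * Copy len bytes of payload out of a control message.
 * Returns 0 on success, -1 if the message is too short to hold len bytes.
 */
static int cmsg_copy_payload(struct cmsghdr *cm, void *dst, size_t len)
{
	if (cm->cmsg_len < CMSG_LEN(len))
		return -1;
	memcpy(dst, CMSG_DATA(cm), len);
	return 0;
}
```

A caller in the style of the selftest would look up (SOL_SOCKET,
SCM_TIMESTAMPING) and the family-specific RECVERR message separately, then
treat a NULL result as a missing cmsg.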

* [PATCH 2/3] mptcp: support MSG_ERRQUEUE on the parent socket
From: David Carlier @ 2026-04-21 15:22 UTC (permalink / raw)
  To: netdev, mptcp
  Cc: matttbe, martineau, geliang, davem, edumazet, kuba, pabeni, horms,
	David Carlier
In-Reply-To: <20260421152216.38127-1-devnexen@gmail.com>

Handle MSG_ERRQUEUE on the MPTCP socket by selecting a subflow with
pending errqueue data, moving one error skb to the parent socket, and
consuming it through the parent socket ABI.

This surfaces subflow errqueue activity through poll(), keeps the
userspace ABI tied to the socket being used, and restores the skb to
the subflow errqueue if requeueing to the parent fails under rmem
pressure.

Signed-off-by: David Carlier <devnexen@gmail.com>
Assisted-by: Codex:gpt-5
---
 net/mptcp/protocol.c | 121 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 103 insertions(+), 18 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index fbffd3a43fe8..1b2e3bede122 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -819,26 +819,29 @@ static bool __mptcp_subflow_error_report(struct sock *sk, struct sock *ssk)
 {
 	int ssk_state;
 	int err;
+	bool has_errqueue;
 
-	/* only propagate errors on fallen-back sockets or
-	 * on MPC connect
-	 */
-	if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(mptcp_sk(sk)))
-		return false;
-
+	has_errqueue = !skb_queue_empty_lockless(&ssk->sk_error_queue);
 	err = sock_error(ssk);
-	if (!err)
+	if (!err && !has_errqueue)
 		return false;
 
-	/* We need to propagate only transition to CLOSE state.
-	 * Orphaned socket will see such state change via
-	 * subflow_sched_work_if_closed() and that path will properly
-	 * destroy the msk as needed.
+	/* Errqueue notifications should wake poll()/recvmsg(MSG_ERRQUEUE) on
+	 * the MPTCP socket, but only fallback sockets and the MPC connect path
+	 * inherit TCP's sk_err semantics.
 	 */
-	ssk_state = inet_sk_state_load(ssk);
-	if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
-		mptcp_set_state(sk, ssk_state);
-	WRITE_ONCE(sk->sk_err, -err);
+	if (err &&
+	    (sk->sk_state == TCP_SYN_SENT || __mptcp_check_fallback(mptcp_sk(sk)))) {
+		/* We need to propagate only transition to CLOSE state.
+		 * Orphaned socket will see such state change via
+		 * subflow_sched_work_if_closed() and that path will properly
+		 * destroy the msk as needed.
+		 */
+		ssk_state = inet_sk_state_load(ssk);
+		if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
+			mptcp_set_state(sk, ssk_state);
+		WRITE_ONCE(sk->sk_err, -err);
+	}
 
 	/* This barrier is coupled with smp_rmb() in mptcp_poll() */
 	smp_wmb();
@@ -2286,6 +2289,68 @@ static unsigned int mptcp_inq_hint(const struct sock *sk)
 	return 0;
 }
 
+static struct sock *mptcp_pick_errqueue_subflow(struct sock *sk)
+{
+	struct mptcp_subflow_context *subflow;
+	struct sock *ssk = NULL;
+
+	lock_sock(sk);
+	mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
+		struct sock *subflow_sk = mptcp_subflow_tcp_sock(subflow);
+
+		if (skb_queue_empty_lockless(&subflow_sk->sk_error_queue))
+			continue;
+
+		if (!refcount_inc_not_zero(&subflow_sk->sk_refcnt))
+			continue;
+
+		ssk = subflow_sk;
+		break;
+	}
+	release_sock(sk);
+
+	return ssk;
+}
+
+static bool mptcp_has_error_queue(const struct sock *sk)
+{
+	return !skb_queue_empty_lockless(&sk->sk_error_queue);
+}
+
+static int mptcp_recv_error(struct sock *sk, struct msghdr *msg, int len)
+{
+	struct sk_buff *skb;
+	struct sock *ssk;
+	int ret, ret2;
+
+	if (mptcp_has_error_queue(sk))
+		return inet_recv_error(sk, msg, len);
+
+	ssk = mptcp_pick_errqueue_subflow(sk);
+	if (!ssk)
+		return -EAGAIN;
+
+	skb = sock_dequeue_err_skb(ssk);
+	if (!skb)
+		goto put_ssk;
+
+	ret = sock_queue_err_skb(sk, skb);
+	if (ret) {
+		ret2 = sock_queue_err_skb(ssk, skb);
+		sock_put(ssk);
+		if (ret2)
+			kfree_skb(skb);
+		return ret;
+	}
+
+	sock_put(ssk);
+	return inet_recv_error(sk, msg, len);
+
+put_ssk:
+	sock_put(ssk);
+	return -EAGAIN;
+}
+
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int flags)
 {
@@ -2295,9 +2360,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	int target;
 	long timeo;
 
-	/* MSG_ERRQUEUE is really a no-op till we support IP_RECVERR */
 	if (unlikely(flags & MSG_ERRQUEUE))
-		return inet_recv_error(sk, msg, len);
+		return mptcp_recv_error(sk, msg, len);
 
 	lock_sock(sk);
 	if (unlikely(sk->sk_state == TCP_LISTEN)) {
@@ -4296,6 +4360,26 @@ static __poll_t mptcp_check_writeable(struct mptcp_sock *msk)
 	return 0;
 }
 
+static bool mptcp_subflow_has_error(struct sock *sk)
+{
+	struct mptcp_subflow_context *subflow;
+	bool has_error = false;
+
+	mptcp_data_lock(sk);
+	mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
+		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+
+		if (READ_ONCE(ssk->sk_err) ||
+		    !skb_queue_empty_lockless(&ssk->sk_error_queue)) {
+			has_error = true;
+			break;
+		}
+	}
+	mptcp_data_unlock(sk);
+
+	return has_error;
+}
+
 static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 			   struct poll_table_struct *wait)
 {
@@ -4339,7 +4423,8 @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 
 	/* This barrier is coupled with smp_wmb() in __mptcp_error_report() */
 	smp_rmb();
-	if (READ_ONCE(sk->sk_err))
+	if (READ_ONCE(sk->sk_err) || mptcp_has_error_queue(sk) ||
+	    mptcp_subflow_has_error(sk))
 		mask |= EPOLLERR;
 
 	return mask;
-- 
2.53.0


^ permalink raw reply related
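
The move-with-rollback step described in the commit message above (take one
error entry from a subflow, queue it on the parent, and put it back if the
parent refuses it) can be modelled in user space. This is only an analogy
under stated assumptions: struct errq, its budget field (standing in for
receive-memory limits) and the errq_*() helpers are hypothetical and are not
kernel interfaces.

```c
#include <stddef.h>

#define QCAP 4	/* storage size for this sketch only */

/* Hypothetical bounded FIFO standing in for a socket error queue. */
struct errq {
	int items[QCAP];
	size_t len;
	size_t budget;	/* max entries accepted; models memory pressure */
};

/* Append an entry; fails with -1 when the budget or storage is exhausted. */
static int errq_push(struct errq *q, int item)
{
	if (q->len >= q->budget || q->len >= QCAP)
		return -1;
	q->items[q->len++] = item;
	return 0;
}

/* Remove the oldest entry; fails with -1 when the queue is empty. */
static int errq_pop(struct errq *q, int *item)
{
	size_t i;

	if (!q->len)
		return -1;
	*item = q->items[0];
	for (i = 1; i < q->len; i++)
		q->items[i - 1] = q->items[i];
	q->len--;
	return 0;
}

/*
 * Move one entry from child to parent.
 * Returns 0 on success, -1 if the child is empty, -2 if the parent refused
 * the entry and it was restored to the child.
 */
static int errq_move_one(struct errq *parent, struct errq *child)
{
	int item;

	if (errq_pop(child, &item))
		return -1;
	if (errq_push(parent, item)) {
		/* Cannot fail here: the pop above just freed a slot. */
		(void)errq_push(child, item);
		return -2;
	}
	return 0;
}
```

Unlike this model, the restore step in the patch can itself fail, in which
case the patch frees the skb. The sketch also restores at the tail, which
only preserves ordering when the child held a single entry.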

* [PATCH 1/3] mptcp: propagate RECVERR sockopts to subflows
From: David Carlier @ 2026-04-21 15:22 UTC (permalink / raw)
  To: netdev, mptcp
  Cc: matttbe, martineau, geliang, davem, edumazet, kuba, pabeni, horms,
	David Carlier
In-Reply-To: <20260421152216.38127-1-devnexen@gmail.com>

Propagate IP_RECVERR/IP_RECVERR_RFC4884 and
IPV6_RECVERR/IPV6_RECVERR_RFC4884 from the MPTCP socket to
existing and future subflows.

Apply the matching sockopt according to the subflow family so mixed-
family subflows stay aligned with the parent socket configuration,
including disable-time errqueue purge semantics.

Signed-off-by: David Carlier <devnexen@gmail.com>
Assisted-by: Codex:gpt-5
---
 net/mptcp/sockopt.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 125 insertions(+)

diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index de90a2897d2d..b2b7ef888dff 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -8,6 +8,8 @@
 
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <net/ip.h>
+#include <net/ipv6.h>
 #include <net/sock.h>
 #include <net/protocol.h>
 #include <net/tcp.h>
@@ -384,6 +386,70 @@ static int mptcp_setsockopt_sol_socket(struct mptcp_sock *msk, int optname,
 	return -EOPNOTSUPP;
 }
 
+static bool mptcp_recverr_enabled(const struct sock *sk, bool rfc4884)
+{
+	bool enabled;
+
+	enabled = rfc4884 ? inet_test_bit(RECVERR_RFC4884, sk) :
+			    inet_test_bit(RECVERR, sk);
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6)
+		enabled |= rfc4884 ? inet6_test_bit(RECVERR6_RFC4884, sk) :
+				     inet6_test_bit(RECVERR6, sk);
+#endif
+
+	return enabled;
+}
+
+static int mptcp_subflow_set_recverr(struct sock *sk, struct sock *ssk,
+				     bool rfc4884)
+{
+	int level, optname, val;
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (ssk->sk_family == AF_INET6) {
+		level = SOL_IPV6;
+		optname = rfc4884 ? IPV6_RECVERR_RFC4884 : IPV6_RECVERR;
+	} else
+#endif
+	{
+		level = SOL_IP;
+		optname = rfc4884 ? IP_RECVERR_RFC4884 : IP_RECVERR;
+	}
+
+	val = mptcp_recverr_enabled(sk, rfc4884);
+	return tcp_setsockopt(ssk, level, optname, KERNEL_SOCKPTR(&val),
+			      sizeof(val));
+}
+
+static int mptcp_setsockopt_v6_recverr(struct mptcp_sock *msk, int optname,
+				       sockptr_t optval, unsigned int optlen)
+{
+	struct mptcp_subflow_context *subflow;
+	struct sock *sk = (struct sock *)msk;
+	int ret;
+
+	ret = ipv6_setsockopt(sk, SOL_IPV6, optname, optval, optlen);
+	if (ret)
+		return ret;
+
+	lock_sock(sk);
+	sockopt_seq_inc(msk);
+	mptcp_for_each_subflow(msk, subflow) {
+		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+		bool rfc4884 = optname == IPV6_RECVERR_RFC4884;
+
+		ret = mptcp_subflow_set_recverr(sk, ssk, rfc4884);
+		if (ret)
+			break;
+		subflow->setsockopt_seq = msk->setsockopt_seq;
+	}
+	release_sock(sk);
+
+	return ret;
+}
+
 static int mptcp_setsockopt_v6(struct mptcp_sock *msk, int optname,
 			       sockptr_t optval, unsigned int optlen)
 {
@@ -426,6 +492,10 @@ static int mptcp_setsockopt_v6(struct mptcp_sock *msk, int optname,
 
 		release_sock(sk);
 		break;
+	case IPV6_RECVERR:
+	case IPV6_RECVERR_RFC4884:
+		ret = mptcp_setsockopt_v6_recverr(msk, optname, optval, optlen);
+		break;
 	}
 
 	return ret;
@@ -760,6 +830,33 @@ static int mptcp_setsockopt_v4_set_tos(struct mptcp_sock *msk, int optname,
 	return 0;
 }
 
+static int mptcp_setsockopt_v4_recverr(struct mptcp_sock *msk, int optname,
+				       sockptr_t optval, unsigned int optlen)
+{
+	struct mptcp_subflow_context *subflow;
+	struct sock *sk = (struct sock *)msk;
+	int err;
+
+	err = ip_setsockopt(sk, SOL_IP, optname, optval, optlen);
+	if (err)
+		return err;
+
+	lock_sock(sk);
+	sockopt_seq_inc(msk);
+	mptcp_for_each_subflow(msk, subflow) {
+		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+		bool rfc4884 = optname == IP_RECVERR_RFC4884;
+
+		err = mptcp_subflow_set_recverr(sk, ssk, rfc4884);
+		if (err)
+			break;
+		subflow->setsockopt_seq = msk->setsockopt_seq;
+	}
+	release_sock(sk);
+
+	return err;
+}
+
 static int mptcp_setsockopt_v4(struct mptcp_sock *msk, int optname,
 			       sockptr_t optval, unsigned int optlen)
 {
@@ -771,6 +868,9 @@ static int mptcp_setsockopt_v4(struct mptcp_sock *msk, int optname,
 		return mptcp_setsockopt_sol_ip_set(msk, optname, optval, optlen);
 	case IP_TOS:
 		return mptcp_setsockopt_v4_set_tos(msk, optname, optval, optlen);
+	case IP_RECVERR:
+	case IP_RECVERR_RFC4884:
+		return mptcp_setsockopt_v4_recverr(msk, optname, optval, optlen);
 	}
 
 	return -EOPNOTSUPP;
@@ -1459,6 +1559,12 @@ static int mptcp_getsockopt_v4(struct mptcp_sock *msk, int optname,
 	case IP_LOCAL_PORT_RANGE:
 		return mptcp_put_int_option(msk, optval, optlen,
 				READ_ONCE(inet_sk(sk)->local_port_range));
+	case IP_RECVERR:
+		return mptcp_put_int_option(msk, optval, optlen,
+				inet_test_bit(RECVERR, sk));
+	case IP_RECVERR_RFC4884:
+		return mptcp_put_int_option(msk, optval, optlen,
+				inet_test_bit(RECVERR_RFC4884, sk));
 	}
 
 	return -EOPNOTSUPP;
@@ -1479,6 +1585,12 @@ static int mptcp_getsockopt_v6(struct mptcp_sock *msk, int optname,
 	case IPV6_FREEBIND:
 		return mptcp_put_int_option(msk, optval, optlen,
 					    inet_test_bit(FREEBIND, sk));
+	case IPV6_RECVERR:
+		return mptcp_put_int_option(msk, optval, optlen,
+					    inet6_test_bit(RECVERR6, sk));
+	case IPV6_RECVERR_RFC4884:
+		return mptcp_put_int_option(msk, optval, optlen,
+					    inet6_test_bit(RECVERR6_RFC4884, sk));
 	}
 
 	return -EOPNOTSUPP;
@@ -1536,6 +1648,7 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
 {
 	static const unsigned int tx_rx_locks = SOCK_RCVBUF_LOCK | SOCK_SNDBUF_LOCK;
 	struct sock *sk = (struct sock *)msk;
+	bool recverr, recverr_rfc4884;
 	bool keep_open;
 
 	keep_open = sock_flag(sk, SOCK_KEEPOPEN);
@@ -1586,6 +1699,18 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
 	inet_assign_bit(FREEBIND, ssk, inet_test_bit(FREEBIND, sk));
 	inet_assign_bit(BIND_ADDRESS_NO_PORT, ssk, inet_test_bit(BIND_ADDRESS_NO_PORT, sk));
 	WRITE_ONCE(inet_sk(ssk)->local_port_range, READ_ONCE(inet_sk(sk)->local_port_range));
+	recverr = mptcp_recverr_enabled(sk, false);
+	recverr_rfc4884 = mptcp_recverr_enabled(sk, true);
+#if IS_ENABLED(CONFIG_IPV6)
+	if (ssk->sk_family == AF_INET6) {
+		inet6_assign_bit(RECVERR6, ssk, recverr);
+		inet6_assign_bit(RECVERR6_RFC4884, ssk, recverr_rfc4884);
+	} else
+#endif
+	{
+		inet_assign_bit(RECVERR, ssk, recverr);
+		inet_assign_bit(RECVERR_RFC4884, ssk, recverr_rfc4884);
+	}
 }
 
 void mptcp_sockopt_sync_locked(struct mptcp_sock *msk, struct sock *ssk)
-- 
2.53.0


^ permalink raw reply related

* [PATCH 0/3] mptcp: add RECVERR and MSG_ERRQUEUE support
From: David Carlier @ 2026-04-21 15:22 UTC (permalink / raw)
  To: netdev, mptcp
  Cc: matttbe, martineau, geliang, davem, edumazet, kuba, pabeni, horms,
	David Carlier

MPTCP already advertises IP_RECVERR/IPV6_RECVERR as supported, but the
parent socket does not currently provide usable MSG_ERRQUEUE handling.

This series wires the MPTCP socket up to the IPv4/IPv6 error queue
paths. It propagates RECVERR-related sockopts to existing and future
subflows, makes poll() report pending errqueue activity through the
parent socket, and allows recvmsg(MSG_ERRQUEUE) on the MPTCP socket to
consume queued errors with the parent socket ABI.

The series also handles mixed-family subflows by applying the matching
sockopt according to each subflow family, and avoids silently losing an
error skb if requeueing to the parent socket fails under rmem pressure.

Patch 1 propagates the RECVERR sockopts to subflows.
Patch 2 implements parent-socket MSG_ERRQUEUE handling and poll()
reporting.
Patch 3 adds selftest coverage for RECVERR sockopt round-trips and
timestamping-driven MSG_ERRQUEUE delivery on the MPTCP parent socket.

Testing:
- make -C tools/testing/selftests/net/mptcp mptcp_sockopt
- git diff --check

David Carlier (3):
  mptcp: propagate RECVERR sockopts to subflows
  mptcp: support MSG_ERRQUEUE on the parent socket
  selftests: mptcp: cover RECVERR and MSG_ERRQUEUE

 net/mptcp/protocol.c                          | 121 +++++++++++---
 net/mptcp/sockopt.c                           | 125 ++++++++++++++
 .../selftests/net/mptcp/mptcp_sockopt.c       | 152 ++++++++++++++++++
 3 files changed, 380 insertions(+), 18 deletions(-)

-- 
2.53.0

^ permalink raw reply

* Re: [PATCH net] netdevsim: Initialize all fields of ip header when building dummy sk_buff
From: Jakub Kicinski @ 2026-04-21 15:20 UTC (permalink / raw)
  To: Nikola Z. Ivanov
  Cc: andrew+netdev, davem, edumazet, pabeni, netdev, linux-kernel
In-Reply-To: <20260421073738.22110-1-zlatistiv@gmail.com>

On Tue, 21 Apr 2026 10:37:38 +0300 Nikola Z. Ivanov wrote:
> Additionally remove the now redundant zero assignments
> and reorder the remaining ones so that they more closely
> match the order of the fields as they appear in the ip header.

Doesn't matter, now that the whole thing is zero-initialized.
I don't think it's worth the noise in the git history.
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH net v3 1/1] net: hsr: limit node table growth
From: Andrew Lunn @ 2026-04-21 15:18 UTC (permalink / raw)
  To: Ren Wei
  Cc: netdev, Felix Maurer, Sebastian Andrzej Siewior, davem, edumazet,
	kuba, pabeni, horms, kees, kexinsun, luka.gejak, Arvid.Brodin,
	m-karicheri2, yuantan098, yifanwucs, tomapufckgml, bird,
	xuyuqiabc, royenheart
In-Reply-To: <3bdbe54e81bd89c1443b05500368fb45bddc3191.1776754203.git.royenheart@gmail.com>

> +static unsigned int hsr_node_table_size = 1024;
> +module_param_named(node_table_size, hsr_node_table_size, uint, 0644);
> +MODULE_PARM_DESC(node_table_size,
> +		 "Maximum number of learned entries in each HSR/PRP node table (0 = unlimited)");
> +

Please don't use module parameters. Look at other parts of the network
stack where such limits are imposed. They all use sysctl values.

    Andrew

---
pw-bot: cr

^ permalink raw reply
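
The cap described in the quoted parameter text (a maximum number of learned
entries per node table, with 0 meaning unlimited) is independent of whether
the limit is exposed as a module parameter or a sysctl. A minimal user-space
sketch of the bounded-learning rule follows; struct node_table,
TABLE_CAPACITY and table_learn() are hypothetical names, and the fixed array
stands in for the kernel's list only for illustration.

```c
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6
#define TABLE_CAPACITY 8	/* storage size for this sketch only */

/* Hypothetical stand-in for one node table. */
struct node_table {
	unsigned char macs[TABLE_CAPACITY][MAC_LEN];
	size_t used;
	size_t limit;	/* 0 = unlimited (up to the sketch's storage) */
};

/*
 * Look up a MAC and learn it if it is new.
 * Returns the entry index, or -1 when learning is refused because the
 * configured limit (or the sketch's storage) is exhausted.
 */
static int table_learn(struct node_table *t, const unsigned char *mac)
{
	size_t i;

	/* Known senders are always found, even when the table is full. */
	for (i = 0; i < t->used; i++) {
		if (!memcmp(t->macs[i], mac, MAC_LEN))
			return (int)i;
	}

	/* New senders are refused once the limit is reached. */
	if (t->limit && t->used >= t->limit)
		return -1;
	if (t->used >= TABLE_CAPACITY)
		return -1;

	memcpy(t->macs[t->used], mac, MAC_LEN);
	return (int)t->used++;
}
```

The point of the rule is that a flood of previously unseen source addresses
cannot grow the table past the limit, while traffic from already-learned
senders keeps working.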

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
From: Jakub Kicinski @ 2026-04-21 15:09 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com,
	andrew+netdev@lunn.ch, daniel.zahka@gmail.com,
	davem@davemloft.net, leon@kernel.org,
	linux-kernel@vger.kernel.org, edumazet@google.com,
	linux-rdma@vger.kernel.org, Rahul Rameshbabu, Raed Salem,
	Dragos Tatulea, kees@kernel.org, Mark Bloch, pabeni@redhat.com,
	Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org,
	Gal Pressman
In-Reply-To: <3ca1bee450608d37cd0f9199ebc44c52c084cb08.camel@nvidia.com>

On Tue, 21 Apr 2026 14:33:51 +0000 Cosmin Ratiu wrote:
> > > priv->psp and steering at the time of mlx5e_psp_register() is inert
> > > without the PSP device. Cleaning it on psp_dev_create() failure
> > > would
> > > be weird, it's cleaned up anyway on netdev teardown. The fact that
> > > only
> > > memory allocations can fail inside psp_dev_create() is irrelevant
> > > here.
> > > psp_dev_create() failing shouldn't bring down the whole netdevice,
> > > so
> > > logging a message and continuing is ok (which is what is also done
> > > for
> > > macsec and ktls).  
> > 
> > This is a misguided cargo cult. Or something motivated by OOT
> > compatibility. Alex D sometimes tries to do the same thing with Meta
> > drivers. I don't get it. Of course we want the device to be
> > operational
> > if some *device* init fails. The compatibility matrix with all device
> > generations and fw versions could justify that. But continuing init
> > when a single-page kmalloc failed is pure silliness.  
> 
> I am not sure about the wider context, but from the POV of the driver,
> it's calling $thing from the kernel which can fail and it needs to do
> something about it, either fail the entire netdev bringup or accept
> that $thing won't be functional and continue without it. The driver
> shouldn't need to know what $thing does inside and how it can fail,
> which can change over time. Today it's a kmalloc(), tomorrow it may be
> something else.

Like what?

> It doesn't and shouldn't matter for the local decision
> to continue or not without $thing working.
> 
> Isn't this reasonable?

No, the normal thing to do is to propagate errors.
If you want to diverge from that _you_ should have a reason,
a better reason than a vague "kernel can fail".
I'd prefer for the driver to fail in an obvious way.
Which will be immediately spotted by the operator, not 2 weeks
later when 10% of the fleet is upgraded already.
The only exception I'd make is to keep devlink registered in
case the fix is to flash a different FW.

^ permalink raw reply

* Re: Discuss: Future of AX25, NETROM and ROSE in the kernel ?
From: Andrew Lunn @ 2026-04-21 15:00 UTC (permalink / raw)
  To: Steven R. Loomis
  Cc: Dan Cross, hugh, Steve Conklin, Stuart Longland VK4MSL,
	linux-hams, netdev
In-Reply-To: <5880136B-67E4-4D7E-A64C-FFF5A3E4A56F@gmail.com>

On Tue, Apr 21, 2026 at 09:43:52AM -0500, Steven R. Loomis wrote:
> So, is an important next step having a repo with *just* the ham items, with its own build structure? I ask because there are a number of other modules in the repo mentioned:
> 
> https://github.com/linux-netdev/mod-orphan

Please don't top post.

This git site is just to collect the modules as out-of-tree code, so
they don't fully disappear. They will just languish here and bitrot
over time.

If there is going to be an active Maintainer, you need a full tree, so
you can send git pull requests. Once you have the tree, send us a
patch for MAINTAINERS, adding a T: entry pointing to your tree, and
set the L: entry pointing to your mailing list etc.

    Andrew

^ permalink raw reply

* [PATCH] net: ne2k-pci: fix missing residual byte in block output for 32-bit IO
From: Titouan Ameline de Cadeville @ 2026-04-21 14:57 UTC (permalink / raw)
  To: netdev
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, linux-kernel,
	Titouan Ameline de Cadeville

ne2k_pci_block_output() handles residual bytes after the main outsl()
loop when the transfer count is not a multiple of 4. It correctly
handles the 2-byte residual case with outw(), but is missing the
1-byte residual case. This means that for packets where count % 4 == 1
or count % 4 == 3, the final byte is never written to the NIC's data
port.

In practice, this is masked by the count being rounded up to a 4-byte
boundary earlier in the function for ONLY_32BIT_IO cards, but that
rounding itself causes a small information leak by sending
uninitialized kernel buffer bytes on the wire.

Add the missing outb() call for the odd byte case, mirroring what
ne2k_pci_block_input() already does correctly.

Signed-off-by: Titouan Ameline de Cadeville <titouan.ameline@gmail.com>
---
 drivers/net/ethernet/8390/ne2k-pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/8390/ne2k-pci.c b/drivers/net/ethernet/8390/ne2k-pci.c
index 1a34da07c0db..1bd5b94b5d22 100644
--- a/drivers/net/ethernet/8390/ne2k-pci.c
+++ b/drivers/net/ethernet/8390/ne2k-pci.c
@@ -632,6 +632,8 @@ static void ne2k_pci_block_output(struct net_device *dev, int count,
 				outw(le16_to_cpu(*b++), NE_BASE + NE_DATAPORT);
 				buf = (char *)b;
 			}
+			if (count & 1)
+				outb(*buf, NE_BASE + NE_DATAPORT);
 		}
 	}
 
-- 
2.44.2


^ permalink raw reply related
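
The byte accounting described in the commit message above can be checked
with a small user-space model. This is a minimal sketch that performs no
port I/O: it only counts how many bytes a transfer would cover when split
into 32-bit, 16-bit and 8-bit writes. The names io_plan, plan_output(),
plan_bytes() and round_up4() are hypothetical, and the handle_odd_byte flag
is an assumption of the sketch that toggles between the behaviour before and
after the patch.

```c
#include <stddef.h>

/* How a transfer is split into port writes (hypothetical helper type). */
struct io_plan {
	size_t dwords;	/* number of 32-bit writes */
	int word;	/* 1 if a trailing 16-bit write is issued */
	int byte;	/* 1 if a trailing 8-bit write is issued */
};

/* Split count bytes; handle_odd_byte selects whether the 1-byte tail is written. */
static struct io_plan plan_output(size_t count, int handle_odd_byte)
{
	struct io_plan p;

	p.dwords = count >> 2;
	p.word = (count & 2) != 0;
	p.byte = handle_odd_byte && (count & 1);
	return p;
}

/* Total number of bytes the plan actually writes. */
static size_t plan_bytes(struct io_plan p)
{
	return p.dwords * 4 + (p.word ? 2 : 0) + (p.byte ? 1 : 0);
}

/* Round a byte count up to the next multiple of 4. */
static size_t round_up4(size_t count)
{
	return (count + 3) & ~(size_t)3;
}
```

For a 61-byte transfer the model covers 60 bytes without the odd-byte write
and 61 with it. Rounding the count up to 64 first covers the whole packet
with 32-bit writes alone, at the cost of three bytes of padding beyond the
packet, which is the trade-off the commit message points at.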

* [PATCH net v3 1/1] net: hsr: limit node table growth
From: Ren Wei @ 2026-04-21 14:50 UTC (permalink / raw)
  To: netdev, Felix Maurer, Sebastian Andrzej Siewior
  Cc: davem, edumazet, kuba, pabeni, horms, kees, kexinsun, luka.gejak,
	Arvid.Brodin, m-karicheri2, yuantan098, yifanwucs, tomapufckgml,
	bird, xuyuqiabc, royenheart, n05ec

From: Haoze Xie <royenheart@gmail.com>

The HSR/PRP node learning paths allocate one persistent entry per
previously unseen source MAC. Since learned entries stay alive until the
prune timer catches up, the node tables can grow without bound under
high churn of learned senders.

Limit the number of learned entries in each node table and stop adding
new ones once the configured limit is reached. This keeps node-table
resource use bounded across the affected learning paths.

Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Fixes: 451d8123f897 ("net: prp: add packet handling support")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Yuqi Xu <xuyuqiabc@gmail.com>
Signed-off-by: Haoze Xie <royenheart@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
changes in v3:
- replace the v2 learning-suppression approach with direct node-table growth limiting
- add a node_table_size module parameter and stop learning new entries once each table reaches the configured limit
- fix the full-table handling so failed learning returns NULL instead of reusing an existing node
- v2 Link: https://lore.kernel.org/all/b053e938014c9bac22f7f687ecc2970f23a2b74a.1775281843.git.royenheart@gmail.com/

changes in v2:
- generalize the fix beyond PRP SAN traffic and cover HSR/PRP tagged sender floods
- decide whether learning is needed from local-exclusive delivery instead of protocol-specific SAN checks
- use the normal NULL return semantics from hsr_get_node() instead of ERR_PTR-based error plumbing
- skip duplicate-discard state checks when no node state exists
- v1 Link: https://lore.kernel.org/all/9c88b4b7844f867d065e7a7aba28b2c026386168.1775056603.git.royenheart@outlook.com/

 net/hsr/hsr_framereg.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/net/hsr/hsr_framereg.c b/net/hsr/hsr_framereg.c
index d09875b33588..8a5a2a54a81f 100644
--- a/net/hsr/hsr_framereg.c
+++ b/net/hsr/hsr_framereg.c
@@ -14,12 +14,18 @@
 #include <kunit/visibility.h>
 #include <linux/if_ether.h>
 #include <linux/etherdevice.h>
+#include <linux/moduleparam.h>
 #include <linux/slab.h>
 #include <linux/rculist.h>
 #include "hsr_main.h"
 #include "hsr_framereg.h"
 #include "hsr_netlink.h"
 
+static unsigned int hsr_node_table_size = 1024;
+module_param_named(node_table_size, hsr_node_table_size, uint, 0644);
+MODULE_PARM_DESC(node_table_size,
+		 "Maximum number of learned entries in each HSR/PRP node table (0 = unlimited)");
+
 bool hsr_addr_is_redbox(struct hsr_priv *hsr, unsigned char *addr)
 {
 	if (!hsr->redbox || !is_valid_ether_addr(hsr->macaddress_redbox))
@@ -189,6 +195,7 @@ static struct hsr_node *hsr_add_node(struct hsr_priv *hsr,
 				     enum hsr_port_type rx_port)
 {
 	struct hsr_node *new_node, *node = NULL;
+	unsigned int node_count = 0;
 	unsigned long now;
 	size_t block_sz;
 	int i;
@@ -226,20 +233,31 @@ static struct hsr_node *hsr_add_node(struct hsr_priv *hsr,
 	spin_lock_bh(&hsr->list_lock);
 	list_for_each_entry_rcu(node, node_db, mac_list,
 				lockdep_is_held(&hsr->list_lock)) {
+		node_count++;
 		if (ether_addr_equal(node->macaddress_A, addr))
-			goto out;
+			goto out_found;
 		if (ether_addr_equal(node->macaddress_B, addr))
-			goto out;
+			goto out_found;
 	}
+
+	if (hsr_node_table_size && node_count >= hsr_node_table_size)
+		goto out_drop;
 	list_add_tail_rcu(&new_node->mac_list, node_db);
 	spin_unlock_bh(&hsr->list_lock);
 	return new_node;
-out:
+out_found:
 	spin_unlock_bh(&hsr->list_lock);
+	xa_destroy(&new_node->seq_blocks);
 	kfree(new_node->block_buf);
-free:
 	kfree(new_node);
 	return node;
+out_drop:
+	spin_unlock_bh(&hsr->list_lock);
+	xa_destroy(&new_node->seq_blocks);
+	kfree(new_node->block_buf);
+free:
+	kfree(new_node);
+	return NULL;
 }
 
 void prp_update_san_info(struct hsr_node *node, bool is_sup)
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH net-next] nfp: fix swapped arguments in nfp_encode_basic_qdr() calls
From: Jakub Kicinski @ 2026-04-21 14:46 UTC (permalink / raw)
  To: Alexey Kodanev
  Cc: netdev, Simon Horman, Andrew Lunn, David S . Miller, Eric Dumazet,
	Paolo Abeni, oss-drivers
In-Reply-To: <20260421085124.147049-1-aleksei.kodanev@bell-sw.com>

On Tue, 21 Apr 2026 08:51:24 +0000 Alexey Kodanev wrote:
> Fixes: 4cb584e0ee7d ("nfp: add CPP access core")

Fixes should be tagged for net, not net-next.

> Signed-off-by: Alexey Kodanev <aleksei.kodanev@bell-sw.com>
> ---
>  drivers/net/ethernet/netronome/nfp/nfpcore/nfp_target.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_target.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_target.c
> index 79470f198a62..5c1edd143cee 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_target.c
> +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_target.c
> @@ -493,7 +493,7 @@ static int nfp_encode_basic(u64 *addr, int dest_island, int cpp_tgt,
>  			 * the address but we can verify if the existing
>  			 * contents will point to a valid island.
>  			 */
> -			return nfp_encode_basic_qdr(*addr, cpp_tgt, dest_island,
> +			return nfp_encode_basic_qdr(*addr, dest_island, cpp_tgt,
>  						    mode, addr40, isld1, isld0);

Please add warning prints to the error branches in
nfp_encode_basic_qdr() to help identify the source of failure.
Since this code worked and this is just a safety check there's
a high chance we'll break more than we fix with this.

^ permalink raw reply

* Re: Discuss: Future of AX25, NETROM and ROSE in the kernel ?
From: Steven R. Loomis @ 2026-04-21 14:43 UTC (permalink / raw)
  To: Dan Cross; +Cc: hugh, Steve Conklin, Stuart Longland VK4MSL, linux-hams, netdev
In-Reply-To: <CAEoi9W4L3WTVv5Hhiec8D8J=654h-p+Mh_Nd9bbm_cDyardsVA@mail.gmail.com>

So, is an important next step having a repo with *just* the ham items, with its own build structure? I ask because there are a number of other modules in the repo mentioned:

https://github.com/linux-netdev/mod-orphan

--
Steven R. Loomis
K6SPI


> On Apr 21, 2026, at 7:06 a.m., Dan Cross <crossd@gmail.com> wrote:
> 
> On Tue, Apr 21, 2026 at 2:28 AM Hugh Blemings <hugh@blemings.org> wrote:
>> Hi All,
>> 
>> Just to note in this thread (top posting as it's a bit orthogonal to the
>> rest of this discussion) that events have preceded us somewhat here
>> 
>> A patch just recently submitted removes the AX25, NETROM and ROSE code
>> from the kernel moving it to the mod-orphan sub tree of netdev
>> 
>> https://lore.kernel.org/netdev/20260421021824.1293976-1-kuba@kernel.org/T/#u
> 
> Wow, that happened much faster than I had anticipated.
> 
>> A shame but perhaps inevitable - but I think we have a good plan
>> unfolding to both take care of medium term maintenance of the kernel
>> code (in tree or out as it may be) as well as a move to userspace in the
>> longer term.
>> 
>> For the benefit of the netdev readership - we had a thread over in
>> linux-hams on this but that may not have been visible to folks in
>> netdev.  TL;DR: we think we have a way forward but appreciate this may
>> not be quick enough to meet the requirements/concerns put forward
>> 
>> If we can delay removal, that'd be grand, but appreciate that moment may
>> have passed.
> 
> Personally, I think this may actually turn out to be a good thing.  If
> nothing else, it's a forcing function for the ham community to get
> serious about providing an implementation that works well, and in the
> short term, an out-of-tree module can keep things working for folks
> while alternatives are investigated and prepared.
> 
> I appreciate that folks want to discuss timing, but it doesn't appear
> that there is much else to be done at this point. It would make sense
> to continue discussion of alternatives over on linux-hams, sparing the
> already-overloaded readers of the netdev list from the sordid details.
> 
>        - Dan C.
>          (KZ2X)



^ permalink raw reply

* Re: [PATCH net v2] ipv6: rpl: reserve mac_len headroom when recompressed SRH grows
From: Jakub Kicinski @ 2026-04-21 14:39 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: netdev, linux-kernel, David S. Miller, David Ahern, Eric Dumazet,
	Paolo Abeni, Simon Horman, stable
In-Reply-To: <2026042142-vanquish-unhealthy-7a85@gregkh>

On Tue, 21 Apr 2026 15:11:57 +0200 Greg Kroah-Hartman wrote:
> Crap, nope, this is wrong, let me go fix this...

Please honor the 24h between reposts rule on netdev.
Also known as the "look at the patch before you send it, not on the list" rule.

^ permalink raw reply

* Re: [PATCH net v2 1/1] net: hsr: avoid learning unknown senders for local delivery
From: Haoze Xie @ 2026-04-21 14:39 UTC (permalink / raw)
  To: Felix Maurer, Ao Zhou
  Cc: netdev, Sebastian Andrzej Siewior, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Murali Karicheri,
	Shaurya Rane, Ingo Molnar, Kees Cook, Yifan Wu, Juefei Pu,
	Yuan Tan, Xin Liu, Yuqi Xu, royenheart
In-Reply-To: <adYwjxLBBaLY52Wb@thinkpad>

On 4/8/2026 6:40 PM, Felix Maurer wrote:
> On Sat, Apr 04, 2026 at 07:30:47PM +0800, Ao Zhou wrote:
>> From: Haoze Xie <royenheart@gmail.com>
>>
>> Traffic that is directly addressed to the local HSR/PRP master can be
>> delivered locally without creating a persistent node entry. Learning one
>> node per previously unseen source MAC lets forged sender floods grow
>> node_db until the prune timer catches up.
>>
>> Determine whether a frame is locally exclusive before node lookup and
>> skip learning for unknown senders in that case. When no node state
>> exists, also skip duplicate discard checks that depend on it.
>>
>> This keeps locally-destined traffic reachable while avoiding node table
>> growth from source-MAC floods in both the PRP SAN path and the HSR/PRP
>> tagged sender paths.
> 
> I see the problem you are trying to solve here, but I don't think this
> patch provides a significant improvement over the current situation.
> Yes, this will disable learning of new nodes from regular traffic (and
> thereby completely prevent the duplicate discard algorithm from
> working). New nodes would only be learned from supervision frames. But
> nothing prevents a malicious host in the network from spoofing tons of
> supervision frames.
> 
> HSR and PRP are supposed to be used in pretty restricted network
> environments, so the whole protocol design doesn't really expect
> malicious actors in the network and doesn't provide good options to
> safeguard against misuse.
> 
> IMHO, the only real way to prevent excessive resource use on our side is
> to put a limit on these resources. In this case, limit the size of the
> node table (bonus: make that limit configurable as Paolo suggested).
> 
> Thanks,
>    Felix
> 

I agree, and have therefore dropped the v2 learning-suppression approach and
reworked the fix to directly bound node-table growth in the v3 patch.

If the node-table-limit approach in v3 looks acceptable, I can also follow
up later with refinements such as per-device configurability and better
visibility when the table is saturated.
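
For readers following the thread, the bounding idea in v3 can be sketched in
plain userspace C roughly as follows. This is only an illustration under
stated assumptions: demo_table, demo_node and demo_learn() are hypothetical
names, there is no locking or RCU, and the node is allocated after the limit
check, whereas the actual patch allocates first and frees on the drop path.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_ETH_ALEN 6

/* Hypothetical stand-in for a learned node entry. */
struct demo_node {
	unsigned char mac[DEMO_ETH_ALEN];
	struct demo_node *next;
};

/* Hypothetical stand-in for one node table; limit == 0 means unlimited. */
struct demo_table {
	struct demo_node *head;
	unsigned int limit;
};

/*
 * Return the existing entry for @mac, or learn a new one.
 * Returns NULL when the table is full or allocation fails, so a
 * failed learn never hands back some unrelated existing node.
 */
static struct demo_node *demo_learn(struct demo_table *t,
				    const unsigned char *mac)
{
	struct demo_node **tail;
	struct demo_node *n;
	unsigned int count = 0;

	assert(t && mac);

	/* Walk the list once: look for a match and count entries. */
	for (tail = &t->head; *tail; tail = &(*tail)->next) {
		if (memcmp((*tail)->mac, mac, DEMO_ETH_ALEN) == 0)
			return *tail;
		count++;
	}

	/* Stop learning once the configured limit is reached. */
	if (t->limit && count >= t->limit)
		return NULL;

	n = calloc(1, sizeof(*n));
	if (!n)
		return NULL;
	memcpy(n->mac, mac, DEMO_ETH_ALEN);
	*tail = n;	/* append at the tail, like list_add_tail */
	return n;
}
```

With a limit of 2, a third previously unseen address is refused, while
lookups of already-learned addresses keep working.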

Best regards,
Haoze Xie

> 
>> Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
>> Fixes: 451d8123f897 ("net: prp: add packet handling support")
>> Reported-by: Yifan Wu <yifanwucs@gmail.com>
>> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
>> Co-developed-by: Yuan Tan <yuantan098@gmail.com>
>> Signed-off-by: Yuan Tan <yuantan098@gmail.com>
>> Suggested-by: Xin Liu <bird@lzu.edu.cn>
>> Tested-by: Yuqi Xu <xuyuqiabc@gmail.com>
>> Signed-off-by: Haoze Xie <royenheart@gmail.com>
>> Signed-off-by: Ao Zhou <n05ec@lzu.edu.cn>
>> ---
>> changes in v2:
>> - generalize the fix beyond PRP SAN traffic and cover HSR/PRP tagged sender floods
>> - decide whether learning is needed from local-exclusive delivery instead of protocol-specific SAN checks
>> - use the normal NULL return semantics from hsr_get_node() instead of ERR_PTR-based error plumbing
>> - skip duplicate-discard state checks when no node state exists
>>
>>  net/hsr/hsr_forward.c  | 23 +++++++++++++----------
>>  net/hsr/hsr_framereg.c |  5 ++++-
>>  net/hsr/hsr_framereg.h |  2 +-
>>  3 files changed, 18 insertions(+), 12 deletions(-)
>>
>> diff --git a/net/hsr/hsr_forward.c b/net/hsr/hsr_forward.c
>> index aefc9b6936ba..15bd17b4ee17 100644
>> --- a/net/hsr/hsr_forward.c
>> +++ b/net/hsr/hsr_forward.c
>> @@ -403,7 +403,8 @@ static void hsr_deliver_master(struct sk_buff *skb, struct net_device *dev,
>>  	int res, recv_len;
>>
>>  	was_multicast_frame = (skb->pkt_type == PACKET_MULTICAST);
>> -	hsr_addr_subst_source(node_src, skb);
>> +	if (node_src)
>> +		hsr_addr_subst_source(node_src, skb);
>>  	skb_pull(skb, ETH_HLEN);
>>  	recv_len = skb->len;
>>  	res = netif_rx(skb);
>> @@ -545,7 +546,7 @@ static void hsr_forward_do(struct hsr_frame_info *frame)
>>  		/* Don't send frame over port where it has been sent before.
>>  		 * Also for SAN, this shouldn't be done.
>>  		 */
>> -		if (!frame->is_from_san &&
>> +		if (frame->node_src && !frame->is_from_san &&
>>  		    hsr->proto_ops->register_frame_out &&
>>  		    hsr->proto_ops->register_frame_out(port, frame))
>>  			continue;
>> @@ -688,21 +689,25 @@ static int fill_frame_info(struct hsr_frame_info *frame,
>>  		return -EINVAL;
>>
>>  	memset(frame, 0, sizeof(*frame));
>> +	frame->port_rcv = port;
>>  	frame->is_supervision = is_supervision_frame(port->hsr, skb);
>>  	if (frame->is_supervision && hsr->redbox)
>>  		frame->is_proxy_supervision =
>>  			is_proxy_supervision_frame(port->hsr, skb);
>>
>> +	ethhdr = (struct ethhdr *)skb_mac_header(skb);
>> +	check_local_dest(port->hsr, skb, frame);
>> +
>>  	n_db = &hsr->node_db;
>>  	if (port->type == HSR_PT_INTERLINK)
>>  		n_db = &hsr->proxy_node_db;
>>
>>  	frame->node_src = hsr_get_node(port, n_db, skb,
>> -				       frame->is_supervision, port->type);
>> -	if (!frame->node_src)
>> -		return -1; /* Unknown node and !is_supervision, or no mem */
>> +				       frame->is_supervision, port->type,
>> +				       !frame->is_local_exclusive);
>> +	if (!frame->node_src && !frame->is_local_exclusive)
>> +		return -1;
>>
>> -	ethhdr = (struct ethhdr *)skb_mac_header(skb);
>>  	frame->is_vlan = false;
>>  	proto = ethhdr->h_proto;
>>
>> @@ -720,13 +725,10 @@ static int fill_frame_info(struct hsr_frame_info *frame,
>>  	}
>>
>>  	frame->is_from_san = false;
>> -	frame->port_rcv = port;
>>  	ret = hsr->proto_ops->fill_frame_info(proto, skb, frame);
>>  	if (ret)
>>  		return ret;
>>
>> -	check_local_dest(port->hsr, skb, frame);
>> -
>>  	return 0;
>>  }
>>
>> @@ -739,7 +741,8 @@ void hsr_forward_skb(struct sk_buff *skb, struct hsr_port *port)
>>  	if (fill_frame_info(&frame, skb, port) < 0)
>>  		goto out_drop;
>>
>> -	hsr_register_frame_in(frame.node_src, port, frame.sequence_nr);
>> +	if (frame.node_src)
>> +		hsr_register_frame_in(frame.node_src, port, frame.sequence_nr);
>>  	hsr_forward_do(&frame);
>>  	rcu_read_unlock();
>>  	/* Gets called for ingress frames as well as egress from master port.
>> diff --git a/net/hsr/hsr_framereg.c b/net/hsr/hsr_framereg.c
>> index 50996f4de7f9..2bc6f8f154c2 100644
>> --- a/net/hsr/hsr_framereg.c
>> +++ b/net/hsr/hsr_framereg.c
>> @@ -221,7 +221,7 @@ void prp_update_san_info(struct hsr_node *node, bool is_sup)
>>   */
>>  struct hsr_node *hsr_get_node(struct hsr_port *port, struct list_head *node_db,
>>  			      struct sk_buff *skb, bool is_sup,
>> -			      enum hsr_port_type rx_port)
>> +			      enum hsr_port_type rx_port, bool learn)
>>  {
>>  	struct hsr_priv *hsr = port->hsr;
>>  	struct hsr_node *node;
>> @@ -270,6 +270,9 @@ struct hsr_node *hsr_get_node(struct hsr_port *port, struct list_head *node_db,
>>  			san = true;
>>  	}
>>
>> +	if (!learn)
>> +		return NULL;
>> +
>>  	return hsr_add_node(hsr, node_db, ethhdr->h_source, san, rx_port);
>>  }
>>
>> diff --git a/net/hsr/hsr_framereg.h b/net/hsr/hsr_framereg.h
>> index c65ecb925734..3d9c88e83090 100644
>> --- a/net/hsr/hsr_framereg.h
>> +++ b/net/hsr/hsr_framereg.h
>> @@ -33,7 +33,7 @@ void hsr_del_self_node(struct hsr_priv *hsr);
>>  void hsr_del_nodes(struct list_head *node_db);
>>  struct hsr_node *hsr_get_node(struct hsr_port *port, struct list_head *node_db,
>>  			      struct sk_buff *skb, bool is_sup,
>> -			      enum hsr_port_type rx_port);
>> +			      enum hsr_port_type rx_port, bool learn);
>>  void hsr_handle_sup_frame(struct hsr_frame_info *frame);
>>  bool hsr_addr_is_self(struct hsr_priv *hsr, unsigned char *addr);
>>  bool hsr_addr_is_redbox(struct hsr_priv *hsr, unsigned char *addr);
>> --
>> 2.53.0
>>
> 


^ permalink raw reply

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
From: Cosmin Ratiu @ 2026-04-21 14:33 UTC (permalink / raw)
  To: kuba@kernel.org
  Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com,
	andrew+netdev@lunn.ch, daniel.zahka@gmail.com,
	davem@davemloft.net, leon@kernel.org,
	linux-kernel@vger.kernel.org, edumazet@google.com,
	linux-rdma@vger.kernel.org, Rahul Rameshbabu, Raed Salem,
	Dragos Tatulea, kees@kernel.org, Mark Bloch, pabeni@redhat.com,
	Tariq Toukan, Saeed Mahameed, netdev@vger.kernel.org,
	Gal Pressman
In-Reply-To: <20260421072609.4b15e7b9@kernel.org>

On Tue, 2026-04-21 at 07:26 -0700, Jakub Kicinski wrote:
> On Tue, 21 Apr 2026 12:29:13 +0000 Cosmin Ratiu wrote:
> > > Sure but why are you leaving the priv->psp struct in place and
> > > whatever
> > > FS init has been done? IOW if you really want PSP init to not
> > > block
> > > probe why is mlx5e_psp_register() a void function rather than
> > > mlx5e_psp_init() ? Ignoring errors from psp_dev_create()
> > > makes no sense to me - what are you protecting from?
> > > kmalloc(GFP_KERNEL)
> > > failing?  
> > 
> > priv->psp and steering at the time of mlx5e_psp_register() is inert
> > without the PSP device. Cleaning it on psp_dev_create() failure
> > would
> > be weird, it's cleaned up anyway on netdev teardown. The fact that
> > only
> > memory allocations can fail inside psp_dev_create() is irrelevant
> > here.
> > psp_dev_create() failing shouldn't bring down the whole netdevice,
> > so
> > logging a message and continuing is ok (which is what is also done
> > for
> > macsec and ktls).
> 
> This is a misguided cargo cult. Or something motivated by OOT
> compatibility. Alex D sometimes tries to do the same thing with Meta
> drivers. I don't get it. Of course we want the device to be
> operational
> if some *device* init fails. The compatibility matrix with all device
> generations and fw versions could justify that. But continuing init
> when a single-page kmalloc failed is pure silliness.

I am not sure about the wider context, but from the POV of the driver,
it's calling $thing from the kernel which can fail and it needs to do
something about it, either fail the entire netdev bringup or accept
that $thing won't be functional and continue without it. The driver
shouldn't need to know what $thing does inside and how it can fail,
which can change over time. Today it's a kmalloc(), tomorrow it may be
something else. It doesn't and shouldn't matter for the local decision
to continue or not without $thing working.

Isn't this reasonable?

> 
> > mlx5e_psp_register() is void because it's called from
> > mlx5e_nic_enable() which can't fail, so it really can't do much
> > other
> > than complain to dmesg.
> > 
> > But while thinking about this, I suppose we could change the entire
> > PSP
> > initialization to happen at the time of the current
> > mlx5e_psp_register(), and that would simplify the number of states.
> > I will do that in the next planned PSP series for net-next.
> > 
> > Meanwhile, could you please take the 2nd patch and leave this one
> > out?
> > It should apply with no conflicts by itself.
> > 
> > Or you would like to see a separate submission with the 2nd patch
> > alone?
> 
> Please resubmit.


^ permalink raw reply

* [PATCH net] net_sched: sch_hhf: annotate data-races in hhf_dump_stats()
From: Eric Dumazet @ 2026-04-21 14:33 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Jiri Pirko, netdev, eric.dumazet,
	Eric Dumazet

hhf_dump_stats() only runs with RTNL held,
reading fields that can be changed in qdisc fast path.

Add READ_ONCE()/WRITE_ONCE() annotations.

Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/sched/sch_hhf.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index 95e5d9bfd9c8c0cac08e080b8f1e0332e722aa3b..96021f52d835b56339509565ca03fe796593e231 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -198,7 +198,8 @@ static struct hh_flow_state *seek_list(const u32 hash,
 				return NULL;
 			list_del(&flow->flowchain);
 			kfree(flow);
-			q->hh_flows_current_cnt--;
+			WRITE_ONCE(q->hh_flows_current_cnt,
+				   q->hh_flows_current_cnt - 1);
 		} else if (flow->hash_id == hash) {
 			return flow;
 		}
@@ -226,7 +227,7 @@ static struct hh_flow_state *alloc_new_hh(struct list_head *head,
 	}
 
 	if (q->hh_flows_current_cnt >= q->hh_flows_limit) {
-		q->hh_flows_overlimit++;
+		WRITE_ONCE(q->hh_flows_overlimit, q->hh_flows_overlimit + 1);
 		return NULL;
 	}
 	/* Create new entry. */
@@ -234,7 +235,7 @@ static struct hh_flow_state *alloc_new_hh(struct list_head *head,
 	if (!flow)
 		return NULL;
 
-	q->hh_flows_current_cnt++;
+	WRITE_ONCE(q->hh_flows_current_cnt, q->hh_flows_current_cnt + 1);
 	INIT_LIST_HEAD(&flow->flowchain);
 	list_add_tail(&flow->flowchain, head);
 
@@ -309,7 +310,7 @@ static enum wdrr_bucket_idx hhf_classify(struct sk_buff *skb, struct Qdisc *sch)
 			return WDRR_BUCKET_FOR_NON_HH;
 		flow->hash_id = hash;
 		flow->hit_timestamp = now;
-		q->hh_flows_total_cnt++;
+		WRITE_ONCE(q->hh_flows_total_cnt, q->hh_flows_total_cnt + 1);
 
 		/* By returning without updating counters in q->hhf_arrays,
 		 * we implicitly implement "shielding" (see Optimization O1).
@@ -403,7 +404,7 @@ static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		return NET_XMIT_SUCCESS;
 
 	prev_backlog = sch->qstats.backlog;
-	q->drop_overlimit++;
+	WRITE_ONCE(q->drop_overlimit, q->drop_overlimit + 1);
 	/* Return Congestion Notification only if we dropped a packet from this
 	 * bucket.
 	 */
@@ -686,10 +687,10 @@ static int hhf_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 {
 	struct hhf_sched_data *q = qdisc_priv(sch);
 	struct tc_hhf_xstats st = {
-		.drop_overlimit = q->drop_overlimit,
-		.hh_overlimit	= q->hh_flows_overlimit,
-		.hh_tot_count	= q->hh_flows_total_cnt,
-		.hh_cur_count	= q->hh_flows_current_cnt,
+		.drop_overlimit = READ_ONCE(q->drop_overlimit),
+		.hh_overlimit	= READ_ONCE(q->hh_flows_overlimit),
+		.hh_tot_count	= READ_ONCE(q->hh_flows_total_cnt),
+		.hh_cur_count	= READ_ONCE(q->hh_flows_current_cnt),
 	};
 
 	return gnet_stats_copy_app(d, &st, sizeof(st));
-- 
2.54.0.rc2.533.g4f5dca5207-goog


^ permalink raw reply related

* Re: [PATCH net-next v6 11/11] net: wangxun: implement pci_error_handlers ops
From: Lukas Wunner @ 2026-04-21 14:33 UTC (permalink / raw)
  To: Jiawen Wu
  Cc: netdev, Mengyuan Lou, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Richard Cochran, Russell King,
	Simon Horman, Michal Swiatkowski, Jacob Keller, Kees Cook,
	Joe Damato, Larysa Zaremba, Abdun Nihaal, Breno Leitao
In-Reply-To: <20260326021406.30444-12-jiawenwu@trustnetic.com>

On Thu, Mar 26, 2026 at 10:14:06AM +0800, Jiawen Wu wrote:
> +static pci_ers_result_t wx_io_slot_reset(struct pci_dev *pdev)
> +{
> +	struct wx *wx = pci_get_drvdata(pdev);
> +	pci_ers_result_t result;
> +
> +	if (pci_enable_device_mem(pdev)) {
> +		wx_err(wx, "Cannot re-enable PCI device after reset.\n");
> +		result = PCI_ERS_RESULT_DISCONNECT;
> +	} else {
> +		/* make all bar access done before reset. */
> +		smp_mb__before_atomic();
> +		clear_bit(WX_STATE_DISABLED, wx->state);
> +		pci_set_master(pdev);
> +		pci_restore_state(pdev);
> +		pci_save_state(pdev);

The pci_save_state() is no longer necessary here, please drop it.
See commits a2f1e22390ac and 383d89699c50 for details.

Thanks,

Lukas

^ permalink raw reply

* Re: [PATCH 16/23] genirq/cpuhotplug: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
From: Waiman Long @ 2026-04-21 14:29 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <87qzo8bs9m.ffs@tglx>

On 4/21/26 5:02 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>
>> As HK_TYPE_MANAGED_IRQ cpumask is going to be changeable at run time,
>> use RCU to protect access to the cpumask.
>>
>> To enable the new HK_TYPE_MANAGED_IRQ cpumask to take effect, the
>> following steps can be done.
> Can be done?
>
>>   1) Update the HK_TYPE_MANAGED_IRQ cpumask to take out the newly isolated
>>      CPUs and add back the de-isolated CPUs.
>>   2) Tear down the affected CPUs to cause irq_migrate_all_off_this_cpu()
>>      to be called on the affected CPUs to migrate the irqs to other
>>      HK_TYPE_MANAGED_IRQ housekeeping CPUs.
>>   3) Bring up the previously offline CPUs to invoke
>>      irq_affinity_online_cpu() to allow the newly de-isolated CPUs to
>>      be used for managed irqs.
> Which previously offline CPUs?
This part should go into another patch.
>
>> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>> index 2e8072437826..8270c4de260b 100644
>> --- a/kernel/irq/manage.c
>> +++ b/kernel/irq/manage.c
>> @@ -263,6 +263,7 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, bool
>>   	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
>>   		const struct cpumask *hk_mask;
>>   
>> +		guard(rcu)();
>>   		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
>>   
>>   		cpumask_and(tmp_mask, mask, hk_mask);
> How is this hunk related to $Subject?

The subject is actually about using RCU to protect access to 
housekeeping cpumask. There is extra info in the commit log that
should go to another patch.

Cheers,
Longman

>


^ permalink raw reply

* [PATCH net] net/sched: sch_pie: annotate data-races in pie_dump_stats()
From: Eric Dumazet @ 2026-04-21 14:29 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Jiri Pirko, netdev, eric.dumazet,
	Eric Dumazet

pie_dump_stats() only runs with RTNL held,
reading fields that can be changed in qdisc fast path.

Add READ_ONCE()/WRITE_ONCE() annotations.

Alternative would be to acquire the qdisc spinlock, but our long-term
goal is to make qdisc dump operations lockless as much as we can.

tc_pie_xstats fields don't need to be latched atomically,
otherwise this bug would have been caught earlier.
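
The annotation pattern can be sketched in userspace C roughly as below. The
READ_ONCE()/WRITE_ONCE() macros here are simplified stand-ins built on GNU C
__typeof__, not the kernel definitions, and demo_stats, demo_enqueue() and
demo_dump() are hypothetical names. The sketch assumes a single writer that
is serialized by other means (as the qdisc fast path is), so the writer may
read its own field plainly and only the store needs marking.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified userspace stand-ins, NOT the kernel macros. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical counters updated in the fast path. */
struct demo_stats {
	uint32_t packets_in;
	uint32_t dropped;
};

/* Writer side: updates are serialized by the caller, stores are marked. */
static void demo_enqueue(struct demo_stats *s, int accept)
{
	assert(s);
	if (accept)
		WRITE_ONCE(s->packets_in, s->packets_in + 1);
	else
		WRITE_ONCE(s->dropped, s->dropped + 1);
}

/*
 * Reader side: takes no lock, so every load is marked. Each field is
 * read once; the fields are not latched together as one atomic snapshot.
 */
static struct demo_stats demo_dump(struct demo_stats *s)
{
	struct demo_stats st = {
		.packets_in = READ_ONCE(s->packets_in),
		.dropped    = READ_ONCE(s->dropped),
	};

	return st;
}
```

The marked accesses keep the compiler from tearing, fusing or re-reading the
loads and stores; they do not add any ordering between different fields.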

Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/pie.h   |  2 +-
 net/sched/sch_pie.c | 38 +++++++++++++++++++-------------------
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/include/net/pie.h b/include/net/pie.h
index 01cbc66825a40bd21c0a044b1180cbbc346785df..1f3db0c355149b41823a891c9156cac625122031 100644
--- a/include/net/pie.h
+++ b/include/net/pie.h
@@ -104,7 +104,7 @@ static inline void pie_vars_init(struct pie_vars *vars)
 	vars->dq_tstamp = DTIME_INVALID;
 	vars->accu_prob = 0;
 	vars->dq_count = DQCOUNT_INVALID;
-	vars->avg_dq_rate = 0;
+	WRITE_ONCE(vars->avg_dq_rate, 0);
 }
 
 static inline struct pie_skb_cb *get_pie_cb(const struct sk_buff *skb)
diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index 16f3f629cb8e4be71431f7e50a278e3c7fdba8d0..fb53fbf0e328571be72b66ba4e75a938e1963422 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -90,7 +90,7 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	bool enqueue = false;
 
 	if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
-		q->stats.overlimit++;
+		WRITE_ONCE(q->stats.overlimit, q->stats.overlimit + 1);
 		goto out;
 	}
 
@@ -104,7 +104,7 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		/* If packet is ecn capable, mark it if drop probability
 		 * is lower than 10%, else drop it.
 		 */
-		q->stats.ecn_mark++;
+		WRITE_ONCE(q->stats.ecn_mark, q->stats.ecn_mark + 1);
 		enqueue = true;
 	}
 
@@ -114,15 +114,15 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		if (!q->params.dq_rate_estimator)
 			pie_set_enqueue_time(skb);
 
-		q->stats.packets_in++;
+		WRITE_ONCE(q->stats.packets_in, q->stats.packets_in + 1);
 		if (qdisc_qlen(sch) > q->stats.maxq)
-			q->stats.maxq = qdisc_qlen(sch);
+			WRITE_ONCE(q->stats.maxq, qdisc_qlen(sch));
 
 		return qdisc_enqueue_tail(skb, sch);
 	}
 
 out:
-	q->stats.dropped++;
+	WRITE_ONCE(q->stats.dropped, q->stats.dropped + 1);
 	q->vars.accu_prob = 0;
 	return qdisc_drop_reason(skb, sch, to_free, reason);
 }
@@ -267,11 +267,11 @@ void pie_process_dequeue(struct sk_buff *skb, struct pie_params *params,
 			count = count / dtime;
 
 			if (vars->avg_dq_rate == 0)
-				vars->avg_dq_rate = count;
+				WRITE_ONCE(vars->avg_dq_rate, count);
 			else
-				vars->avg_dq_rate =
+				WRITE_ONCE(vars->avg_dq_rate,
 				    (vars->avg_dq_rate -
-				     (vars->avg_dq_rate >> 3)) + (count >> 3);
+				     (vars->avg_dq_rate >> 3)) + (count >> 3));
 
 			/* If the queue has receded below the threshold, we hold
 			 * on to the last drain rate calculated, else we reset
@@ -381,7 +381,7 @@ void pie_calculate_probability(struct pie_params *params, struct pie_vars *vars,
 	if (delta > 0) {
 		/* prevent overflow */
 		if (vars->prob < oldprob) {
-			vars->prob = MAX_PROB;
+			WRITE_ONCE(vars->prob, MAX_PROB);
 			/* Prevent normalization error. If probability is at
 			 * maximum value already, we normalize it here, and
 			 * skip the check to do a non-linear drop in the next
@@ -392,7 +392,7 @@ void pie_calculate_probability(struct pie_params *params, struct pie_vars *vars,
 	} else {
 		/* prevent underflow */
 		if (vars->prob > oldprob)
-			vars->prob = 0;
+			WRITE_ONCE(vars->prob, 0);
 	}
 
 	/* Non-linear drop in probability: Reduce drop probability quickly if
@@ -403,7 +403,7 @@ void pie_calculate_probability(struct pie_params *params, struct pie_vars *vars,
 		/* Reduce drop probability to 98.4% */
 		vars->prob -= vars->prob / 64;
 
-	vars->qdelay = qdelay;
+	WRITE_ONCE(vars->qdelay, qdelay);
 	vars->backlog_old = backlog;
 
 	/* We restart the measurement cycle if the following conditions are met
@@ -502,21 +502,21 @@ static int pie_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 	struct pie_sched_data *q = qdisc_priv(sch);
 	struct tc_pie_xstats st = {
 		.prob		= q->vars.prob << BITS_PER_BYTE,
-		.delay		= ((u32)PSCHED_TICKS2NS(q->vars.qdelay)) /
+		.delay		= ((u32)PSCHED_TICKS2NS(READ_ONCE(q->vars.qdelay))) /
 				   NSEC_PER_USEC,
-		.packets_in	= q->stats.packets_in,
-		.overlimit	= q->stats.overlimit,
-		.maxq		= q->stats.maxq,
-		.dropped	= q->stats.dropped,
-		.ecn_mark	= q->stats.ecn_mark,
+		.packets_in	= READ_ONCE(q->stats.packets_in),
+		.overlimit	= READ_ONCE(q->stats.overlimit),
+		.maxq		= READ_ONCE(q->stats.maxq),
+		.dropped	= READ_ONCE(q->stats.dropped),
+		.ecn_mark	= READ_ONCE(q->stats.ecn_mark),
 	};
 
 	/* avg_dq_rate is only valid if dq_rate_estimator is enabled */
 	st.dq_rate_estimating = q->params.dq_rate_estimator;
 
 	/* unscale and return dq_rate in bytes per sec */
-	if (q->params.dq_rate_estimator)
-		st.avg_dq_rate = q->vars.avg_dq_rate *
+	if (st.dq_rate_estimating)
+		st.avg_dq_rate = READ_ONCE(q->vars.avg_dq_rate) *
 				 (PSCHED_TICKS_PER_SEC) >> PIE_SCALE;
 
 	return gnet_stats_copy_app(d, &st, sizeof(st));
-- 
2.54.0.rc2.533.g4f5dca5207-goog


^ permalink raw reply related

* Re: [PATCH net 1/2] net/mlx5e: psp: Fix invalid access on PSP dev registration fail
From: Jakub Kicinski @ 2026-04-21 14:26 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: Boris Pismenny, willemdebruijn.kernel@gmail.com,
	andrew+netdev@lunn.ch, daniel.zahka@gmail.com,
	davem@davemloft.net, leon@kernel.org, Rahul Rameshbabu,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	pabeni@redhat.com, Raed Salem, Dragos Tatulea, kees@kernel.org,
	Mark Bloch, edumazet@google.com, Tariq Toukan, Saeed Mahameed,
	netdev@vger.kernel.org, Gal Pressman
In-Reply-To: <f327ce67e69c27ed971f4ed38f46381cd2f97ec7.camel@nvidia.com>

On Tue, 21 Apr 2026 12:29:13 +0000 Cosmin Ratiu wrote:
> > Sure but why are you leaving the priv->psp struct in place and whatever
> > FS init has been done? IOW if you really want PSP init to not block
> > probe why is mlx5e_psp_register() a void function rather than
> > mlx5e_psp_init()? Ignoring errors from psp_dev_create()
> > makes no sense to me - what are you protecting from?
> > kmalloc(GFP_KERNEL) failing?
> 
> priv->psp and steering at the time of mlx5e_psp_register() are inert
> without the PSP device. Cleaning them up on psp_dev_create() failure would
> be weird; they are cleaned up anyway on netdev teardown. The fact that only
> memory allocations can fail inside psp_dev_create() is irrelevant here.
> psp_dev_create() failing shouldn't bring down the whole netdevice, so
> logging a message and continuing is ok (which is what is also done for
> macsec and ktls).

This is a misguided cargo cult. Or something motivated by OOT
compatibility. Alex D sometimes tries to do the same thing with Meta
drivers. I don't get it. Of course we want the device to be operational
if some *device* init fails. The compatibility matrix with all device
generations and fw versions could justify that. But continuing init
when a single-page kmalloc failed is pure silliness.

> mlx5e_psp_register() is void because it's called from
> mlx5e_nic_enable() which can't fail, so it really can't do much other
> than complain to dmesg.
> 
> But while thinking about this, I suppose we could change the entire PSP
> initialization to happen at the time of the current
> mlx5e_psp_register(), and that would simplify the number of states.
> I will do that in the next planned PSP series for net-next.
> 
> Meanwhile, could you please take the 2nd patch and leave this one out?
> It should apply with no conflicts by itself.
> 
> Or would you like to see a separate submission with the 2nd patch
> alone?

Please resubmit.

^ permalink raw reply

* Re: [PATCH v2 1/2] drm/drm_ras: Add clear-error-counter netlink command to drm_ras
From: Tauro, Riana @ 2026-04-21 14:25 UTC (permalink / raw)
  To: Rodrigo Vivi, maarten.lankhorst
  Cc: intel-xe, dri-devel, netdev, Zack McKevitt, joonas.lahtinen,
	aravind.iddamsetty, anshuman.gupta, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, raag.jadav,
	anvesh.bakwad, Jakub Kicinski, Lijo Lazar, Hawking Zhang,
	David S. Miller, Paolo Abeni, Eric Dumazet
In-Reply-To: <ee2681bd-5223-4957-b10e-5dbde0c0e974@intel.com>

Hi Maarten

Could you please help with an ack for this patch so we can merge this 
through drm-xe-next?

Thanks
Riana

On 4/10/2026 10:51 AM, Tauro, Riana wrote:
> Hi Rodrigo
>
> On 4/9/2026 7:07 PM, Rodrigo Vivi wrote:
>> On Thu, Apr 09, 2026 at 12:51:44PM +0530, Tauro, Riana wrote:
>>> Hi Zack
>>>
>>> Could you please take a look at this patch, if it is applicable to your
>>> use case? Please let me know if any changes are required.
>>>
>>> @Rodrigo This is already reviewed by Jakub and Raag.
>>> If there are no open issues, can this be merged via drm-misc?
>> If we push this to drm-misc-next, it might take a few weeks to propagate
>> back to drm-xe-next. With other work from you and Raag moving at a fast
>> pace on drm-xe-next around this area, I'm afraid it could cause some
>> conflicts.
>>
>> It is definitely fine by me, but another option is to get ack from
>> drm-misc maintainers to get this through drm-xe-next.
>>
>
> Yeah, this would be better, with the other RAS patches close to merging.
>
> @Maarten Can you please help with an ack if this patch looks good to you?
> This has been reviewed by Jakub from netdev and Raag from intel-xe.
> There are no other open issues.
>
> Thanks
> Riana
>
>>
>> so, really okay with drm-misc-next?
>>
>>> Thanks
>>> Riana
>>>
>>> On 4/9/2026 1:03 PM, Riana Tauro wrote:
>>>> Introduce a new 'clear-error-counter' drm_ras command to reset the 
>>>> counter
>>>> value for a specific error counter of a given node.
>>>>
>>>> The command is a 'do' netlink request with 'node-id' and 'error-id'
>>>> as parameters with no response payload.
>>>>
>>>> Usage:
>>>>
>>>> $ sudo ynl --family drm_ras  --do clear-error-counter --json \
>>>> '{"node-id":1, "error-id":1}'
>>>> None
>>>>
>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
>>>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>>> Cc: David S. Miller <davem@davemloft.net>
>>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>>> Cc: Eric Dumazet <edumazet@google.com>
>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>> Reviewed-by: Jakub Kicinski <kuba@kernel.org>
>>>> Reviewed-by: Raag Jadav <raag.jadav@intel.com>
>>>> ---
>>>>    Documentation/gpu/drm-ras.rst            |  8 +++++
>>>>    Documentation/netlink/specs/drm_ras.yaml | 13 ++++++-
>>>>    drivers/gpu/drm/drm_ras.c                | 43 +++++++++++++++++++++++-
>>>>    drivers/gpu/drm/drm_ras_nl.c             | 13 +++++++
>>>>    drivers/gpu/drm/drm_ras_nl.h             |  2 ++
>>>>    include/drm/drm_ras.h                    | 11 ++++++
>>>>    include/uapi/drm/drm_ras.h               |  1 +
>>>>    7 files changed, 89 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
>>>> index 70b246a78fc8..4636e68f5678 100644
>>>> --- a/Documentation/gpu/drm-ras.rst
>>>> +++ b/Documentation/gpu/drm-ras.rst
>>>> @@ -52,6 +52,8 @@ User space tools can:
>>>>      as a parameter.
>>>>    * Query specific error counter values with the ``get-error-counter`` command, using both
>>>>      ``node-id`` and ``error-id`` as parameters.
>>>> +* Clear specific error counters with the ``clear-error-counter`` command, using both
>>>> +  ``node-id`` and ``error-id`` as parameters.
>>>>    YAML-based Interface
>>>>    --------------------
>>>> @@ -101,3 +103,9 @@ Example: Query an error counter for a given node
>>>>        sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
>>>>        {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
>>>> +Example: Clear an error counter for a given node
>>>> +
>>>> +.. code-block:: bash
>>>> +
>>>> +    sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
>>>> +    None
>>>> diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
>>>> index 79af25dac3c5..e113056f8c01 100644
>>>> --- a/Documentation/netlink/specs/drm_ras.yaml
>>>> +++ b/Documentation/netlink/specs/drm_ras.yaml
>>>> @@ -99,7 +99,7 @@ operations:
>>>>          flags: [admin-perm]
>>>>          do:
>>>>            request:
>>>> -          attributes:
>>>> +          attributes: &id-attrs
>>>>                - node-id
>>>>                - error-id
>>>>            reply:
>>>> @@ -113,3 +113,14 @@ operations:
>>>>                - node-id
>>>>            reply:
>>>>              attributes: *errorinfo
>>>> +    -
>>>> +      name: clear-error-counter
>>>> +      doc: >-
>>>> +           Clear error counter for a given node.
>>>> +           The request includes the error-id and node-id of the
>>>> +           counter to be cleared.
>>>> +      attribute-set: error-counter-attrs
>>>> +      flags: [admin-perm]
>>>> +      do:
>>>> +        request:
>>>> +          attributes: *id-attrs
>>>> diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
>>>> index b2fa5ab86d87..d6eab29a1394 100644
>>>> --- a/drivers/gpu/drm/drm_ras.c
>>>> +++ b/drivers/gpu/drm/drm_ras.c
>>>> @@ -26,7 +26,7 @@
>>>>     * efficient lookup by ID. Nodes can be registered or unregistered
>>>>     * dynamically at runtime.
>>>>     *
>>>> - * A Generic Netlink family `drm_ras` exposes two main operations to
>>>> + * A Generic Netlink family `drm_ras` exposes the below operations to
>>>>     * userspace:
>>>>     *
>>>>     * 1. LIST_NODES: Dump all currently registered RAS nodes.
>>>> @@ -37,6 +37,10 @@
>>>>     *    Returns all counters of a node if only Node ID is provided or specific
>>>>     *    error counters.
>>>>     *
>>>> + * 3. CLEAR_ERROR_COUNTER: Clear error counter of a given node.
>>>> + *    Userspace must provide Node ID, Error ID.
>>>> + *    Clears specific error counter of a node if supported.
>>>> + *
>>>>     * Node registration:
>>>>     *
>>>>     * - drm_ras_node_register(): Registers a new node and assigns
>>>> @@ -66,6 +70,8 @@
>>>>     *   operation, fetching all counters from a specific node.
>>>>     * - drm_ras_nl_get_error_counter_doit(): Implements the GET_ERROR_COUNTER doit
>>>>     *   operation, fetching a counter value from a specific node.
>>>> + * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
>>>> + *   operation, clearing a counter value from a specific node.
>>>>     */
>>>>    static DEFINE_XARRAY_ALLOC(drm_ras_xa);
>>>> @@ -314,6 +320,41 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
>>>>        return doit_reply_value(info, node_id, error_id);
>>>>    }
>>>> +/**
>>>> + * drm_ras_nl_clear_error_counter_doit() - Clear an error counter of a node
>>>> + * @skb: Netlink message buffer
>>>> + * @info: Generic Netlink info containing attributes of the request
>>>> + *
>>>> + * Extracts the node ID and error ID from the netlink attributes and
>>>> + * clears the current value.
>>>> + *
>>>> + * Return: 0 on success, or negative errno on failure.
>>>> + */
>>>> +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
>>>> +                    struct genl_info *info)
>>>> +{
>>>> +    struct drm_ras_node *node;
>>>> +    u32 node_id, error_id;
>>>> +
>>>> +    if (!info->attrs ||
>>>> +        GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
>>>> +        GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
>>>> +        return -EINVAL;
>>>> +
>>>> +    node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
>>>> +    error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
>>>> +
>>>> +    node = xa_load(&drm_ras_xa, node_id);
>>>> +    if (!node || !node->clear_error_counter)
>>>> +        return -ENOENT;
>>>> +
>>>> +    if (error_id < node->error_counter_range.first ||
>>>> +        error_id > node->error_counter_range.last)
>>>> +        return -EINVAL;
>>>> +
>>>> +    return node->clear_error_counter(node, error_id);
>>>> +}
>>>> +
>>>>    /**
>>>>     * drm_ras_node_register() - Register a new RAS node
>>>>     * @node: Node structure to register
>>>> diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
>>>> index 16803d0c4a44..dea1c1b2494e 100644
>>>> --- a/drivers/gpu/drm/drm_ras_nl.c
>>>> +++ b/drivers/gpu/drm/drm_ras_nl.c
>>>> @@ -22,6 +22,12 @@ static const struct nla_policy drm_ras_get_error_counter_dump_nl_policy[DRM_RAS_
>>>>        [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>>>>    };
>>>> +/* DRM_RAS_CMD_CLEAR_ERROR_COUNTER - do */
>>>> +static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
>>>> +    [DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
>>>> +    [DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
>>>> +};
>>>> +
>>>>    /* Ops table for drm_ras */
>>>>    static const struct genl_split_ops drm_ras_nl_ops[] = {
>>>>        {
>>>> @@ -43,6 +49,13 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
>>>>            .maxattr    = DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID,
>>>>            .flags        = GENL_ADMIN_PERM | GENL_CMD_CAP_DUMP,
>>>>        },
>>>> +    {
>>>> +        .cmd        = DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
>>>> +        .doit        = drm_ras_nl_clear_error_counter_doit,
>>>> +        .policy        = drm_ras_clear_error_counter_nl_policy,
>>>> +        .maxattr    = DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
>>>> +        .flags        = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
>>>> +    },
>>>>    };
>>>>    struct genl_family drm_ras_nl_family __ro_after_init = {
>>>> diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
>>>> index 06ccd9342773..a398643572a5 100644
>>>> --- a/drivers/gpu/drm/drm_ras_nl.h
>>>> +++ b/drivers/gpu/drm/drm_ras_nl.h
>>>> @@ -18,6 +18,8 @@ int drm_ras_nl_get_error_counter_doit(struct sk_buff *skb,
>>>>                          struct genl_info *info);
>>>>    int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
>>>>                        struct netlink_callback *cb);
>>>> +int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
>>>> +                    struct genl_info *info);
>>>>    extern struct genl_family drm_ras_nl_family;
>>>> diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
>>>> index 5d50209e51db..f2a787bc4f64 100644
>>>> --- a/include/drm/drm_ras.h
>>>> +++ b/include/drm/drm_ras.h
>>>> @@ -58,6 +58,17 @@ struct drm_ras_node {
>>>>        int (*query_error_counter)(struct drm_ras_node *node, u32 error_id,
>>>>                       const char **name, u32 *val);
>>>> +    /**
>>>> +     * @clear_error_counter:
>>>> +     *
>>>> +     * This callback is used by drm_ras to clear a specific error counter.
>>>> +     * Driver should implement this callback to support clearing error counters
>>>> +     * of a node.
>>>> +     *
>>>> +     * Returns: 0 on success, negative error code on failure.
>>>> +     */
>>>> +    int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
>>>> +
>>>>        /** @priv: Driver private data */
>>>>        void *priv;
>>>>    };
>>>> diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
>>>> index 5f40fa5b869d..218a3ee86805 100644
>>>> --- a/include/uapi/drm/drm_ras.h
>>>> +++ b/include/uapi/drm/drm_ras.h
>>>> @@ -41,6 +41,7 @@ enum {
>>>>    enum {
>>>>        DRM_RAS_CMD_LIST_NODES = 1,
>>>>        DRM_RAS_CMD_GET_ERROR_COUNTER,
>>>> +    DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
>>>>        __DRM_RAS_CMD_MAX,
>>>>        DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)

^ permalink raw reply
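
The order of checks in the quoted drm_ras_nl_clear_error_counter_doit() (node lookup, optional callback, ID range, then dispatch) can be sketched in plain userspace C. This is only an illustration under stated assumptions: a fixed array stands in for the xarray lookup, and every `demo_*` name and the `counters[]` backing store are hypothetical, not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct drm_ras_node. */
struct demo_node {
	uint32_t first, last;     /* inclusive range of valid error IDs */
	uint32_t counters[8];     /* hypothetical backing store */
	int (*clear_error_counter)(struct demo_node *node, uint32_t error_id);
};

/* Example driver callback: zero the counter selected by error_id. */
static int demo_clear_cb(struct demo_node *node, uint32_t error_id)
{
	node->counters[error_id - node->first] = 0;
	return 0;
}

/*
 * Mirrors the order of checks in the quoted doit handler:
 * unknown node or missing callback -> -ENOENT,
 * error_id outside the node's range -> -EINVAL,
 * otherwise the driver callback decides the result.
 */
static int demo_clear_error_counter(struct demo_node **table, size_t n,
				    uint32_t node_id, uint32_t error_id)
{
	/* The array lookup stands in for xa_load() in the patch. */
	struct demo_node *node = node_id < n ? table[node_id] : NULL;

	if (!node || !node->clear_error_counter)
		return -ENOENT;

	if (error_id < node->first || error_id > node->last)
		return -EINVAL;

	return node->clear_error_counter(node, error_id);
}
```

A node that leaves the callback unset is reported the same way as a node that does not exist, which matches the combined `!node || !node->clear_error_counter` test in the patch.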

* Re: [PATCH 10/23] cpu: Use RCU to protect access of HK_TYPE_TIMER cpumask
From: Waiman Long @ 2026-04-21 14:25 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <87wly0bsh5.ffs@tglx>

On 4/21/26 4:57 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>> As HK_TYPE_TIMER cpumask is going to be changeable at run time, use
>> RCU to protect access to the cpumask.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cpu.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index bc4f7a9ba64e..0d02b5d7a7ba 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -1890,6 +1890,8 @@ int freeze_secondary_cpus(int primary)
>>   	cpu_maps_update_begin();
>>   	if (primary == -1) {
>>   		primary = cpumask_first(cpu_online_mask);
>> +
>> +		guard(rcu)();
>>   		if (!housekeeping_cpu(primary, HK_TYPE_TIMER))
>>   			primary = housekeeping_any_cpu(HK_TYPE_TIMER);
> housekeeping_cpu() and housekeeping_any_cpu() can operate on two
> different CPU masks once the runtime update is enabled.
>
> Seriously?

Good point, will fix that in the next version.

Cheers,
Longman


^ permalink raw reply
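
The review comment above is that the membership test and the fallback selection may each look at a different mask once the mask can change at run time. One possible shape of a fix, sketched in userspace C, is to take a single snapshot and make both decisions from it. Everything here is an assumption made for illustration: the 64-bit mask, the atomic pointer standing in for an RCU-protected pointer, and all `demo_*` names. It is not the change planned for the next version of the series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-CPU mask, published through a pointer that may be
 * swapped at run time by an updater. */
static _Atomic(const uint64_t *) demo_hk_mask;

/* Is `cpu` (0..63) set in this particular snapshot? */
static int demo_mask_test(const uint64_t *mask, int cpu)
{
	return (int)((*mask >> cpu) & 1);
}

/* First CPU set in this particular snapshot, or -1 if it is empty. */
static int demo_mask_first(const uint64_t *mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if ((*mask >> cpu) & 1)
			return cpu;
	return -1;
}

/*
 * Keep `primary` if it is a housekeeping CPU, else fall back to the
 * first housekeeping CPU. The pointer is loaded once, so the test and
 * the fallback cannot disagree about which mask they are looking at.
 */
static int demo_pick_primary(int primary)
{
	const uint64_t *mask = atomic_load(&demo_hk_mask);

	if (demo_mask_test(mask, primary))
		return primary;
	return demo_mask_first(mask);
}
```

With two independent loads, an update landing between them could reject `primary` against the old mask and then pick a fallback from the new one.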

* Re: [PATCH net] net: ipv6: fix NOREF dst use in seg6 and rpl lwtunnels
From: Simon Horman @ 2026-04-21 14:25 UTC (permalink / raw)
  To: Andrea Mayer
  Cc: davem, dsahern, edumazet, kuba, pabeni, bigeasy, clrkwllms,
	rostedt, david.lebrun, alex.aring, stefano.salsano, netdev,
	linux-rt-devel, linux-kernel, stable
In-Reply-To: <20260421094735.20997-1-andrea.mayer@uniroma2.it>

On Tue, Apr 21, 2026 at 11:47:35AM +0200, Andrea Mayer wrote:
> seg6_input_core() and rpl_input() call ip6_route_input() which sets a
> NOREF dst on the skb, then pass it to dst_cache_set_ip6() invoking
> dst_hold() unconditionally.
> On PREEMPT_RT, ksoftirqd is preemptible and a higher-priority task can
> release the underlying pcpu_rt between the lookup and the caching
> through a concurrent FIB lookup on a shared nexthop.
> Simplified race sequence:
> 
>   ksoftirqd/X                       higher-prio task (same CPU X)
>   -----------                       --------------------------------
>   seg6_input_core(,skb)/rpl_input(skb)
>     dst_cache_get()
>       -> miss
>     ip6_route_input(skb)
>       -> ip6_pol_route(,skb,flags)
>          [RT6_LOOKUP_F_DST_NOREF in flags]
>         -> FIB lookup resolves fib6_nh
>            [nhid=N route]
>         -> rt6_make_pcpu_route()
>            [creates pcpu_rt, refcount=1]
>              pcpu_rt->sernum = fib6_sernum
>              [fib6_sernum=W]
>            -> cmpxchg(fib6_nh.rt6i_pcpu,
>                       NULL, pcpu_rt)
>               [slot was empty, store succeeds]
>       -> skb_dst_set_noref(skb, dst)
>          [dst is pcpu_rt, refcount still 1]
> 
>                                     rt_genid_bump_ipv6()
>                                       -> bumps fib6_sernum
>                                          [fib6_sernum from W to Z]
>                                     ip6_route_output()
>                                       -> ip6_pol_route()
>                                         -> FIB lookup resolves fib6_nh
>                                            [nhid=N]
>                                         -> rt6_get_pcpu_route()
>                                              pcpu_rt->sernum != fib6_sernum
>                                              [W <> Z, stale]
>                                           -> prev = xchg(rt6i_pcpu, NULL)
>                                           -> dst_release(prev)
>                                              [prev is pcpu_rt,
>                                               refcount 1->0, dead]
> 
>     dst = skb_dst(skb)
>     [dst is the dead pcpu_rt]
>     dst_cache_set_ip6(dst)
>       -> dst_hold() on dead dst
>       -> WARN / use-after-free
> 
> For the race to occur, ksoftirqd must be preemptible (PREEMPT_RT without
> PREEMPT_RT_NEEDS_BH_LOCK) and a concurrent task must be able to release
> the pcpu_rt. Shared nexthop objects provide such a path, as two routes
> pointing to the same nhid share the same fib6_nh and its rt6i_pcpu
> entry.
> 
> Fix seg6_input_core() and rpl_input() by calling skb_dst_force() after
> ip6_route_input() to force the NOREF dst into a refcounted one before
> caching.
> The output path is not affected as ip6_route_output() already returns a
> refcounted dst.
> 
> Fixes: af4a2209b134 ("ipv6: sr: use dst_cache in seg6_input")
> Fixes: a7a29f9c361f ("net: ipv6: add rpl sr tunnel")
> Cc: stable@vger.kernel.org
> Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>

Reviewed-by: Simon Horman <horms@kernel.org>


^ permalink raw reply
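
The fix described above turns a borrowed (NOREF) dst into an owned one before it is cached. The following single-threaded userspace sketch models only that conversion. It assumes a plain integer refcount and hypothetical `demo_*` types; the kernel uses atomic reference counts and RCU, and a dst whose count reached zero may already have been freed, which this model does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dst entry. */
struct demo_dst {
	int refcnt;
};

/* Hypothetical stand-in for the dst slot of an skb. */
struct demo_skb {
	struct demo_dst *dst;
	bool noref;          /* true: pointer is borrowed, no reference held */
};

/* Take a reference only if the object is still alive. */
static bool demo_dst_hold_safe(struct demo_dst *dst)
{
	if (dst->refcnt == 0)
		return false;
	dst->refcnt++;
	return true;
}

/*
 * Convert a borrowed dst into an owned one. If the object already
 * died, drop the pointer so that nothing downstream can cache it.
 */
static bool demo_skb_dst_force(struct demo_skb *skb)
{
	if (skb->dst && skb->noref) {
		if (!demo_dst_hold_safe(skb->dst))
			skb->dst = NULL;
		skb->noref = false;
	}
	return skb->dst != NULL;
}

/* Caching keeps the pointer around, so it takes its own reference. */
static void demo_cache_set(struct demo_dst **slot, struct demo_dst *dst)
{
	dst->refcnt++;
	*slot = dst;
}
```

In this model a caller would cache the dst only when demo_skb_dst_force() returns true, so the unconditional hold in demo_cache_set() never runs on a dead object.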

* [PATCH net] net/sched: sch_fq_codel: remove data-races from fq_codel_dump_stats()
From: Eric Dumazet @ 2026-04-21 14:25 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Jiri Pirko, netdev, eric.dumazet,
	Eric Dumazet

fq_codel_dump_stats() acquires the qdisc spinlock a bit too late.

Move this acquisition before we fill st.qdisc_stats with live data.

Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/sched/sch_fq_codel.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 2a3d758f67ab43d17128442fd8b51c6ba7775d52..0664b2f2d6f28041e5250a44fc92311116ae0cf1 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -585,6 +585,8 @@ static int fq_codel_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 	};
 	struct list_head *pos;
 
+	sch_tree_lock(sch);
+
 	st.qdisc_stats.maxpacket = q->cstats.maxpacket;
 	st.qdisc_stats.drop_overlimit = q->drop_overlimit;
 	st.qdisc_stats.ecn_mark = q->cstats.ecn_mark;
@@ -593,7 +595,6 @@ static int fq_codel_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 	st.qdisc_stats.memory_usage  = q->memory_usage;
 	st.qdisc_stats.drop_overmemory = q->drop_overmemory;
 
-	sch_tree_lock(sch);
 	list_for_each(pos, &q->new_flows)
 		st.qdisc_stats.new_flows_len++;
 
-- 
2.54.0.rc2.533.g4f5dca5207-goog


^ permalink raw reply related
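
The patch above moves the lock acquisition so that it precedes the first read of live data. A minimal userspace sketch of the same ordering follows, assuming a pthread mutex in place of the qdisc tree lock and hypothetical `demo_*` names and fields.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical qdisc with two counters guarded by one lock. */
struct demo_qdisc {
	pthread_mutex_t lock;      /* stands in for the qdisc tree lock */
	uint32_t drop_overlimit;
	uint32_t new_flows_len;
};

/* Hypothetical snapshot returned to the stats consumer. */
struct demo_snapshot {
	uint32_t drop_overlimit;
	uint32_t new_flows_len;
};

/* Writer: updates both counters while holding the lock. */
static void demo_account(struct demo_qdisc *q)
{
	pthread_mutex_lock(&q->lock);
	q->drop_overlimit++;
	q->new_flows_len++;
	pthread_mutex_unlock(&q->lock);
}

/*
 * Reader: the lock is taken before the first field is copied, so the
 * whole snapshot comes from one consistent state of the qdisc.
 */
static struct demo_snapshot demo_dump_stats(struct demo_qdisc *q)
{
	struct demo_snapshot st;

	pthread_mutex_lock(&q->lock);
	st.drop_overlimit = q->drop_overlimit;
	st.new_flows_len = q->new_flows_len;
	pthread_mutex_unlock(&q->lock);

	return st;
}
```

If the lock were taken only around the second copy, a writer could update `drop_overlimit` while the reader was loading it, which is the kind of data race the patch title refers to.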

