public inbox for netdev@vger.kernel.org
* Re: Reply: [PATCH v4 net-next 02/11] net/nebula-matrix: add our driver architecture
From: Andrew Lunn @ 2026-02-07 17:19 UTC (permalink / raw)
  To: Illusion Wang
  Cc: Dimon, Alvin, Sam, netdev, andrew+netdev, corbet, kuba, linux-doc,
	lorenzo, pabeni, horms, vadim.fedorenko, lukas.bulwahn, edumazet,
	open list
In-Reply-To: <8641f978-76d5-464f-a312-414bd913c918.illusion.wang@nebula-matrix.com>

On Fri, Feb 06, 2026 at 05:26:35PM +0800, Illusion Wang wrote:
> Last time sam had a question
> "
> Thank you for your feedback. You might have misunderstood me.
> Our difficulties lie in the following:
> 1. Assuming only the mainline version changes the name (Assume name "nbl"),
>    and our regularly released driver doesn't change its name, then when
>    customers upgrade to a new kernel (containing the "nbl" driver),
>    and then want to update our regularly released driver (named "nbl_core"),
>    the module (ko) conflict will occur.
> 2. If both our mainline and regularly released drivers change their names,
>    then customers who are already using the "nbl_core" driver will also
>    encounter conflict issues when updating to the new driver "nbl".
> 
> Is it possible to do this: our net driver is also modified to be a driver based
> on the auxiliary bus, while the PCIe driver only handles PCIe-related processing,
> and these two drivers share a single kernel module (ko), namely "nbl_core"?"
> 
> There's no conclusion to this issue yet, so I haven't modified the 'core' parts for now
> (as mentioned in patch0)

This is all open source, you can do whatever you want with a fork of
Linux and out of tree drivers. Mainline has no influence over what
you do in your out of tree driver. So for Mainline, your out of tree
vendor driver does not really exist, any problems with it are yours to
solve.

However, Mainline cares about Mainline. We expect drivers which get
merged to follow Mainline design principles, look like other mainline
drivers, and use naming consistent with other Mainline drivers.

You should also think about how this driver is going to be merged. It
is going to be in small pieces. It is very unlikely the first merged
patchset is actually useful for customers. You probably need quite a
few patchsets merged before the driver is useful. If you have customers
who use Linus releases, they are going to have to deal with this WIP
driver. Such customers will be building the kernel themselves, so can
leave the in tree module out of the build. However, do most of your
customers use a distribution? A distribution is not going to update
its kernel until the next LTS kernel is released, sometime in
December. By then, you might have something usable in Mainline, and
the vendor driver is not needed. Or you might still be in the process
of rewriting the driver to Mainline standards and it is not
usable. Your customers then need to handle removing the mainline
driver and use the vendor driver. Again, that is not Mainline's
problem.

So, if your "core" driver is purely core, you can call it core, and
give it an empty tristate. The other drivers which are layered on top
of it can then select it.

If your "core" driver is actually an Ethernet driver, please drop the
name core.

     Andrew


* Re: [PATCH v6.1-v6.12 ] ipv6: use RCU in ip6_xmit()
From: Greg KH @ 2026-02-07 15:23 UTC (permalink / raw)
  To: Keerthana K
  Cc: stable, davem, yoshfuji, dsahern, edumazet, kuba, pabeni, kafai,
	weiwan, netdev, linux-kernel, ajay.kaher, alexey.makhalov,
	vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu, Sasha Levin
In-Reply-To: <20260205074722.2091297-1-keerthana.kalyanasundaram@broadcom.com>

On Thu, Feb 05, 2026 at 07:47:22AM +0000, Keerthana K wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> [ Upstream commit 9085e56501d93af9f2d7bd16f7fcfacdde47b99c ]
> 
> Use RCU in ip6_xmit() in order to use dst_dev_rcu() to prevent
> possible UAF.
> 
> Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: David Ahern <dsahern@kernel.org>
> Link: https://patch.msgid.link/20250828195823.3958522-4-edumazet@google.com
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> Signed-off-by: Keerthana K <keerthana.kalyanasundaram@broadcom.com>
> ---
>  net/ipv6/ip6_output.c | 35 +++++++++++++++++++++--------------
>  1 file changed, 21 insertions(+), 14 deletions(-)

Does not apply to 6.12.y :(


* Re: [tc] Invalid JSON output from hfsc qdisc
From: Andrew Lunn @ 2026-02-07 15:09 UTC (permalink / raw)
  To: Deren Teo; +Cc: jhs@mojatatu.com, netdev@vger.kernel.org
In-Reply-To: <SI2PPF4F82E9256898C9826AF17C3AE8AD9F467A@SI2PPF4F82E9256.apcprd04.prod.outlook.com>

On Sat, Feb 07, 2026 at 02:01:01AM +0000, Deren Teo wrote:
> Dear Jamal,
> 
> I apologise if you have received this email twice. My first attempt mistakenly contained an HTML part and so was not delivered to the netdev mailing list. The original content follows.
> 
> 
> The JSON output for the hfsc qdisc is not parseable when a default class is specified; `hfsc_print_opt` uses `fprintf` to print the default class.
>  
> 
> Environment:
> 
> - Fedora 42
> - Linux 6.19.0-rc4+
> - iproute-tc-6.12.0-3.fc42.x86_64
> 
> Reproduce:
> 
> 1. Add a hfsc qdisc with any default class. For example, from the tc-hfsc man page:
>     # tc qdisc add dev eth0 root handle 1:0 hfsc default 1
> 
> 2. Show the qdisc with the `-j` flag:
>     # tc -j qdisc show dev eth0
> 
> Actual results:
> 
> [{"kind":"hfsc","handle":"1:","root":true,"refcnt":17,"options":{default 1 }}]
> 
> Expected results:
> 
> [{"kind":"hfsc","handle":"1:","root":true,"refcnt":17,"options":{"default":1}}]

Just an idea...

Could the self tests in tools/testing/selftests/tc-testing/tc-tests be
extended to dump the configuration in json and test the output is at
least readable by the python json parser? That should catch simple
formatting errors like this.

     Andrew



* Re: [PATCH bpf] bpf, sockmap: Fix af_unix null-ptr-deref in proto update
From: Michal Luczaj @ 2026-02-07 14:37 UTC (permalink / raw)
  To: Martin KaFai Lau, Kuniyuki Iwashima, Jakub Sitnicki,
	John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Daniel Borkmann, netdev, bpf, linux-kernel
In-Reply-To: <7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev>

On 2/2/26 20:15, Martin KaFai Lau wrote:
> Regardless, if the proper lock is held, all this complication and 
> reasoning will go away.

Here's my attempt:
https://lore.kernel.org/bpf/20260207-unix-proto-update-null-ptr-deref-v2-0-9f091330e7cd@rbox.co/



* [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
From: Michal Luczaj @ 2026-02-07 14:34 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Kuniyuki Iwashima,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
	Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
	Eduard Zingerman, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260207-unix-proto-update-null-ptr-deref-v2-0-9f091330e7cd@rbox.co>

unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
socket is properly set up, which would include having a defined peer. IOW,
there's a window when unix_stream_bpf_update_proto() can be called on
socket which still has unix_peer(sk) == NULL.

          T0 bpf                            T1 connect
          ------                            ----------

                                WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
                                sock_hold(newsk)
                                smp_mb__after_atomic()
                                unix_peer(sk) = newsk

BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
  sock_map_link+0x564/0x8b0
  sock_map_update_common+0x6e/0x340
  sock_map_update_elem_sys+0x17d/0x240
  __sys_bpf+0x26db/0x3250
  __x64_sys_bpf+0x21/0x30
  do_syscall_64+0x6b/0x3a0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
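
The window in the diagram above can be replayed step by step in user space.
This is only a sketch under stated assumptions: struct fake_sock and every
fake_* helper are hypothetical stand-ins, not kernel code, and the two
halves of connect are run sequentially rather than on two CPUs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the socket states; not the kernel's values. */
enum { FAKE_CLOSE = 0, FAKE_ESTABLISHED = 1 };

struct fake_sock {
	int state;              /* plays the role of sk->sk_state */
	struct fake_sock *peer; /* plays the role of unix_peer(sk) */
};

/* First half of connect: the state becomes visible before the peer is set. */
static void fake_connect_publish_state(struct fake_sock *sk)
{
	assert(sk != NULL);
	sk->state = FAKE_ESTABLISHED;
}

/* Second half of connect: the peer pointer is finally assigned. */
static void fake_connect_assign_peer(struct fake_sock *sk,
				     struct fake_sock *newsk)
{
	assert(sk != NULL && newsk != NULL);
	sk->peer = newsk;
}

/* Same idea as the state check in the report: it looks at the state only. */
static int fake_state_allowed(const struct fake_sock *sk)
{
	return sk->state == FAKE_ESTABLISHED;
}

/*
 * Returns 1 when an updater that trusts the state check alone would go on
 * to dereference a NULL peer, 0 otherwise.
 */
static int fake_update_would_crash(const struct fake_sock *sk)
{
	assert(sk != NULL);
	return fake_state_allowed(sk) && sk->peer == NULL;
}
```

Calling fake_update_would_crash() between the two connect halves reports
the bad window; calling it before the first half or after the second does
not. Holding one lock across both halves, and taking the same lock in the
updater, removes the in-between observation point.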

Initial idea was to move peer assignment _before_ the sk_state update[1],
but that involved an additional memory barrier, and changing the hot path
was rejected. Then a check during proto update was considered[2], but a
follow-up discussion[3] concluded the root cause is sockmap taking a wrong
lock.

Thus, teach sockmap about the af_unix-specific locking: instead of the
usual lock_sock() involving sock::sk_lock, af_unix protects critical
sections under unix_state_lock() operating on unix_sock::lock.

[1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
[2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/

This patch also happens to fix a deadlock that may occur when
bpf_iter_unix_seq_show()'s lock_sock_fast() takes the fast path and the
iter prog attempts to update a sockmap, which ends up spinning at
sock_map_update_elem()'s bh_lock_sock():

WARNING: possible recursive locking detected
--------------------------------------------
test_progs/1393 is trying to acquire lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0

but task is already holding lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(slock-AF_UNIX);
  lock(slock-AF_UNIX);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by test_progs/1393:
 #0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
 #1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
 #2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
 #3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00

Call Trace:
 dump_stack_lvl+0x5d/0x80
 print_deadlock_bug.cold+0xc0/0xce
 __lock_acquire+0x130f/0x2590
 lock_acquire+0x14e/0x2b0
 _raw_spin_lock+0x30/0x40
 sock_map_update_elem+0xdb/0x1f0
 bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
 bpf_iter_run_prog+0x5b9/0xb00
 bpf_iter_unix_seq_show+0x1f7/0x2e0
 bpf_seq_read+0x42c/0x10d0
 vfs_read+0x171/0xb20
 ksys_read+0xff/0x200
 do_syscall_64+0x6b/0x3a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
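
The recursion lockdep complains about can be modelled with a toy
non-recursive lock. A minimal sketch, assuming a C11 compiler; struct
toy_lock and its helpers are hypothetical and use a single try instead of
spinning so that the example terminates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy non-recursive lock; a hypothetical stand-in for the socket spinlock. */
struct toy_lock {
	atomic_flag held;
};

/* One acquisition attempt: true if taken, false if it is already held. */
static bool toy_trylock(struct toy_lock *l)
{
	assert(l != NULL);
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_unlock(struct toy_lock *l)
{
	assert(l != NULL);
	atomic_flag_clear(&l->held);
}

/*
 * Models the iterator path: the outer code takes the lock, then the nested
 * map update tries to take the same lock again. Returns true if the nested
 * attempt succeeds. A real spinning lock would loop forever at the point
 * where this function sees false.
 */
static bool nested_update_gets_lock(struct toy_lock *l)
{
	bool nested;

	if (!toy_trylock(l))     /* outer acquisition */
		return false;
	nested = toy_trylock(l); /* inner acquisition of the same lock */
	toy_unlock(l);
	return nested;
}
```

nested_update_gets_lock() always returns false for a lock that starts out
free, which is the single-CPU scenario shown in the lockdep report.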

Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Keeping sparse annotations in sock_map_sk_{acquire,release}() required some
hackery I'm not proud of. Is there a better way?
---
 net/core/sock_map.c | 47 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index b6586d9590b7..0c638b1f363a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -12,6 +12,7 @@
 #include <linux/list.h>
 #include <linux/jhash.h>
 #include <linux/sock_diag.h>
+#include <net/af_unix.h>
 #include <net/udp.h>
 
 struct bpf_stab {
@@ -115,17 +116,49 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
 }
 
 static void sock_map_sk_acquire(struct sock *sk)
-	__acquires(&sk->sk_lock.slock)
+	__acquires(sock_or_unix_lock)
 {
-	lock_sock(sk);
+	if (sk_is_unix(sk)) {
+		unix_state_lock(sk);
+		__release(sk); /* Silence sparse. */
+	} else {
+		lock_sock(sk);
+	}
+
 	rcu_read_lock();
 }
 
 static void sock_map_sk_release(struct sock *sk)
-	__releases(&sk->sk_lock.slock)
+	__releases(sock_or_unix_lock)
 {
 	rcu_read_unlock();
-	release_sock(sk);
+
+	if (sk_is_unix(sk)) {
+		unix_state_unlock(sk);
+		__acquire(sk); /* Silence sparse. */
+	} else {
+		release_sock(sk);
+	}
+}
+
+static inline void sock_map_sk_acquire_fast(struct sock *sk)
+{
+	local_bh_disable();
+
+	if (sk_is_unix(sk))
+		unix_state_lock(sk);
+	else
+		bh_lock_sock(sk);
+}
+
+static inline void sock_map_sk_release_fast(struct sock *sk)
+{
+	if (sk_is_unix(sk))
+		unix_state_unlock(sk);
+	else
+		bh_unlock_sock(sk);
+
+	local_bh_enable();
 }
 
 static void sock_map_add_link(struct sk_psock *psock,
@@ -604,16 +637,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
 	if (!sock_map_sk_is_suitable(sk))
 		return -EOPNOTSUPP;
 
-	local_bh_disable();
-	bh_lock_sock(sk);
+	sock_map_sk_acquire_fast(sk);
 	if (!sock_map_sk_state_allowed(sk))
 		ret = -EOPNOTSUPP;
 	else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
 		ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
 	else
 		ret = sock_hash_update_common(map, key, sk, flags);
-	bh_unlock_sock(sk);
-	local_bh_enable();
+	sock_map_sk_release_fast(sk);
 	return ret;
 }
 

-- 
2.52.0



* [PATCH bpf v2 4/4] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
From: Michal Luczaj @ 2026-02-07 14:34 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Kuniyuki Iwashima,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
	Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
	Eduard Zingerman, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260207-unix-proto-update-null-ptr-deref-v2-0-9f091330e7cd@rbox.co>

Updating a sockmap from a unix iterator prog may lead to a deadlock.
Piggyback on the original selftest.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 tools/testing/selftests/bpf/progs/bpf_iter_unix.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_unix.c b/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
index fea275df9e22..a2652c8c3616 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
@@ -7,6 +7,13 @@
 
 char _license[] SEC("license") = "GPL";
 
+SEC(".maps") struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} sockmap;
+
 static long sock_i_ino(const struct sock *sk)
 {
 	const struct socket *sk_socket = sk->sk_socket;
@@ -76,5 +83,8 @@ int dump_unix(struct bpf_iter__unix *ctx)
 
 	BPF_SEQ_PRINTF(seq, "\n");
 
+	/* Test for deadlock. */
+	bpf_map_update_elem(&sockmap, &(int){0}, sk, 0);
+
 	return 0;
 }

-- 
2.52.0



* [PATCH bpf v2 1/4] bpf, sockmap: Annotate af_unix sock::sk_state data-races
From: Michal Luczaj @ 2026-02-07 14:34 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Kuniyuki Iwashima,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
	Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
	Eduard Zingerman, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260207-unix-proto-update-null-ptr-deref-v2-0-9f091330e7cd@rbox.co>

sock_map_sk_state_allowed() and sock_map_redirect_allowed() read af_unix
socket sk_state locklessly.

Use READ_ONCE(). Note that the sock_map_redirect_allowed() change affects
not only af_unix, but all non-TCP sockets (UDP, af_vsock).
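
For readers unfamiliar with the idiom, here is a user-space sketch of a
lockless state read followed by the bitmask test. READ_ONCE_SKETCH is a
simplified analog (the kernel macro is more involved), struct fake_sk is
hypothetical, and __typeof__ assumes GCC or clang. The two state values
match the kernel's include/net/tcp_states.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified analog: force exactly one load through a volatile access. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

enum {
	TCP_ESTABLISHED = 1,
	TCP_LISTEN = 10,
};
#define TCPF_ESTABLISHED (1 << TCP_ESTABLISHED)

/* Hypothetical stand-in for the one field of struct sock used here. */
struct fake_sk {
	unsigned char state;
};

/* Lockless check: read the state once, then test its bit in the mask. */
static bool stream_unix_state_allowed(const struct fake_sk *sk)
{
	assert(sk != NULL);
	return ((1 << READ_ONCE_SKETCH(sk->state)) & TCPF_ESTABLISHED) != 0;
}
```

The volatile access only keeps the compiler from tearing or repeating the
load. It does not order the read against other fields, so it documents the
race rather than closing it.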

Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/core/sock_map.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 5947b38e4f8b..d4f15b846ad4 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -530,7 +530,7 @@ static bool sock_map_redirect_allowed(const struct sock *sk)
 	if (sk_is_tcp(sk))
 		return sk->sk_state != TCP_LISTEN;
 	else
-		return sk->sk_state == TCP_ESTABLISHED;
+		return READ_ONCE(sk->sk_state) == TCP_ESTABLISHED;
 }
 
 static bool sock_map_sk_is_suitable(const struct sock *sk)
@@ -543,7 +543,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
 	if (sk_is_tcp(sk))
 		return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
 	if (sk_is_stream_unix(sk))
-		return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+		return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
 	if (sk_is_vsock(sk) &&
 	    (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET))
 		return (1 << sk->sk_state) & TCPF_ESTABLISHED;

-- 
2.52.0



* [PATCH bpf v2 2/4] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded
From: Michal Luczaj @ 2026-02-07 14:34 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Kuniyuki Iwashima,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
	Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
	Eduard Zingerman, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260207-unix-proto-update-null-ptr-deref-v2-0-9f091330e7cd@rbox.co>

Instead of repeating the same (un)locking pattern, reuse
sock_map_sk_{acquire,release}(). This centralizes the code and makes it
easier to adapt sockmap to af_unix-specific locking.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/core/sock_map.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d4f15b846ad4..b6586d9590b7 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -353,11 +353,9 @@ static void sock_map_free(struct bpf_map *map)
 		sk = xchg(psk, NULL);
 		if (sk) {
 			sock_hold(sk);
-			lock_sock(sk);
-			rcu_read_lock();
+			sock_map_sk_acquire(sk);
 			sock_map_unref(sk, psk);
-			rcu_read_unlock();
-			release_sock(sk);
+			sock_map_sk_release(sk);
 			sock_put(sk);
 		}
 	}
@@ -1176,11 +1174,9 @@ static void sock_hash_free(struct bpf_map *map)
 		 */
 		hlist_for_each_entry_safe(elem, node, &unlink_list, node) {
 			hlist_del(&elem->node);
-			lock_sock(elem->sk);
-			rcu_read_lock();
+			sock_map_sk_acquire(elem->sk);
 			sock_map_unref(elem->sk, elem);
-			rcu_read_unlock();
-			release_sock(elem->sk);
+			sock_map_sk_release(elem->sk);
 			sock_put(elem->sk);
 			sock_hash_free_elem(htab, elem);
 		}
@@ -1676,8 +1672,7 @@ void sock_map_close(struct sock *sk, long timeout)
 	void (*saved_close)(struct sock *sk, long timeout);
 	struct sk_psock *psock;
 
-	lock_sock(sk);
-	rcu_read_lock();
+	sock_map_sk_acquire(sk);
 	psock = sk_psock(sk);
 	if (likely(psock)) {
 		saved_close = psock->saved_close;
@@ -1685,16 +1680,14 @@ void sock_map_close(struct sock *sk, long timeout)
 		psock = sk_psock_get(sk);
 		if (unlikely(!psock))
 			goto no_psock;
-		rcu_read_unlock();
 		sk_psock_stop(psock);
-		release_sock(sk);
+		sock_map_sk_release(sk);
 		cancel_delayed_work_sync(&psock->work);
 		sk_psock_put(sk, psock);
 	} else {
 		saved_close = READ_ONCE(sk->sk_prot)->close;
 no_psock:
-		rcu_read_unlock();
-		release_sock(sk);
+		sock_map_sk_release(sk);
 	}
 
 	/* Make sure we do not recurse. This is a bug.

-- 
2.52.0



* [PATCH bpf v2 0/4] bpf, sockmap: Fix af_unix null-ptr-deref in proto update
From: Michal Luczaj @ 2026-02-07 14:34 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Kuniyuki Iwashima,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
	Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
	Eduard Zingerman, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj

BPF_MAP_UPDATE_ELEM races unix_stream_connect(): when
sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED),
unix_peer(sk) in unix_stream_bpf_update_proto() may still return NULL.

BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
  sock_map_link+0x564/0x8b0
  sock_map_update_common+0x6e/0x340
  sock_map_update_elem_sys+0x17d/0x240
  __sys_bpf+0x26db/0x3250
  __x64_sys_bpf+0x21/0x30
  do_syscall_64+0x6b/0x3a0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Series fixes the null-ptr-deref by teaching sockmap about the
af_unix-specific locking. Accidentally this also fixes a deadlock.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Changes in v2:
- Instead of probing for unix peer, make sockmap take the right lock [Martin]
- Annotate data races [Kuniyuki, Martin]
- Extend bpf unix iter selftest to attempt a deadlock
- Link to v1: https://lore.kernel.org/r/20260129-unix-proto-update-null-ptr-deref-v1-1-e1daeb7012fd@rbox.co

---
Michal Luczaj (4):
      bpf, sockmap: Annotate af_unix sock::sk_state data-races
      bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded
      bpf, sockmap: Adapt for the af_unix-specific lock
      selftests/bpf: Extend bpf_iter_unix to attempt deadlocking

 net/core/sock_map.c                               | 72 +++++++++++++++--------
 tools/testing/selftests/bpf/progs/bpf_iter_unix.c | 10 ++++
 2 files changed, 58 insertions(+), 24 deletions(-)
---
base-commit: 2687c848e57820651b9f69d30c4710f4219f7dbf
change-id: 20260129-unix-proto-update-null-ptr-deref-6a2733bcbbf8

Best regards,
-- 
Michal Luczaj <mhal@rbox.co>



* Re: [PATCH net-next] net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup()
From: Vadim Fedorenko @ 2026-02-07 14:23 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn
  Cc: Alexandre Torgue, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, linux-arm-kernel, linux-arm-msm, linux-stm32,
	Mohd Ayaan Anwar, netdev, Paolo Abeni, Sneh Shah, Vinod Koul
In-Reply-To: <E1voPUH-000000083ji-25FH@rmk-PC.armlinux.org.uk>

On 06/02/2026 17:19, Russell King (Oracle) wrote:
> Add cleanup for failure paths in qcom_ethqos_serdes_powerup(). This
> was missing calling phy_exit() and phy_power_off() at appropriate
> failure points.
> 
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
> ---
>   .../net/ethernet/stmicro/stmmac/dwmac-qcom-ethqos.c  | 12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-qcom-ethqos.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-qcom-ethqos.c
> index 869f924f3cde..af8204c0e188 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-qcom-ethqos.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-qcom-ethqos.c
> @@ -659,10 +659,18 @@ static int qcom_ethqos_serdes_powerup(struct net_device *ndev, void *priv)
>   		return ret;
>   
>   	ret = phy_power_on(ethqos->serdes_phy);
> -	if (ret)
> +	if (ret) {
> +		phy_exit(ethqos->serdes_phy);
>   		return ret;
> +	}
>   
> -	return phy_set_speed(ethqos->serdes_phy, ethqos->serdes_speed);
> +	ret = phy_set_speed(ethqos->serdes_phy, ethqos->serdes_speed);
> +	if (ret) {
> +		phy_power_off(ethqos->serdes_phy);
> +		phy_exit(ethqos->serdes_phy);
> +	}
> +
> +	return ret;
>   }
>   
>   static void qcom_ethqos_serdes_powerdown(struct net_device *ndev, void *priv)

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>


* Re: [PATCH net-next v20 00/12] virtio_net: Add ethtool flow rules support
From: Dan Jurgens @ 2026-02-07 13:14 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jakub Kicinski
  Cc: netdev, jasowang, pabeni, virtualization, parav, shshitrit,
	yohadt, xuanzhuo, eperezma, jgg, kevin.tian, andrew+netdev,
	edumazet
In-Reply-To: <20260207045848-mutt-send-email-mst@kernel.org>

On 2/7/26 4:01 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 05, 2026 at 06:43:28PM -0800, Jakub Kicinski wrote:
>> On Thu, 5 Feb 2026 16:46:55 -0600 Daniel Jurgens wrote:
>>> This series implements ethtool flow rules support for virtio_net using the
>>> virtio flow filter (FF) specification. The implementation allows users to
>>> configure packet filtering rules through ethtool commands, directing
>>> packets to specific receive queues, or dropping them based on various
>>> header fields.
>>
>> This is a 4th version of this you posted in as many days and it doesn't
>> even build. Please slow down. Please wait with v21 until after the merge
>> window. We have enough patches to sift thru still for v7.0.
> 
> v20 and no end in sight.
> Just looking at the amount of pain all this parsing is inflicting
> makes me worry. And wait until we need to begin worrying about
> maintaining UAPI stability.
> 
> It would be much nicer if drivers were out of the business of parsing
> fiddly structures.  Isn't there a way for more code in net core
> to deal with all this?

MST, you reviewed the spec that defined these data structures. If you
didn't want the driver to have to parse data structures, then suggesting
the same format as the ethtool flow specs would have been a great
idea at that point. Or, short of that, padded and fixed-size data
structures would also have made things much cleaner.

I thought this series was close to done, so I was trying to address the
very non-deterministic AI review comments. It's been generating new
comments on things that had been there for many revisions, and running
it locally with the same model never reproduces the comments from the
online review.




* Re: [PATCH net v2 2/2] octeontx2-af: CGX: replace kfree() with rvu_free_bitmap()
From: Vadim Fedorenko @ 2026-02-07 12:19 UTC (permalink / raw)
  To: Bo Sun, kuba, pabeni
  Cc: gakula, sgoutham, sbhatta, hkelam, horms, bbhushan2,
	andrew+netdev, davem, edumazet, sumang, netdev, linux-kernel
In-Reply-To: <20260206130925.1087588-3-bo@mboxify.com>

On 06/02/2026 13:09, Bo Sun wrote:
> mac_to_index_bmap is allocated with rvu_alloc_bitmap(), so free it
> with rvu_free_bitmap() instead of open-coding kfree(.bmap) to keep
> the alloc/free API pairing consistent.
> 
> Signed-off-by: Bo Sun <bo@mboxify.com>
> ---
>   drivers/net/ethernet/marvell/octeontx2/af/cgx.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
> index fd4792e432bf..29f5def796ba 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
> @@ -1822,7 +1822,7 @@ static int cgx_lmac_exit(struct cgx *cgx)
>   			continue;
>   		cgx->mac_ops->mac_pause_frm_config(cgx, lmac->lmac_id, false);
>   		cgx_configure_interrupt(cgx, lmac, lmac->lmac_id, true);
> -		kfree(lmac->mac_to_index_bmap.bmap);
> +		rvu_free_bitmap(&lmac->mac_to_index_bmap);
>   		rvu_free_bitmap(&lmac->rx_fc_pfvf_bmap);
>   		rvu_free_bitmap(&lmac->tx_fc_pfvf_bmap);
>   		kfree(lmac->name);

The code LGTM, but as Jakub mentioned in v1, the cleanup should be
a separate patch targeting net-next.

On respin you can add:
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>


* Re: [PATCH net v2 1/2] octeontx2-af: CGX: fix bitmap leaks
From: Vadim Fedorenko @ 2026-02-07 12:17 UTC (permalink / raw)
  To: Bo Sun, kuba, pabeni
  Cc: gakula, sgoutham, sbhatta, hkelam, horms, bbhushan2,
	andrew+netdev, davem, edumazet, sumang, netdev, linux-kernel,
	stable
In-Reply-To: <20260206130925.1087588-2-bo@mboxify.com>

On 06/02/2026 13:09, Bo Sun wrote:
> The RX/TX flow-control bitmaps (rx_fc_pfvf_bmap and tx_fc_pfvf_bmap)
> are allocated by cgx_lmac_init() but never freed in cgx_lmac_exit().
> Unbinding and rebinding the driver therefore triggers kmemleak:
> 
>      unreferenced object (size 16):
>          backtrace:
>            rvu_alloc_bitmap
>            cgx_probe
> 
> Free both bitmaps during teardown.
> 
> Fixes: e740003874ed ("octeontx2-af: Flow control resource management")
> Cc: stable@vger.kernel.org
> Signed-off-by: Bo Sun <bo@mboxify.com>
> ---
>   drivers/net/ethernet/marvell/octeontx2/af/cgx.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
> index 42044cd810b1..fd4792e432bf 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
> @@ -1823,6 +1823,8 @@ static int cgx_lmac_exit(struct cgx *cgx)
>   		cgx->mac_ops->mac_pause_frm_config(cgx, lmac->lmac_id, false);
>   		cgx_configure_interrupt(cgx, lmac, lmac->lmac_id, true);
>   		kfree(lmac->mac_to_index_bmap.bmap);
> +		rvu_free_bitmap(&lmac->rx_fc_pfvf_bmap);
> +		rvu_free_bitmap(&lmac->tx_fc_pfvf_bmap);
>   		kfree(lmac->name);
>   		kfree(lmac);
>   	}

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>


* i40e: Fix preempt count leak in napi poll tracepoint
From: Thomas Gleixner @ 2026-02-07 10:50 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: Tony Nguyen, Przemek Kitszel, netdev

Using get_cpu() in the tracepoint assignment causes an obvious preempt
count leak because nothing invokes put_cpu() to undo it:

  softirq: huh, entered softirq 3 NET_RX with preempt_count 00000100, exited with 00000101?

This clearly has seen a lot of testing in the last 3+ years...

Use smp_processor_id() instead.
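
The leak is easy to model in user space. A minimal sketch, assuming a
simulated counter: every sim_* name below is hypothetical, and the real
preempt count lives in per-task kernel state.

```c
#include <assert.h>

/* Hypothetical stand-ins for the preempt counter and the current CPU id. */
static int sim_preempt_count;
static const int sim_cpu_id = 3;

/* get_cpu() analog: disables preemption (count + 1), then reports the CPU. */
static int sim_get_cpu(void)
{
	sim_preempt_count++;
	return sim_cpu_id;
}

/* put_cpu() analog: the matching decrement that the tracepoint never did. */
static void sim_put_cpu(void)
{
	assert(sim_preempt_count > 0);
	sim_preempt_count--;
}

/* smp_processor_id() analog: reports the CPU and leaves the count alone. */
static int sim_smp_processor_id(void)
{
	return sim_cpu_id;
}

/* Counter change across one "assignment" that uses the get_cpu() analog. */
static int leak_with_get_cpu(void)
{
	int before = sim_preempt_count;

	(void)sim_get_cpu();
	return sim_preempt_count - before;
}

/* Counter change across one "assignment" using smp_processor_id() analog. */
static int leak_with_smp_processor_id(void)
{
	int before = sim_preempt_count;

	(void)sim_smp_processor_id();
	return sim_preempt_count - before;
}
```

leak_with_get_cpu() reports a change of one per call, the same off-by-one
the softirq warning shows, while leak_with_smp_processor_id() reports zero.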

Fixes: 6d4d584a7ea8 ("i40e: Add i40e_napi_poll tracepoint")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: netdev@vger.kernel.org
---
 drivers/net/ethernet/intel/i40e/i40e_trace.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/ethernet/intel/i40e/i40e_trace.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_trace.h
@@ -88,7 +88,7 @@ TRACE_EVENT(i40e_napi_poll,
 		__entry->rx_clean_complete = rx_clean_complete;
 		__entry->tx_clean_complete = tx_clean_complete;
 		__entry->irq_num = q->irq_num;
-		__entry->curr_cpu = get_cpu();
+		__entry->curr_cpu = smp_processor_id();
 		__assign_str(qname);
 		__assign_str(dev_name);
 		__assign_bitmask(irq_affinity, cpumask_bits(&q->affinity_mask),


* Re: [PATCH net-next] xfrm: reduce struct sec_path size
From: Florian Westphal @ 2026-02-07 10:39 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, Steffen Klassert, Herbert Xu, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Simon Horman
In-Reply-To: <83846bd2e3fa08899bd0162e41bfadfec95e82ef.1770398071.git.pabeni@redhat.com>

Paolo Abeni <pabeni@redhat.com> wrote:
> The mentioned struct has a hole and uses unnecessarily wide types to
> store MAC length and indexes of very small arrays.
> 
> It's also embedded into the skb_extensions, and the latter, due
> to recent CAN changes, may exceed the 192 bytes mark (3 cachelines
> on x86_64 arch) on some reasonable configurations.
> 
> By reordering the sec_path fields, shrinking xfrm_offload.orig_mac_len
> to 16 bits and xfrm_offload.{len,olen,verified_cnt} to u8, we can save
> 16 bytes and keep skb_extensions size under control.

Reviewed-by: Florian Westphal <fw@strlen.de>


* Re: [PATCH v2 net-next] virtio_net: Improve RSS key size validation and use NETDEV_RSS_KEY_LEN
From: Michael S. Tsirkin @ 2026-02-07 10:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Srujana Challa, netdev, virtualization, pabeni, jasowang,
	xuanzhuo, eperezma, davem, edumazet, ndabilpuram, kshankar
In-Reply-To: <20260206191308.7fdbfef4@kernel.org>

On Fri, Feb 06, 2026 at 07:13:08PM -0800, Jakub Kicinski wrote:
> On Fri, 6 Feb 2026 17:31:54 +0530 Srujana Challa wrote:
> > Replace hardcoded RSS max key size limit with NETDEV_RSS_KEY_LEN to
> > align with kernel's standard RSS key length. Add validation for RSS
> > key size against spec minimum (40 bytes) and driver maximum. When
> > validation fails, gracefully disable RSS features and continue
> > initialization rather than failing completely.
> 
> Hm, FWIW clang says:
> 
> drivers/net/virtio_net.c:6841:31: warning: result of comparison of constant 256 with expression of type 'u8' (aka 'unsigned char') is always false [-Wtautological-constant-out-of-range-compare]
>  6841 |                 } else if (vi->rss_key_size > VIRTIO_NET_RSS_MAX_KEY_SIZE) {
>       |                            ~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Which is kinda annoying because the value was increased in net-next.
> If Michael wants this backported then we need to keep the check
> and follow up in net-next? We could try to cast the u32 away but
> that feels dirty..


For net-next we will presumably replace it with
	BUILD_BUG_ON(type_max(vi->rss_key_size) > NETDEV_RSS_KEY_LEN)

and I think we should get rid of VIRTIO_NET_RSS_MAX_KEY_SIZE while
we are at it.
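
A rough userspace sketch of that idea, under stated assumptions: C11
static_assert stands in for BUILD_BUG_ON, FIELD_MAX is a hypothetical
stand-in for type_max(), struct vi_sketch is a made-up reduction of the
driver struct, and 256 is used for the key length only because that is
the constant shown in the clang warning. The 40-byte minimum comes from
the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Stand-in for NETDEV_RSS_KEY_LEN; 256 is the constant from the warning. */
#define SKETCH_RSS_KEY_LEN	256
/* Spec minimum quoted in the commit message. */
#define SKETCH_RSS_MIN_KEY_LEN	40

/* Hypothetical, heavily reduced version of the driver's private struct. */
struct vi_sketch {
	uint8_t rss_key_size;
};

/* Largest value an unsigned field can hold, derived from its width. */
#define FIELD_MAX(type, field) \
	(~0ULL >> (64 - sizeof(((type *)0)->field) * CHAR_BIT))

/*
 * Compile-time replacement for the runtime "size > max" test: if the
 * field can never exceed the buffer, the runtime check is redundant,
 * and if someone widens the field later the build breaks instead.
 */
static_assert(FIELD_MAX(struct vi_sketch, rss_key_size) <= SKETCH_RSS_KEY_LEN,
	      "rss_key_size can hold values larger than the key buffer");

/* Only the lower bound still needs a runtime check. */
int rss_key_size_ok(const struct vi_sketch *vi)
{
	assert(vi);
	return vi->rss_key_size >= SKETCH_RSS_MIN_KEY_LEN;
}
```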

-- 
MST


^ permalink raw reply

* [PATCH net-next v2] docs: ethtool: clarify the bit-by-bit bitset format description
From: Yohei Kojima @ 2026-02-07 10:25 UTC (permalink / raw)
  To: Andrew Lunn, Jakub Kicinski, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, Jonathan Corbet
  Cc: Yohei Kojima, netdev, linux-doc, linux-kernel

Clarify the bit-by-bit bitset format's behavior around mandatory
attributes and bit identification. More specifically, the following
changes are made:

* Rephrase a misleading sentence which implies name and index are
  mutually exclusive
* Describe that ETHTOOL_A_BITSET_BITS nest is mandatory
* Describe that a request fails if inconsistent identifiers are given

Signed-off-by: Yohei Kojima <yk@y-koj.net>
---
The current ethtool-netlink documentation doesn't describe several behaviors
of the bit-by-bit bitset format, which makes it hard to develop an ethtool
library without digging into the kernel code. This patch narrows the gap
between the kernel behavior and the documentation by adding descriptions
of the mandatory attribute and bit identification.
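
As an illustration of the rules being documented, here is a minimal
userspace sketch of how a library might model them. It is not the kernel
implementation: struct bit_entry, resolve_bit(), apply_bits() and the -1
error value are all hypothetical, and the bitmap is limited to 32 bits
for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NO_INDEX UINT32_MAX

/* One entry of the bit-by-bit list; every name here is hypothetical. */
struct bit_entry {
	uint32_t index;    /* SKETCH_NO_INDEX if no index was sent */
	const char *name;  /* NULL if no name was sent */
	int value;         /* nonzero: set the bit, zero: clear it */
};

/* Resolve an entry to a bit number, or -1 if it must be rejected. */
int resolve_bit(const struct bit_entry *e,
		const char *const *names, uint32_t nbits)
{
	int by_name = -1;

	if (e->name) {
		for (uint32_t i = 0; i < nbits; i++)
			if (strcmp(names[i], e->name) == 0)
				by_name = (int)i;
		if (by_name < 0)
			return -1;      /* name is not recognized */
	}
	if (e->index != SKETCH_NO_INDEX) {
		if (e->index >= nbits)
			return -1;      /* index exceeds the bit length */
		if (by_name >= 0 && (uint32_t)by_name != e->index)
			return -1;      /* name and index point to different bits */
		return (int)e->index;
	}
	return by_name;                 /* still -1 if neither was given */
}

/* Apply the listed bits to *bitmap; bits that are not listed are preserved. */
int apply_bits(uint32_t *bitmap, const struct bit_entry *list, size_t n,
	       const char *const *names, uint32_t nbits)
{
	uint32_t result = *bitmap;

	assert(nbits <= 32);
	for (size_t i = 0; i < n; i++) {
		int bit = resolve_bit(&list[i], names, nbits);

		if (bit < 0)
			return -1;      /* the whole request fails, nothing is written */
		if (list[i].value)
			result |= UINT32_C(1) << bit;
		else
			result &= ~(UINT32_C(1) << bit);
	}
	*bitmap = result;
	return 0;
}
```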

ChangeLog
=========
v2 (this version):
* Minimize the diff for ease of review
v1: https://lore.kernel.org/lkml/e9ea0fe8bf7935d6439e4dc883414b685afbaf58.1770045398.git.yk@y-koj.net/

---
 Documentation/networking/ethtool-netlink.rst | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index af56c304cef4..32179168eb73 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -96,7 +96,7 @@ For short bitmaps of (reasonably) fixed length, standard ``NLA_BITFIELD32``
 type is used. For arbitrary length bitmaps, ethtool netlink uses a nested
 attribute with contents of one of two forms: compact (two binary bitmaps
 representing bit values and mask of affected bits) and bit-by-bit (list of
-bits identified by either index or name).
+bits identified by index or name).
 
 Verbose (bit-by-bit) bitsets allow sending symbolic names for bits together
 with their values which saves a round trip (when the bitset is passed in a
@@ -156,12 +156,16 @@ Bit-by-bit form: nested (bitset) attribute contents:
  | | | ``ETHTOOL_A_BITSET_BIT_VALUE`` | flag   | present if bit is set       |
  +-+-+--------------------------------+--------+-----------------------------+
 
-Bit size is optional for bit-by-bit form. ``ETHTOOL_A_BITSET_BITS`` nest can
+For bit-by-bit form, ``ETHTOOL_A_BITSET_SIZE`` is optional, and
+``ETHTOOL_A_BITSET_BITS`` is mandatory. ``ETHTOOL_A_BITSET_BITS`` nest can
 only contain ``ETHTOOL_A_BITSET_BITS_BIT`` attributes but there can be an
 arbitrary number of them.  A bit may be identified by its index or by its
 name. When used in requests, listed bits are set to 0 or 1 according to
-``ETHTOOL_A_BITSET_BIT_VALUE``, the rest is preserved. A request fails if
-index exceeds kernel bit length or if name is not recognized.
+``ETHTOOL_A_BITSET_BIT_VALUE``, the rest is preserved.
+
+A request fails if index exceeds kernel bit length or if name is not
+recognized. If both name and index are set, the request will fail if they
+point to different bits.
 
 When ``ETHTOOL_A_BITSET_NOMASK`` flag is present, bitset is interpreted as
 a simple bitmap. ``ETHTOOL_A_BITSET_BIT_VALUE`` attributes are not used in
-- 
2.52.0


^ permalink raw reply related

* [PATCH net v2] iavf: fix deadlock in reset handling
From: Petr Oros @ 2026-02-07 10:22 UTC (permalink / raw)
  To: netdev
  Cc: ivecera, aleksandr.loktionov, shaojijie, Petr Oros, Jacob Keller,
	Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Stanislav Fomichev,
	intel-wired-lan, linux-kernel

Three driver callbacks schedule a reset and wait for its completion:
ndo_change_mtu(), ethtool set_ringparam(), and ethtool set_channels().

Waiting for reset in ndo_change_mtu() and set_ringparam() was added by
commit c2ed2403f12c ("iavf: Wait for reset in callbacks which trigger
it") to fix a race condition where adding an interface to bonding
immediately after MTU or ring parameter change failed because the
interface was still in __RESETTING state. The same commit also added
waiting in iavf_set_priv_flags(), which was later removed by commit
53844673d555 ("iavf: kill "legacy-rx" for good").

Waiting in set_channels() was introduced earlier by commit 4e5e6b5d9d13
("iavf: Fix return of set the new channel count") to ensure the PF has
enough time to complete the VF reset when changing channel count, and to
return correct error codes to userspace.

Commit ef490bbb2267 ("iavf: Add net_shaper_ops support") added
net_shaper_ops to iavf, which required reset_task to use _locked NAPI
variants (napi_enable_locked, napi_disable_locked) that need the netdev
instance lock.

Later, commit 7e4d784f5810 ("net: hold netdev instance lock during
rtnetlink operations") and commit 2bcf4772e45a ("net: ethtool: try to
protect all callback with netdev instance lock") started holding the
netdev instance lock during ndo and ethtool callbacks for drivers with
net_shaper_ops.

Finally, commit 120f28a6f314 ("iavf: get rid of the crit lock")
replaced the driver's crit_lock with netdev_lock in reset_task, making
the deadlock manifest: the callback holds netdev_lock and waits for
reset_task, but reset_task needs the same lock:

  Thread 1 (callback)               Thread 2 (reset_task)
  -------------------               ---------------------
  netdev_lock()                     [blocked on workqueue]
  ndo_change_mtu() or ethtool op
    iavf_schedule_reset()
    iavf_wait_for_reset()           iavf_reset_task()
      waiting...                      netdev_lock() <- DEADLOCK

Fix this by extracting the reset logic from iavf_reset_task() into a new
iavf_reset_step() function that expects netdev_lock to be already held.
The three callbacks now call iavf_reset_step() directly instead of
scheduling the work and waiting, performing the reset synchronously in
the caller's context which already holds netdev_lock. This eliminates
both the deadlock and the need for iavf_wait_for_reset(), which is
removed.

The workqueue-based iavf_reset_task() becomes a thin wrapper that
acquires netdev_lock and calls iavf_reset_step(), preserving its use
for PF-initiated resets.

The callbacks may block for several seconds while iavf_reset_step()
polls hardware registers, but this is acceptable since netdev_lock is a
per-device mutex and only serializes operations on the same interface.
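
For illustration only, the locking pattern can be reduced to the
following userspace sketch. It uses a pthread mutex in place of the
netdev instance lock, and every name in it (dev_sketch, reset_step,
reset_task, change_mtu_locked, the lock_held flag) is hypothetical
rather than taken from the driver.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a device guarded by a per-device mutex. */
struct dev_sketch {
	pthread_mutex_t lock; /* plays the role of the netdev instance lock */
	int lock_held;        /* bookkeeping so the step can check its precondition */
	int reset_count;
	int mtu;
};

/* Core reset logic: the caller must already hold dev->lock. */
void reset_step(struct dev_sketch *dev)
{
	assert(dev->lock_held);
	dev->reset_count++;
}

/* Deferred-work entry point: a thin wrapper that takes the lock itself. */
void reset_task(struct dev_sketch *dev)
{
	pthread_mutex_lock(&dev->lock);
	dev->lock_held = 1;
	reset_step(dev);
	dev->lock_held = 0;
	pthread_mutex_unlock(&dev->lock);
}

/*
 * Callback that the core invokes with dev->lock already held. It runs
 * the reset synchronously instead of queueing reset_task() and waiting
 * for it, because reset_task() could never take the lock while this
 * thread still owns it.
 */
int change_mtu_locked(struct dev_sketch *dev, int new_mtu)
{
	assert(dev->lock_held);
	dev->mtu = new_mtu;
	reset_step(dev);
	return 0;
}
```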

Fixes: 120f28a6f314 ("iavf: get rid of the crit lock")
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Petr Oros <poros@redhat.com>
---
 drivers/net/ethernet/intel/iavf/iavf.h        |  2 +-
 .../net/ethernet/intel/iavf/iavf_ethtool.c    | 21 +++---
 drivers/net/ethernet/intel/iavf/iavf_main.c   | 72 +++++++------------
 3 files changed, 33 insertions(+), 62 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf.h b/drivers/net/ethernet/intel/iavf/iavf.h
index d552f912e8a947..0c3844b3ff1c86 100644
--- a/drivers/net/ethernet/intel/iavf/iavf.h
+++ b/drivers/net/ethernet/intel/iavf/iavf.h
@@ -625,5 +625,5 @@ void iavf_add_adv_rss_cfg(struct iavf_adapter *adapter);
 void iavf_del_adv_rss_cfg(struct iavf_adapter *adapter);
 struct iavf_mac_filter *iavf_add_filter(struct iavf_adapter *adapter,
 					const u8 *macaddr);
-int iavf_wait_for_reset(struct iavf_adapter *adapter);
+void iavf_reset_step(struct iavf_adapter *adapter);
 #endif /* _IAVF_H_ */
diff --git a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
index 2cc21289a70779..9b0f47f9340942 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
@@ -492,7 +492,6 @@ static int iavf_set_ringparam(struct net_device *netdev,
 {
 	struct iavf_adapter *adapter = netdev_priv(netdev);
 	u32 new_rx_count, new_tx_count;
-	int ret = 0;
 
 	if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
 		return -EINVAL;
@@ -537,13 +536,11 @@ static int iavf_set_ringparam(struct net_device *netdev,
 	}
 
 	if (netif_running(netdev)) {
-		iavf_schedule_reset(adapter, IAVF_FLAG_RESET_NEEDED);
-		ret = iavf_wait_for_reset(adapter);
-		if (ret)
-			netdev_warn(netdev, "Changing ring parameters timeout or interrupted waiting for reset");
+		adapter->flags |= IAVF_FLAG_RESET_NEEDED;
+		iavf_reset_step(adapter);
 	}
 
-	return ret;
+	return 0;
 }
 
 /**
@@ -1723,7 +1720,6 @@ static int iavf_set_channels(struct net_device *netdev,
 {
 	struct iavf_adapter *adapter = netdev_priv(netdev);
 	u32 num_req = ch->combined_count;
-	int ret = 0;
 
 	if ((adapter->vf_res->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_ADQ) &&
 	    adapter->num_tc) {
@@ -1745,13 +1741,12 @@ static int iavf_set_channels(struct net_device *netdev,
 
 	adapter->num_req_queues = num_req;
 	adapter->flags |= IAVF_FLAG_REINIT_ITR_NEEDED;
-	iavf_schedule_reset(adapter, IAVF_FLAG_RESET_NEEDED);
-
-	ret = iavf_wait_for_reset(adapter);
-	if (ret)
-		netdev_warn(netdev, "Changing channel count timeout or interrupted waiting for reset");
+	if (netif_running(netdev)) {
+		adapter->flags |= IAVF_FLAG_RESET_NEEDED;
+		iavf_reset_step(adapter);
+	}
 
-	return ret;
+	return 0;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 8aa6e92c16431f..9c8d6125106f5a 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -185,31 +185,6 @@ static bool iavf_is_reset_in_progress(struct iavf_adapter *adapter)
 	return false;
 }
 
-/**
- * iavf_wait_for_reset - Wait for reset to finish.
- * @adapter: board private structure
- *
- * Returns 0 if reset finished successfully, negative on timeout or interrupt.
- */
-int iavf_wait_for_reset(struct iavf_adapter *adapter)
-{
-	int ret = wait_event_interruptible_timeout(adapter->reset_waitqueue,
-					!iavf_is_reset_in_progress(adapter),
-					msecs_to_jiffies(5000));
-
-	/* If ret < 0 then it means wait was interrupted.
-	 * If ret == 0 then it means we got a timeout while waiting
-	 * for reset to finish.
-	 * If ret > 0 it means reset has finished.
-	 */
-	if (ret > 0)
-		return 0;
-	else if (ret < 0)
-		return -EINTR;
-	else
-		return -EBUSY;
-}
-
 /**
  * iavf_allocate_dma_mem_d - OS specific memory alloc for shared code
  * @hw:   pointer to the HW structure
@@ -3100,18 +3075,16 @@ static void iavf_reconfig_qs_bw(struct iavf_adapter *adapter)
 }
 
 /**
- * iavf_reset_task - Call-back task to handle hardware reset
- * @work: pointer to work_struct
+ * iavf_reset_step - Perform the VF reset sequence
+ * @adapter: board private structure
  *
- * During reset we need to shut down and reinitialize the admin queue
- * before we can use it to communicate with the PF again. We also clear
- * and reinit the rings because that context is lost as well.
- **/
-static void iavf_reset_task(struct work_struct *work)
+ * Requests a reset from PF, polls for completion, and reconfigures
+ * the driver. Caller must hold the netdev instance lock.
+ *
+ * This can sleep for several seconds while polling HW registers.
+ */
+void iavf_reset_step(struct iavf_adapter *adapter)
 {
-	struct iavf_adapter *adapter = container_of(work,
-						      struct iavf_adapter,
-						      reset_task);
 	struct virtchnl_vf_resource *vfres = adapter->vf_res;
 	struct net_device *netdev = adapter->netdev;
 	struct iavf_hw *hw = &adapter->hw;
@@ -3122,7 +3095,7 @@ static void iavf_reset_task(struct work_struct *work)
 	int i = 0, err;
 	bool running;
 
-	netdev_lock(netdev);
+	netdev_assert_locked(netdev);
 
 	iavf_misc_irq_disable(adapter);
 	if (adapter->flags & IAVF_FLAG_RESET_NEEDED) {
@@ -3167,7 +3140,6 @@ static void iavf_reset_task(struct work_struct *work)
 		dev_err(&adapter->pdev->dev, "Reset never finished (%x)\n",
 			reg_val);
 		iavf_disable_vf(adapter);
-		netdev_unlock(netdev);
 		return; /* Do not attempt to reinit. It's dead, Jim. */
 	}
 
@@ -3179,7 +3151,6 @@ static void iavf_reset_task(struct work_struct *work)
 		iavf_startup(adapter);
 		queue_delayed_work(adapter->wq, &adapter->watchdog_task,
 				   msecs_to_jiffies(30));
-		netdev_unlock(netdev);
 		return;
 	}
 
@@ -3321,7 +3292,6 @@ static void iavf_reset_task(struct work_struct *work)
 	adapter->flags &= ~IAVF_FLAG_REINIT_ITR_NEEDED;
 
 	wake_up(&adapter->reset_waitqueue);
-	netdev_unlock(netdev);
 
 	return;
 reset_err:
@@ -3331,10 +3301,21 @@ static void iavf_reset_task(struct work_struct *work)
 	}
 	iavf_disable_vf(adapter);
 
-	netdev_unlock(netdev);
 	dev_err(&adapter->pdev->dev, "failed to allocate resources during reinit\n");
 }
 
+static void iavf_reset_task(struct work_struct *work)
+{
+	struct iavf_adapter *adapter = container_of(work,
+						      struct iavf_adapter,
+						      reset_task);
+	struct net_device *netdev = adapter->netdev;
+
+	netdev_lock(netdev);
+	iavf_reset_step(adapter);
+	netdev_unlock(netdev);
+}
+
 /**
  * iavf_adminq_task - worker thread to clean the admin queue
  * @work: pointer to work_struct containing our data
@@ -4600,22 +4581,17 @@ static int iavf_close(struct net_device *netdev)
 static int iavf_change_mtu(struct net_device *netdev, int new_mtu)
 {
 	struct iavf_adapter *adapter = netdev_priv(netdev);
-	int ret = 0;
 
 	netdev_dbg(netdev, "changing MTU from %d to %d\n",
 		   netdev->mtu, new_mtu);
 	WRITE_ONCE(netdev->mtu, new_mtu);
 
 	if (netif_running(netdev)) {
-		iavf_schedule_reset(adapter, IAVF_FLAG_RESET_NEEDED);
-		ret = iavf_wait_for_reset(adapter);
-		if (ret < 0)
-			netdev_warn(netdev, "MTU change interrupted waiting for reset");
-		else if (ret)
-			netdev_warn(netdev, "MTU change timed out waiting for reset");
+		adapter->flags |= IAVF_FLAG_RESET_NEEDED;
+		iavf_reset_step(adapter);
 	}
 
-	return ret;
+	return 0;
 }
 
 /**
-- 
2.52.0


^ permalink raw reply related

* Re: [PATCH net-next v20 00/12] virtio_net: Add ethtool flow rules support
From: Michael S. Tsirkin @ 2026-02-07 10:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Jurgens, netdev, jasowang, pabeni, virtualization, parav,
	shshitrit, yohadt, xuanzhuo, eperezma, jgg, kevin.tian,
	andrew+netdev, edumazet
In-Reply-To: <20260205184328.3b706154@kernel.org>

On Thu, Feb 05, 2026 at 06:43:28PM -0800, Jakub Kicinski wrote:
> On Thu, 5 Feb 2026 16:46:55 -0600 Daniel Jurgens wrote:
> > This series implements ethtool flow rules support for virtio_net using the
> > virtio flow filter (FF) specification. The implementation allows users to
> > configure packet filtering rules through ethtool commands, directing
> > packets to specific receive queues, or dropping them based on various
> > header fields.
> 
> This is the 4th version of this you have posted in as many days and it
> doesn't even build. Please slow down. Please wait with v21 until after the
> merge window. We still have enough patches to sift through for v7.0.

v20 and no end in sight.
Just looking at the amount of pain all this parsing is inflicting
makes me worry. And wait until we need to begin worrying about
maintaining UAPI stability.

It would be much nicer if drivers were out of the business of parsing
fiddly structures.  Isn't there a way for more code in net core
to deal with all this?
-- 
MST


^ permalink raw reply

* Re: [PATCH v2 net-next] virtio_net: Improve RSS key size validation and use NETDEV_RSS_KEY_LEN
From: Michael S. Tsirkin @ 2026-02-07  9:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Srujana Challa, netdev, virtualization, pabeni, jasowang,
	xuanzhuo, eperezma, davem, edumazet, ndabilpuram, kshankar
In-Reply-To: <20260206191308.7fdbfef4@kernel.org>

On Fri, Feb 06, 2026 at 07:13:08PM -0800, Jakub Kicinski wrote:
> On Fri, 6 Feb 2026 17:31:54 +0530 Srujana Challa wrote:
> > Replace hardcoded RSS max key size limit with NETDEV_RSS_KEY_LEN to
> > align with kernel's standard RSS key length. Add validation for RSS
> > key size against spec minimum (40 bytes) and driver maximum. When
> > validation fails, gracefully disable RSS features and continue
> > initialization rather than failing completely.
> 
> Hm, FWIW clang says:
> 
> drivers/net/virtio_net.c:6841:31: warning: result of comparison of constant 256 with expression of type 'u8' (aka 'unsigned char') is always false [-Wtautological-constant-out-of-range-compare]
>  6841 |                 } else if (vi->rss_key_size > VIRTIO_NET_RSS_MAX_KEY_SIZE) {
>       |                            ~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Which is kinda annoying because the value was increased in net-next.
> If Michael wants this backported then we need to keep the check
> and follow up in net-next? We could try to cast the u32 away but
> that feels dirty..


I'd say yes. The warning is harmless.
So:
patch 1 - this code
patch 2 - replace with BUILD_BUG_ON

But then I have to ask whether this code was actually tested against net-next.

-- 
MST


^ permalink raw reply

* [PATCH v6 4/4] net: phy: realtek: add RTL8224 polarity support
From: Damien Dejean @ 2026-02-07  9:25 UTC (permalink / raw)
  To: andrew, krzk+dt, robh, kuba, maxime.chevallier
  Cc: netdev, devicetree, linux-kernel, edumazet, davem, pabeni,
	hkallweit1, Damien Dejean
In-Reply-To: <20260207092539.647768-1-dam.dejean@gmail.com>

The RTL8224 has a register to configure the polarity of every pair of
each port. It gives device designers more flexibility when wiring the
chip.

Unfortunately, the register is left in an unknown state after a reset.
Thus on devices where the bootloader doesn't initialize it, the driver has
to do so in order to detect and use a link.

The MDI polarity swap can be set in the device tree using the property
enet-phy-pair-polarity. The u32 value is a bitfield where bit[0..3]
control the polarity of pairs A..D.
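
As a rough standalone illustration of the register layout, the sketch
below performs the same read-modify-write on a plain 16-bit value. The
helper name polarity_update() is hypothetical; the assumption that each
port owns a 4-bit field selected by the low two bits of its MDIO address
mirrors the code in this patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 'reg' with the 4-bit polarity field of one port replaced.
 * Bit 0 of 'polarity' is pair A, bit 3 is pair D.
 */
uint16_t polarity_update(uint16_t reg, unsigned int mdio_addr,
			 uint32_t polarity)
{
	unsigned int offset = (mdio_addr & 3) * 4;

	assert((polarity & ~0xfu) == 0);        /* only pairs A..D exist */
	reg &= (uint16_t)~(0xfu << offset);     /* clear this port's field */
	reg |= (uint16_t)(polarity << offset);  /* install the new polarity */
	return reg;
}
```

The fields of the other three ports are left untouched, which is why the
driver has to read the shared register before writing it back.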

Signed-off-by: Damien Dejean <dam.dejean@gmail.com>
---
 drivers/net/phy/realtek/realtek_main.c | 45 +++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/realtek/realtek_main.c b/drivers/net/phy/realtek/realtek_main.c
index 4f0c1b72f7e0..d15d3b41e5d1 100644
--- a/drivers/net/phy/realtek/realtek_main.c
+++ b/drivers/net/phy/realtek/realtek_main.c
@@ -172,6 +172,7 @@
 #define RTL8224_SRAM_RTCT_LEN(pair)		(0x8028 + (pair) * 4)
 
 #define RTL8224_VND1_MDI_PAIR_SWAP		0xa90
+#define RTL8224_VND1_MDI_POLARITY_SWAP		0xa94
 
 #define RTL8366RB_POWER_SAVE			0x15
 #define RTL8366RB_POWER_SAVE_ON			BIT(12)
@@ -1861,9 +1862,51 @@ static int rtl8224_mdi_config_order(struct phy_device *phydev)
 	return ret;
 }
 
+static int rtl8224_mdi_config_polarity(struct phy_device *phydev)
+{
+	struct device_node *np = phydev->mdio.dev.of_node;
+	u8 offset = (phydev->mdio.addr & 3) * 4;
+	u32 polarity = 0;
+	int ret, val;
+
+	ret = of_property_read_u32(np, "enet-phy-pair-polarity", &polarity);
+
+	/* Do nothing if the property is not present */
+	if (ret == -EINVAL)
+		return 0;
+
+	if (ret)
+		return ret;
+
+	if (polarity & ~0xf)
+		return -EINVAL;
+
+	phy_lock_mdio_bus(phydev);
+	val = __phy_package_read_mmd(phydev, 0, MDIO_MMD_VEND1,
+				     RTL8224_VND1_MDI_POLARITY_SWAP);
+	if (val < 0) {
+		ret = val;
+		goto exit;
+	}
+
+	val &= ~(0xf << offset);
+	val |= polarity << offset;
+	ret = __phy_package_write_mmd(phydev, 0, MDIO_MMD_VEND1,
+				      RTL8224_VND1_MDI_POLARITY_SWAP, val);
+exit:
+	phy_unlock_mdio_bus(phydev);
+	return ret;
+}
+
 static int rtl8224_config_init(struct phy_device *phydev)
 {
-	return rtl8224_mdi_config_order(phydev);
+	int ret;
+
+	ret = rtl8224_mdi_config_order(phydev);
+	if (ret)
+		return ret;
+
+	return rtl8224_mdi_config_polarity(phydev);
 }
 
 static int rtl8224_probe(struct phy_device *phydev)
-- 
2.47.3


^ permalink raw reply related

* [PATCH v6 3/4] dt-bindings: net: ethernet-phy: add property enet-phy-pair-polarity
From: Damien Dejean @ 2026-02-07  9:25 UTC (permalink / raw)
  To: andrew, krzk+dt, robh, kuba, maxime.chevallier
  Cc: netdev, devicetree, linux-kernel, edumazet, davem, pabeni,
	hkallweit1, Damien Dejean
In-Reply-To: <20260207092539.647768-1-dam.dejean@gmail.com>

Add the property enet-phy-pair-polarity to describe the polarity of the
PHY pairs. To ease PCB design, some manufacturers allow the pairs to be
wired with reversed polarity and provide a way to configure it.

The property 'enet-phy-pair-polarity' sets the polarity of each pair.
Bits 0 to 3 configure the polarity of pairs A to D; if a bit is set to 1,
the polarity of the corresponding pair is reversed.

Signed-off-by: Damien Dejean <dam.dejean@gmail.com>
---
 Documentation/devicetree/bindings/net/ethernet-phy.yaml | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/ethernet-phy.yaml b/Documentation/devicetree/bindings/net/ethernet-phy.yaml
index 4a27547f7d7a..21a1a63506f0 100644
--- a/Documentation/devicetree/bindings/net/ethernet-phy.yaml
+++ b/Documentation/devicetree/bindings/net/ethernet-phy.yaml
@@ -132,6 +132,14 @@ properties:
     description:
       For normal (0) or reverse (1) order of the pairs (ABCD -> DCBA).
 
+  enet-phy-pair-polarity:
+    $ref: /schemas/types.yaml#/definitions/uint32
+    maximum: 0xf
+    description:
+      A bitmap to describe pair polarity swap. Bit 0 to swap polarity of pair A,
+      bit 1 to swap polarity of pair B, bit 2 to swap polarity of pair C and bit
+      3 to swap polarity of pair D.
+
   eee-broken-100tx:
     $ref: /schemas/types.yaml#/definitions/flag
     description:
-- 
2.47.3


^ permalink raw reply related

* [PATCH v6 2/4] net: phy: realtek: add RTL8224 pair order support
From: Damien Dejean @ 2026-02-07  9:25 UTC (permalink / raw)
  To: andrew, krzk+dt, robh, kuba, maxime.chevallier
  Cc: netdev, devicetree, linux-kernel, edumazet, davem, pabeni,
	hkallweit1, Damien Dejean
In-Reply-To: <20260207092539.647768-1-dam.dejean@gmail.com>

The RTL8224 has a register to configure a pair swap (from ABCD order to
DCBA), giving PCB designers more flexibility when wiring the chip. The
swap parameter has to be set correctly for each of the 4 ports before
the chip can detect a link.

After a reset, this register is (unfortunately) left in a random state,
so it has to be initialized. On most devices the bootloader does this
once and for all and we can rely on the value it set; on others it does
not, and the kernel has to do it.

The MDI pair swap can be set in the device tree using the property
enet-phy-pair-order. The property is set to 0 to keep the default order
(ABCD), or 1 to reverse the pairs (DCBA).

Signed-off-by: Damien Dejean <dam.dejean@gmail.com>
---
 drivers/net/phy/realtek/Kconfig        |  1 +
 drivers/net/phy/realtek/realtek_main.c | 55 ++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/drivers/net/phy/realtek/Kconfig b/drivers/net/phy/realtek/Kconfig
index b05c2a1e9024..a741b34d193e 100644
--- a/drivers/net/phy/realtek/Kconfig
+++ b/drivers/net/phy/realtek/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config REALTEK_PHY
 	tristate "Realtek PHYs"
+	select PHY_PACKAGE
 	help
 	  Currently supports RTL821x/RTL822x and fast ethernet PHYs
 
diff --git a/drivers/net/phy/realtek/realtek_main.c b/drivers/net/phy/realtek/realtek_main.c
index 75565fbdbf6d..4f0c1b72f7e0 100644
--- a/drivers/net/phy/realtek/realtek_main.c
+++ b/drivers/net/phy/realtek/realtek_main.c
@@ -171,6 +171,8 @@
 
 #define RTL8224_SRAM_RTCT_LEN(pair)		(0x8028 + (pair) * 4)
 
+#define RTL8224_VND1_MDI_PAIR_SWAP		0xa90
+
 #define RTL8366RB_POWER_SAVE			0x15
 #define RTL8366RB_POWER_SAVE_ON			BIT(12)
 
@@ -1820,6 +1822,57 @@ static int rtl8224_cable_test_get_status(struct phy_device *phydev, bool *finish
 	return rtl8224_cable_test_report(phydev, finished);
 }
 
+static int rtl8224_mdi_config_order(struct phy_device *phydev)
+{
+	struct device_node *np = phydev->mdio.dev.of_node;
+	u8 port_offset = phydev->mdio.addr & 3;
+	u32 order = 0;
+	int ret, val;
+
+	ret = of_property_read_u32(np, "enet-phy-pair-order", &order);
+
+	/* Do nothing in case the property is not present */
+	if (ret == -EINVAL)
+		return 0;
+
+	if (ret)
+		return ret;
+
+	if (order & ~1)
+		return -EINVAL;
+
+	phy_lock_mdio_bus(phydev);
+	val = __phy_package_read_mmd(phydev, 0, MDIO_MMD_VEND1,
+				     RTL8224_VND1_MDI_PAIR_SWAP);
+	if (val < 0) {
+		ret = val;
+		goto exit;
+	}
+
+	if (order)
+		val |= (1 << port_offset);
+	else
+		val &= ~(1 << port_offset);
+
+	ret = __phy_package_write_mmd(phydev, 0, MDIO_MMD_VEND1,
+				      RTL8224_VND1_MDI_PAIR_SWAP, val);
+exit:
+	phy_unlock_mdio_bus(phydev);
+	return ret;
+}
+
+static int rtl8224_config_init(struct phy_device *phydev)
+{
+	return rtl8224_mdi_config_order(phydev);
+}
+
+static int rtl8224_probe(struct phy_device *phydev)
+{
+	/* Chip exposes 4 ports, join all of them in the same package */
+	return devm_phy_package_join(&phydev->mdio.dev, phydev,
+				     phydev->mdio.addr & ~3, 0);
+}
+
 static bool rtlgen_supports_2_5gbps(struct phy_device *phydev)
 {
 	int val;
@@ -2392,6 +2445,8 @@ static struct phy_driver realtek_drvs[] = {
 		PHY_ID_MATCH_EXACT(0x001ccad0),
 		.name		= "RTL8224 2.5Gbps PHY",
 		.flags		= PHY_POLL_CABLE_TEST,
+		.probe		= rtl8224_probe,
+		.config_init	= rtl8224_config_init,
 		.get_features	= rtl822x_c45_get_features,
 		.config_aneg	= rtl822x_c45_config_aneg,
 		.read_status	= rtl822x_c45_read_status,
-- 
2.47.3


^ permalink raw reply related

* [PATCH v6 1/4] dt-bindings: net: ethernet-phy: add property enet-phy-pair-order
From: Damien Dejean @ 2026-02-07  9:25 UTC (permalink / raw)
  To: andrew, krzk+dt, robh, kuba, maxime.chevallier
  Cc: netdev, devicetree, linux-kernel, edumazet, davem, pabeni,
	hkallweit1, Damien Dejean

Add property enet-phy-pair-order to the device tree bindings to define
the pair order of the PHY. To simplify PCB design, some manufacturers
allow the pairs to be wired in reverse order and the order to be changed
in software.

The property can be set to 0 to force the normal pair order (ABCD), or 1
to force the reverse pair order (DCBA).

Signed-off-by: Damien Dejean <dam.dejean@gmail.com>
---
 Documentation/devicetree/bindings/net/ethernet-phy.yaml | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/ethernet-phy.yaml b/Documentation/devicetree/bindings/net/ethernet-phy.yaml
index 58634fee9fc4..4a27547f7d7a 100644
--- a/Documentation/devicetree/bindings/net/ethernet-phy.yaml
+++ b/Documentation/devicetree/bindings/net/ethernet-phy.yaml
@@ -126,6 +126,12 @@ properties:
       e.g. wrong bootstrap configuration caused by issues in PCB
       layout design.
 
+  enet-phy-pair-order:
+    $ref: /schemas/types.yaml#/definitions/uint32
+    enum: [0, 1]
+    description:
+      For normal (0) or reverse (1) order of the pairs (ABCD -> DCBA).
+
   eee-broken-100tx:
     $ref: /schemas/types.yaml#/definitions/flag
     description:
-- 
2.47.3


^ permalink raw reply related

* [PATCH net-next] ppp: don't byte-swap at run time
From: Qingfang Deng @ 2026-02-07  6:47 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-ppp, netdev, linux-kernel

Currently, the code loads the protocol number from a skb and converts it
to host-endian for comparison. This requires runtime byte swapping on
little-endian architectures.

Optimize this by comparing the protocol number directly to
constant-folded big-endian values. This reduces code size, and slightly
improves performance in the fastpath. ppp_ioctl() still takes a
host-endian int, so keep the old function for it.
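
The idea can be shown with a small userspace sketch. It is not the
kernel code: CONST_HTONS() is a hypothetical compile-time stand-in for
the kernel's constant-folded htons(), load_proto_be() stands in for
get_unaligned(), and only two protocols are handled. The byte-order test
relies on the GCC/Clang predefined macro __BYTE_ORDER__.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PPP_IP		0x0021
#define PPP_IPV6	0x0057

/* Compile-time host-to-big-endian conversion for 16-bit constants. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define CONST_HTONS(x) \
	((uint16_t)((((x) & 0xffu) << 8) | (((x) >> 8) & 0xffu)))
#else
#define CONST_HTONS(x) ((uint16_t)(x))
#endif

enum { NP_IP, NP_IPV6 };

/* Load the first two bytes exactly as they sit on the wire, no swapping. */
uint16_t load_proto_be(const unsigned char *data)
{
	uint16_t raw;

	assert(data);
	memcpy(&raw, data, sizeof(raw));
	return raw;
}

/* Compare against pre-swapped constants; no byte swap happens at run time. */
int proto_to_npindex(uint16_t proto_be)
{
	switch (proto_be) {
	case CONST_HTONS(PPP_IP):
		return NP_IP;
	case CONST_HTONS(PPP_IPV6):
		return NP_IPV6;
	}
	return -1;
}
```

The swap is moved from the loaded value (unknown until run time) to the
case labels (known at build time), so the compiler can fold it away.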

bloat-o-meter analysis on an x86_64 build:
add/remove: 0/0 grow/shrink: 0/6 up/down: 0/-131 (-131)
Function                                     old     new   delta
ppp_receive_nonmp_frame                     2002    2000      -2
ppp_input                                    641     639      -2
npindex_to_proto                              24      12     -12
npindex_to_ethertype                          24      12     -12
ppp_start_xmit                               375     344     -31
__ppp_xmit_process                          1881    1809     -72
Total: Before=22998, After=22867, chg -0.57%

Signed-off-by: Qingfang Deng <dqfext@gmail.com>
---
 drivers/net/ppp/ppp_generic.c | 109 ++++++++++++++++++++--------------
 1 file changed, 65 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index f8814d7be6f1..eca9cd6f3a87 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -239,7 +239,7 @@ struct ppp_net {
 };
 
 /* Get the PPP protocol number from a skb */
-#define PPP_PROTO(skb)	get_unaligned_be16((skb)->data)
+#define PPP_PROTO(skb)	get_unaligned((__be16 *)(skb)->data)
 
 /* We limit the length of ppp->file.rq to this (arbitrary) value */
 #define PPP_MAX_RQLEN	32
@@ -312,7 +312,26 @@ static inline struct ppp_net *ppp_pernet(struct net *net)
 }
 
 /* Translates a PPP protocol number to a NP index (NP == network protocol) */
-static inline int proto_to_npindex(int proto)
+static __always_inline int proto_to_npindex(__be16 proto)
+{
+	switch (proto) {
+	case htons(PPP_IP):
+		return NP_IP;
+	case htons(PPP_IPV6):
+		return NP_IPV6;
+	case htons(PPP_IPX):
+		return NP_IPX;
+	case htons(PPP_AT):
+		return NP_AT;
+	case htons(PPP_MPLS_UC):
+		return NP_MPLS_UC;
+	case htons(PPP_MPLS_MC):
+		return NP_MPLS_MC;
+	}
+	return -EINVAL;
+}
+
+static __always_inline int proto_to_npindex_user(int proto)
 {
 	switch (proto) {
 	case PPP_IP:
@@ -332,44 +351,44 @@ static inline int proto_to_npindex(int proto)
 }
 
 /* Translates an NP index into a PPP protocol number */
-static const int npindex_to_proto[NUM_NP] = {
-	PPP_IP,
-	PPP_IPV6,
-	PPP_IPX,
-	PPP_AT,
-	PPP_MPLS_UC,
-	PPP_MPLS_MC,
+static const __be16 npindex_to_proto[NUM_NP] = {
+	htons(PPP_IP),
+	htons(PPP_IPV6),
+	htons(PPP_IPX),
+	htons(PPP_AT),
+	htons(PPP_MPLS_UC),
+	htons(PPP_MPLS_MC),
 };
 
 /* Translates an ethertype into an NP index */
-static inline int ethertype_to_npindex(int ethertype)
+static inline int ethertype_to_npindex(__be16 ethertype)
 {
 	switch (ethertype) {
-	case ETH_P_IP:
+	case htons(ETH_P_IP):
 		return NP_IP;
-	case ETH_P_IPV6:
+	case htons(ETH_P_IPV6):
 		return NP_IPV6;
-	case ETH_P_IPX:
+	case htons(ETH_P_IPX):
 		return NP_IPX;
-	case ETH_P_PPPTALK:
-	case ETH_P_ATALK:
+	case htons(ETH_P_PPPTALK):
+	case htons(ETH_P_ATALK):
 		return NP_AT;
-	case ETH_P_MPLS_UC:
+	case htons(ETH_P_MPLS_UC):
 		return NP_MPLS_UC;
-	case ETH_P_MPLS_MC:
+	case htons(ETH_P_MPLS_MC):
 		return NP_MPLS_MC;
 	}
 	return -1;
 }
 
 /* Translates an NP index into an ethertype */
-static const int npindex_to_ethertype[NUM_NP] = {
-	ETH_P_IP,
-	ETH_P_IPV6,
-	ETH_P_IPX,
-	ETH_P_PPPTALK,
-	ETH_P_MPLS_UC,
-	ETH_P_MPLS_MC,
+static const __be16 npindex_to_ethertype[NUM_NP] = {
+	htons(ETH_P_IP),
+	htons(ETH_P_IPV6),
+	htons(ETH_P_IPX),
+	htons(ETH_P_PPPTALK),
+	htons(ETH_P_MPLS_UC),
+	htons(ETH_P_MPLS_MC),
 };
 
 /*
@@ -504,7 +523,7 @@ static bool ppp_check_packet(struct sk_buff *skb, size_t count)
 	/* LCP packets must include LCP header which 4 bytes long:
 	 * 1-byte code, 1-byte identifier, and 2-byte length.
 	 */
-	return get_unaligned_be16(skb->data) != PPP_LCP ||
+	return PPP_PROTO(skb) != htons(PPP_LCP) ||
 		count >= PPP_PROTO_LEN + PPP_LCP_HDRLEN;
 }
 
@@ -914,7 +933,7 @@ static long ppp_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case PPPIOCSNPMODE:
 		if (copy_from_user(&npi, argp, sizeof(npi)))
 			break;
-		err = proto_to_npindex(npi.protocol);
+		err = proto_to_npindex_user(npi.protocol);
 		if (err < 0)
 			break;
 		i = err;
@@ -1451,10 +1470,10 @@ static netdev_tx_t
 ppp_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct ppp *ppp = netdev_priv(dev);
-	int npi, proto;
-	unsigned char *pp;
+	__be16 *pp, proto;
+	int npi;
 
-	npi = ethertype_to_npindex(ntohs(skb->protocol));
+	npi = ethertype_to_npindex(skb->protocol);
 	if (npi < 0)
 		goto outf;
 
@@ -1478,7 +1497,7 @@ ppp_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	pp = skb_push(skb, 2);
 	proto = npindex_to_proto[npi];
-	put_unaligned_be16(proto, pp);
+	put_unaligned(proto, pp);
 
 	skb_scrub_packet(skb, !net_eq(ppp->ppp_net, dev_net(dev)));
 	ppp_xmit_process(ppp, skb);
@@ -1764,14 +1783,14 @@ pad_compress_skb(struct ppp *ppp, struct sk_buff *skb)
 static void
 ppp_send_frame(struct ppp *ppp, struct sk_buff *skb)
 {
-	int proto = PPP_PROTO(skb);
+	__be16 proto = PPP_PROTO(skb);
 	struct sk_buff *new_skb;
 	int len;
 	unsigned char *cp;
 
 	skb->dev = ppp->dev;
 
-	if (proto < 0x8000) {
+	if (!(proto & htons(0x8000))) {
 #ifdef CONFIG_PPP_FILTER
 		/* check if the packet passes the pass and active filters.
 		 * See comment for PPP_FILTER_OUTBOUND_TAG above.
@@ -2324,7 +2343,7 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
 {
 	struct channel *pch = chan->ppp;
 	struct ppp *ppp;
-	int proto;
+	__be16 proto;
 
 	if (!pch) {
 		kfree_skb(skb);
@@ -2347,7 +2366,8 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
 	}
 
 	proto = PPP_PROTO(skb);
-	if (!ppp || proto >= 0xc000 || proto == PPP_CCPFRAG) {
+	if (!ppp || (proto & htons(0xc000)) == htons(0xc000) ||
+	    proto == htons(PPP_CCPFRAG)) {
 		/* put it on the channel queue */
 		skb_queue_tail(&pch->file.rq, skb);
 		/* drop old frames if queue too long */
@@ -2399,7 +2419,7 @@ ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
 		skb_checksum_complete_unset(skb);
 #ifdef CONFIG_PPP_MULTILINK
 		/* XXX do channel-level decompression here */
-		if (PPP_PROTO(skb) == PPP_MP)
+		if (PPP_PROTO(skb) == htons(PPP_MP))
 			ppp_receive_mp_frame(ppp, skb, pch);
 		else
 #endif /* CONFIG_PPP_MULTILINK */
@@ -2422,7 +2442,8 @@ static void
 ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
 {
 	struct sk_buff *ns;
-	int proto, len, npi;
+	int len, npi;
+	__be16 proto;
 
 	/*
 	 * Decompress the frame, if compressed.
@@ -2441,7 +2462,7 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
 	 */
 	proto = PPP_PROTO(skb);
 	switch (proto) {
-	case PPP_VJC_COMP:
+	case htons(PPP_VJC_COMP):
 		/* decompress VJ compressed packets */
 		if (!ppp->vj || (ppp->flags & SC_REJ_COMP_TCP))
 			goto err;
@@ -2473,10 +2494,10 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
 			skb_put(skb, len - skb->len);
 		else if (len < skb->len)
 			skb_trim(skb, len);
-		proto = PPP_IP;
+		proto = htons(PPP_IP);
 		break;
 
-	case PPP_VJC_UNCOMP:
+	case htons(PPP_VJC_UNCOMP):
 		if (!ppp->vj || (ppp->flags & SC_REJ_COMP_TCP))
 			goto err;
 
@@ -2490,10 +2511,10 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
 			netdev_err(ppp->dev, "PPP: VJ uncompressed error\n");
 			goto err;
 		}
-		proto = PPP_IP;
+		proto = htons(PPP_IP);
 		break;
 
-	case PPP_CCP:
+	case htons(PPP_CCP):
 		ppp_ccp_peek(ppp, skb, 1);
 		break;
 	}
@@ -2546,7 +2567,7 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
 			/* chop off protocol */
 			skb_pull_rcsum(skb, 2);
 			skb->dev = ppp->dev;
-			skb->protocol = htons(npindex_to_ethertype[npi]);
+			skb->protocol = npindex_to_ethertype[npi];
 			skb_reset_mac_header(skb);
 			skb_scrub_packet(skb, !net_eq(ppp->ppp_net,
 						      dev_net(ppp->dev)));
@@ -2563,7 +2584,7 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
 static struct sk_buff *
 ppp_decompress_frame(struct ppp *ppp, struct sk_buff *skb)
 {
-	int proto = PPP_PROTO(skb);
+	__be16 proto = PPP_PROTO(skb);
 	struct sk_buff *ns;
 	int len;
 
@@ -2573,7 +2594,7 @@ ppp_decompress_frame(struct ppp *ppp, struct sk_buff *skb)
 	if (!pskb_may_pull(skb, skb->len))
 		goto err;
 
-	if (proto == PPP_COMP) {
+	if (proto == htons(PPP_COMP)) {
 		int obuff_size;
 
 		switch(ppp->rcomp->compress_proto) {
-- 
2.43.0
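
The two range checks above (proto < 0x8000 and proto >= 0xc000) become
bit-mask tests once proto is kept in network byte order. A minimal
userspace sketch of why the two forms agree follows. It assumes plain
uint16_t in place of the kernel's __be16 and htons()/ntohs() from
<arpa/inet.h>; the helper names are hypothetical and are not part of the
patch.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Host-order form, as used before the patch: convert, then compare. */
static int is_data_proto_host(uint16_t proto_be)
{
	return ntohs(proto_be) < 0x8000;
}

static int is_ctrl_proto_host(uint16_t proto_be)
{
	return ntohs(proto_be) >= 0xc000;
}

/* Network-order form, as used after the patch: mask with a swapped constant. */
static int is_data_proto_be(uint16_t proto_be)
{
	/* value < 0x8000 is the same as "top bit clear" */
	return !(proto_be & htons(0x8000));
}

static int is_ctrl_proto_be(uint16_t proto_be)
{
	/* value >= 0xc000 is the same as "top two bits set" */
	return (proto_be & htons(0xc000)) == htons(0xc000);
}

/* Exhaustively compare both forms over every 16-bit protocol value. */
static void check_equivalence(void)
{
	unsigned int v;

	for (v = 0; v <= 0xffff; v++) {
		uint16_t be = htons((uint16_t)v);

		assert(is_data_proto_host(be) == is_data_proto_be(be));
		assert(is_ctrl_proto_host(be) == is_ctrl_proto_be(be));
	}
}
```

Calling check_equivalence() walks all 65536 values. The mask form works
on either endianness because the constant is byte-swapped in the same way
as the value it is tested against.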

