* Re: [PATCH] net: correcting section tags for .init and .exit data/functions
From: Nathan Chancellor @ 2026-06-13 17:01 UTC
To: xur
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Neal Cardwell, Kuniyuki Iwashima, Willem de Bruijn,
David Ahern, Ido Schimmel, Andreas Färber,
Manivannan Sadhasivam, Nick Desaulniers, Bill Wendling,
Justin Stitt, Maciej Żenczykowski, Yue Haibing, Jeff Layton,
Kees Cook, Fernando Fernandez Mancera, Gustavo A. R. Silva,
Sabrina Dubroca, Masahiro Yamada, Nicolas Schier, netdev,
linux-kernel, linux-arm-kernel, linux-actions, llvm,
kernel test robot
In-Reply-To: <20260612162257.896792-1-xur@google.com>
Hi Rong,
On Fri, Jun 12, 2026 at 09:22:57AM -0700, xur@google.com wrote:
> From: Rong Xu <xur@google.com>
>
> Fix modpost warnings that have surfaced during Clang's distributed ThinLTO
> builds.
>
> WARNING: modpost: vmlinux: section mismatch in reference: tcp4_net_ops.llvm.4527429266264891517+0x8 (section: .data) -> tcp4_proc_init_net (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: udp4_net_ops.llvm.17425824324074326067+0x8 (section: .data) -> udp4_proc_init_net (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: ping_v4_net_ops.llvm.5641696707737373282+0x8 (section: .data) -> ping_v4_proc_init_net (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: if6_proc_net_ops.llvm.7870945277386035298+0x8 (section: .data) -> if6_proc_net_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: ipv6_addr_label_ops.llvm.5745897517271459135+0x8 (section: .data) -> ip6addrlbl_net_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: ndisc_net_ops.llvm.8806210167060761094+0x8 (section: .data) -> ndisc_net_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: raw6_net_ops.llvm.3743523335772203324+0x8 (section: .data) -> raw6_init_net (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: igmp6_net_ops.llvm.7071106350580158050+0x8 (section: .data) -> igmp6_net_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: tcpv6_net_ops.llvm.17505177970592326146+0x8 (section: .data) -> tcpv6_net_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: ip6_flowlabel_net_ops.llvm.6051723423336054316+0x8 (section: .data) -> ip6_flowlabel_proc_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: ipv6_proc_ops.llvm.7829948594772821810+0x8 (section: .data) -> ipv6_proc_init_net (section: .init.text)
>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202606111233.kM8oo8Df-lkp@intel.com/
> Signed-off-by: Rong Xu <xur@google.com>
Thanks for sending this change to try and clear up those new warnings
from the distributed ThinLTO build. Based on the build reports for this
change further down the thread, it does not seem to be quite right.
Additionally, I think the commit message could be a little more
descriptive about the root cause of the warnings and how this patch
actually addresses it (I can infer it, but I think that information
should be front and center).
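One possible reading of the root cause, offered as an assumption rather
than something this thread confirms: modpost tolerates .data -> .init.text
references when the referencing symbol's name matches a glob such as
"*_ops", and the ".llvm.<hash>" suffix visible in the warnings above
defeats that match. The sketch below models such a check with POSIX
fnmatch(); name_looks_like_ops() is a hypothetical helper, not modpost's
actual code.

```c
#include <fnmatch.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true when a symbol name matches the "*_ops" glob
 * that is assumed here to exempt ops tables from the mismatch warning.
 */
bool name_looks_like_ops(const char *sym)
{
	return fnmatch("*_ops", sym, 0) == 0;
}
```

Under that assumption, "tcp4_net_ops" would pass the check while
"tcp4_net_ops.llvm.4527429266264891517" would not, which would be
consistent with the warnings only appearing in the distributed ThinLTO
build.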
> ---
> net/ipv4/ping.c | 6 +++---
> net/ipv4/tcp_ipv4.c | 6 +++---
> net/ipv4/udp.c | 6 +++---
> net/ipv6/addrconf.c | 6 +++---
> net/ipv6/addrlabel.c | 6 +++---
> net/ipv6/ip6_flowlabel.c | 6 +++---
> net/ipv6/mcast.c | 10 +++++-----
> net/ipv6/ndisc.c | 10 +++++-----
> net/ipv6/proc.c | 6 +++---
> net/ipv6/raw.c | 6 +++---
> net/ipv6/tcp_ipv6.c | 6 +++---
> 11 files changed, 37 insertions(+), 37 deletions(-)
>
> diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
> index d36f1e273fde..1dda6d661ad8 100644
> --- a/net/ipv4/ping.c
> +++ b/net/ipv4/ping.c
> @@ -1144,17 +1144,17 @@ static void __net_exit ping_v4_proc_exit_net(struct net *net)
> remove_proc_entry("icmp", net->proc_net);
> }
>
> -static struct pernet_operations ping_v4_net_ops = {
> +static struct pernet_operations ping_v4_net_ops __net_initdata = {
> .init = ping_v4_proc_init_net,
> .exit = ping_v4_proc_exit_net,
> };
>
> -int __init ping_proc_init(void)
> +int __net_init ping_proc_init(void)
> {
> return register_pernet_subsys(&ping_v4_net_ops);
> }
>
> -void ping_proc_exit(void)
> +void __net_exit ping_proc_exit(void)
> {
> unregister_pernet_subsys(&ping_v4_net_ops);
> }
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fdc81150ff6c..9caca5879466 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3317,17 +3317,17 @@ static void __net_exit tcp4_proc_exit_net(struct net *net)
> remove_proc_entry("tcp", net->proc_net);
> }
>
> -static struct pernet_operations tcp4_net_ops = {
> +static struct pernet_operations tcp4_net_ops __net_initdata = {
> .init = tcp4_proc_init_net,
> .exit = tcp4_proc_exit_net,
> };
>
> -int __init tcp4_proc_init(void)
> +int __net_init tcp4_proc_init(void)
> {
> return register_pernet_subsys(&tcp4_net_ops);
> }
>
> -void tcp4_proc_exit(void)
> +void __net_exit tcp4_proc_exit(void)
> {
> unregister_pernet_subsys(&tcp4_net_ops);
> }
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 70f6cbd4ef73..87f4cced2114 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3600,17 +3600,17 @@ static void __net_exit udp4_proc_exit_net(struct net *net)
> remove_proc_entry("udp", net->proc_net);
> }
>
> -static struct pernet_operations udp4_net_ops = {
> +static struct pernet_operations udp4_net_ops __net_initdata = {
> .init = udp4_proc_init_net,
> .exit = udp4_proc_exit_net,
> };
>
> -int __init udp4_proc_init(void)
> +int __net_init udp4_proc_init(void)
> {
> return register_pernet_subsys(&udp4_net_ops);
> }
>
> -void udp4_proc_exit(void)
> +void __net_exit udp4_proc_exit(void)
> {
> unregister_pernet_subsys(&udp4_net_ops);
> }
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index c9e5d3e48ab9..73d9439bd408 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -4527,17 +4527,17 @@ static void __net_exit if6_proc_net_exit(struct net *net)
> remove_proc_entry("if_inet6", net->proc_net);
> }
>
> -static struct pernet_operations if6_proc_net_ops = {
> +static struct pernet_operations if6_proc_net_ops __net_initdata = {
> .init = if6_proc_net_init,
> .exit = if6_proc_net_exit,
> };
>
> -int __init if6_proc_init(void)
> +int __net_init if6_proc_init(void)
> {
> return register_pernet_subsys(&if6_proc_net_ops);
> }
>
> -void if6_proc_exit(void)
> +void __net_exit if6_proc_exit(void)
> {
> unregister_pernet_subsys(&if6_proc_net_ops);
> }
> diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
> index f4b2618446bd..50f6c1b1edaa 100644
> --- a/net/ipv6/addrlabel.c
> +++ b/net/ipv6/addrlabel.c
> @@ -340,17 +340,17 @@ static void __net_exit ip6addrlbl_net_exit(struct net *net)
> spin_unlock(&net->ipv6.ip6addrlbl_table.lock);
> }
>
> -static struct pernet_operations ipv6_addr_label_ops = {
> +static struct pernet_operations ipv6_addr_label_ops __net_initdata = {
> .init = ip6addrlbl_net_init,
> .exit = ip6addrlbl_net_exit,
> };
>
> -int __init ipv6_addr_label_init(void)
> +int __net_init ipv6_addr_label_init(void)
> {
> return register_pernet_subsys(&ipv6_addr_label_ops);
> }
>
> -void ipv6_addr_label_cleanup(void)
> +void __net_exit ipv6_addr_label_cleanup(void)
> {
> unregister_pernet_subsys(&ipv6_addr_label_ops);
> }
> diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
> index b1ccdf0dc646..f6980c403c68 100644
> --- a/net/ipv6/ip6_flowlabel.c
> +++ b/net/ipv6/ip6_flowlabel.c
> @@ -903,17 +903,17 @@ static void __net_exit ip6_flowlabel_net_exit(struct net *net)
> ip6_flowlabel_proc_fini(net);
> }
>
> -static struct pernet_operations ip6_flowlabel_net_ops = {
> +static struct pernet_operations ip6_flowlabel_net_ops __net_initdata = {
> .init = ip6_flowlabel_proc_init,
> .exit = ip6_flowlabel_net_exit,
> };
>
> -int ip6_flowlabel_init(void)
> +int __net_init ip6_flowlabel_init(void)
> {
> return register_pernet_subsys(&ip6_flowlabel_net_ops);
> }
>
> -void ip6_flowlabel_cleanup(void)
> +void __net_exit ip6_flowlabel_cleanup(void)
> {
> static_key_deferred_flush(&ipv6_flowlabel_exclusive);
> timer_delete(&ip6_fl_gc_timer);
> diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
> index d9b855d5191b..eef5bab1ee13 100644
> --- a/net/ipv6/mcast.c
> +++ b/net/ipv6/mcast.c
> @@ -3209,12 +3209,12 @@ static void __net_exit igmp6_net_exit(struct net *net)
> igmp6_proc_exit(net);
> }
>
> -static struct pernet_operations igmp6_net_ops = {
> +static struct pernet_operations igmp6_net_ops __net_initdata = {
> .init = igmp6_net_init,
> .exit = igmp6_net_exit,
> };
>
> -int __init igmp6_init(void)
> +int __net_init igmp6_init(void)
> {
> int err;
>
> @@ -3231,18 +3231,18 @@ int __init igmp6_init(void)
> return err;
> }
>
> -int __init igmp6_late_init(void)
> +int __net_init igmp6_late_init(void)
> {
> return register_netdevice_notifier(&igmp6_netdev_notifier);
> }
>
> -void igmp6_cleanup(void)
> +void __net_exit igmp6_cleanup(void)
> {
> unregister_pernet_subsys(&igmp6_net_ops);
> destroy_workqueue(mld_wq);
> }
>
> -void igmp6_late_cleanup(void)
> +void __net_exit igmp6_late_cleanup(void)
> {
> unregister_netdevice_notifier(&igmp6_netdev_notifier);
> }
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index e7ad13c5bd26..3a83280db29d 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -1994,12 +1994,12 @@ static void __net_exit ndisc_net_exit(struct net *net)
> inet_ctl_sock_destroy(net->ipv6.ndisc_sk);
> }
>
> -static struct pernet_operations ndisc_net_ops = {
> +static struct pernet_operations ndisc_net_ops __net_initdata = {
> .init = ndisc_net_init,
> .exit = ndisc_net_exit,
> };
>
> -int __init ndisc_init(void)
> +int __net_init ndisc_init(void)
> {
> int err;
>
> @@ -2027,17 +2027,17 @@ int __init ndisc_init(void)
> #endif
> }
>
> -int __init ndisc_late_init(void)
> +int __net_init ndisc_late_init(void)
> {
> return register_netdevice_notifier(&ndisc_netdev_notifier);
> }
>
> -void ndisc_late_cleanup(void)
> +void __net_exit ndisc_late_cleanup(void)
> {
> unregister_netdevice_notifier(&ndisc_netdev_notifier);
> }
>
> -void ndisc_cleanup(void)
> +void __net_exit ndisc_cleanup(void)
> {
> #ifdef CONFIG_SYSCTL
> neigh_sysctl_unregister(&nd_tbl.parms);
> diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
> index 813013ca4e75..c59bade608cd 100644
> --- a/net/ipv6/proc.c
> +++ b/net/ipv6/proc.c
> @@ -298,17 +298,17 @@ static void __net_exit ipv6_proc_exit_net(struct net *net)
> remove_proc_entry("snmp6", net->proc_net);
> }
>
> -static struct pernet_operations ipv6_proc_ops = {
> +static struct pernet_operations ipv6_proc_ops __net_initdata = {
> .init = ipv6_proc_init_net,
> .exit = ipv6_proc_exit_net,
> };
>
> -int __init ipv6_misc_proc_init(void)
> +int __net_init ipv6_misc_proc_init(void)
> {
> return register_pernet_subsys(&ipv6_proc_ops);
> }
>
> -void ipv6_misc_proc_exit(void)
> +void __net_exit ipv6_misc_proc_exit(void)
> {
> unregister_pernet_subsys(&ipv6_proc_ops);
> }
> diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
> index 3cc58698cbbd..fe399675b8fc 100644
> --- a/net/ipv6/raw.c
> +++ b/net/ipv6/raw.c
> @@ -1256,17 +1256,17 @@ static void __net_exit raw6_exit_net(struct net *net)
> remove_proc_entry("raw6", net->proc_net);
> }
>
> -static struct pernet_operations raw6_net_ops = {
> +static struct pernet_operations raw6_net_ops __net_initdata = {
> .init = raw6_init_net,
> .exit = raw6_exit_net,
> };
>
> -int __init raw6_proc_init(void)
> +int __net_init raw6_proc_init(void)
> {
> return register_pernet_subsys(&raw6_net_ops);
> }
>
> -void raw6_proc_exit(void)
> +void __net_exit raw6_proc_exit(void)
> {
> unregister_pernet_subsys(&raw6_net_ops);
> }
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 36d75fb50a70..d0737f16076b 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -2335,12 +2335,12 @@ static void __net_exit tcpv6_net_exit(struct net *net)
> inet_ctl_sock_destroy(net->ipv6.tcp_sk);
> }
>
> -static struct pernet_operations tcpv6_net_ops = {
> +static struct pernet_operations tcpv6_net_ops __net_initdata = {
> .init = tcpv6_net_init,
> .exit = tcpv6_net_exit,
> };
>
> -int __init tcpv6_init(void)
> +int __net_init tcpv6_init(void)
> {
> int ret;
>
> @@ -2378,7 +2378,7 @@ int __init tcpv6_init(void)
> goto out;
> }
>
> -void tcpv6_exit(void)
> +void __net_exit tcpv6_exit(void)
> {
> unregister_pernet_subsys(&tcpv6_net_ops);
> inet6_unregister_protosw(&tcpv6_protosw);
>
> base-commit: 2b414a95b8f7307d42173ba9e580d6d3e2bcbfce
> --
> 2.54.0.1136.gdb2ca164c4-goog
>
>
--
Cheers,
Nathan
* [PATCH net-next v2 3/3] docs: net: fix minor issues with strparser docs
From: Jakub Kicinski @ 2026-06-13 16:58 UTC
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
john.fastabend, sd, jiri, Jakub Kicinski, skhan
In-Reply-To: <20260613165846.2913092-1-kuba@kernel.org>
Not sure if anyone would read this doc, but the API has evolved
since it was written. Update to:
- show the int return type for strp_init()
- refer to strp_data_ready(), not the old strp_tcp_data_ready() name
- direct users to strp_msg(skb) for strparser metadata instead of
treating skb->cb as struct strp_msg directly
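
To illustrate why the accessor matters, here is a minimal userspace
sketch. It assumes the strparser metadata sits at some offset inside
the control buffer rather than at its start; every type, name and
size below (mock_cb_skb, MOCK_CB_PRIV_LEN, ...) is a made-up stand-in and
not a copy of the kernel definitions.

```c
#include <stddef.h>
#include <string.h>

#define MOCK_CB_PRIV_LEN 20	/* assumed size of owner-private bytes */

struct mock_strp_msg {
	int full_len;
	int offset;
};

/* Assumed layout: private bytes first, strparser metadata after them. */
struct mock_cb_layout {
	unsigned char priv[MOCK_CB_PRIV_LEN];
	struct mock_strp_msg strp;
};

struct mock_cb_skb {
	char cb[48];
};

/* Store metadata where an accessor in the spirit of strp_msg() would. */
void mock_strp_msg_write(struct mock_cb_skb *skb, struct mock_strp_msg m)
{
	memcpy(skb->cb + offsetof(struct mock_cb_layout, strp), &m, sizeof(m));
}

/* Read through the accessor-style offset (memcpy avoids aliasing issues). */
struct mock_strp_msg mock_strp_msg_read(const struct mock_cb_skb *skb)
{
	struct mock_strp_msg m;

	memcpy(&m, skb->cb + offsetof(struct mock_cb_layout, strp), sizeof(m));
	return m;
}

/* The outdated pattern: treat the start of cb as the metadata. */
struct mock_strp_msg mock_naive_read(const struct mock_cb_skb *skb)
{
	struct mock_strp_msg m;

	memcpy(&m, skb->cb, sizeof(m));
	return m;
}
```

In this model, a value stored through the accessor-style helper is not
visible to mock_naive_read(), which reads the private bytes instead. That
is the kind of mistake the updated wording is meant to steer readers
away from.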
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: corbet@lwn.net
CC: skhan@linuxfoundation.org
CC: linux-doc@vger.kernel.org
---
Documentation/networking/strparser.rst | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/Documentation/networking/strparser.rst b/Documentation/networking/strparser.rst
index 8dc6bb04c710..372106b61e65 100644
--- a/Documentation/networking/strparser.rst
+++ b/Documentation/networking/strparser.rst
@@ -40,8 +40,8 @@ Functions
::
- strp_init(struct strparser *strp, struct sock *sk,
- const struct strp_callbacks *cb)
+ int strp_init(struct strparser *strp, struct sock *sk,
+ const struct strp_callbacks *cb)
Called to initialize a stream parser. strp is a struct of type
strparser that is allocated by the upper layer. sk is the TCP
@@ -95,7 +95,7 @@ Functions
void strp_data_ready(struct strparser *strp);
- The upper layer calls strp_tcp_data_ready when data is ready on
+ The upper layer calls strp_data_ready when data is ready on
the lower socket for strparser to process. This should be called
from a data_ready callback that is set on the socket. Note that
maximum messages size is the limit of the receive socket
@@ -123,9 +123,9 @@ Callbacks
should parse the sk_buff as containing the headers for the
next application layer message in the stream.
- The skb->cb in the input skb is a struct strp_msg. Only
- the offset field is relevant in parse_msg and gives the offset
- where the message starts in the skb.
+ The strparser metadata in the input skb can be accessed with
+ strp_msg(skb). Only the offset field is relevant in parse_msg and
+ gives the offset where the message starts in the skb.
The return values of this function are:
@@ -176,11 +176,11 @@ Callbacks
received in rcv_msg (see strp_pause above). This callback
must be set.
- The skb->cb in the input skb is a struct strp_msg. This
- struct contains two fields: offset and full_len. Offset is
- where the message starts in the skb, and full_len is the
- the length of the message. skb->len - offset may be greater
- than full_len since strparser does not trim the skb.
+ The strparser metadata in the input skb can be accessed with
+ strp_msg(skb). This struct contains two fields: offset and full_len.
+ Offset is where the message starts in the skb, and full_len is
+ the length of the message. skb->len - offset may be greater than
+ full_len since strparser does not trim the skb.
::
--
2.54.0
* [PATCH net-next v2 2/3] docs: net: fix minor issues with devlink docs
From: Jakub Kicinski @ 2026-06-13 16:58 UTC
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
john.fastabend, sd, jiri, Jakub Kicinski, skhan
In-Reply-To: <20260613165846.2913092-1-kuba@kernel.org>
Update devlink documentation to match current code:
- describe health reporter defaults (it's currently under "callbacks"),
best-effort auto-dump, and port-scoped reporters
- fix generic parameter names and values
- fix nested devlink setup wording and registration ordering
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: jiri@resnulli.us
CC: corbet@lwn.net
CC: skhan@linuxfoundation.org
CC: linux-doc@vger.kernel.org
---
Documentation/networking/devlink/devlink-health.rst | 12 ++++++++----
Documentation/networking/devlink/devlink-params.rst | 2 +-
Documentation/networking/devlink/devlink-port.rst | 5 ++++-
Documentation/networking/devlink/devlink-trap.rst | 8 +++++---
Documentation/networking/devlink/index.rst | 10 +++++-----
5 files changed, 23 insertions(+), 14 deletions(-)
diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst
index 4d10536377ab..bedac58a2f36 100644
--- a/Documentation/networking/devlink/devlink-health.rst
+++ b/Documentation/networking/devlink/devlink-health.rst
@@ -33,7 +33,9 @@ asynchronously. All health reports handling is done by ``devlink``.
* Recovery procedures
* Diagnostics procedures
* Object dump procedures
- * Out Of Box initial parameters
+
+Drivers also provide default values for generic reporter parameters when
+creating a health reporter.
Different parts of the driver can register different types of health reporters
with different handlers.
@@ -45,8 +47,9 @@ Actions
* A log is being send to the kernel trace events buffer
* Health status and statistics are being updated for the reporter instance
- * Object dump is being taken and saved at the reporter instance (as long as
- auto-dump is set and there is no other dump which is already stored)
+ * Object dump is being taken and saved at the reporter instance. This is
+ best effort and skipped when recovery is aborted, auto-dump is disabled,
+ no dump callback is registered, or a dump is already stored.
* Auto recovery attempt is being done. Depends on:
- Auto-recovery configuration
@@ -75,7 +78,8 @@ User Interface
==============
User can access/change each reporter's parameters and driver specific callbacks
-via ``devlink``, e.g per error type (per health reporter):
+via ``devlink``, e.g. per error type (per health reporter). Reporters may be
+registered for the whole devlink instance or for a specific devlink port.
* Configure reporter's generic parameters (like: disable/enable auto recovery)
* Invoke recovery procedure
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index ea17756dcda6..ca19ee3e63c8 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -122,7 +122,7 @@ own name.
* - ``enable_iwarp``
- Boolean
- Enable handling of iWARP traffic in the device.
- * - ``internal_err_reset``
+ * - ``internal_error_reset``
- Boolean
- When enabled, the device driver will reset the device on internal
errors.
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 5e397798a402..9374ebe70f48 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -38,7 +38,7 @@ Devlink port flavours are described below.
- This indicates an eswitch port representing a port of PCI
subfunction (SF).
* - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
- - This indicates a virtual port for the PCI virtual function.
+ - Any virtual port facing the user.
Devlink port can have a different type based on the link layer described below.
@@ -134,6 +134,9 @@ Users may also set the IPsec crypto capability of the function using
Users may also set the IPsec packet capability of the function using
`devlink port function set ipsec_packet` command.
+The ``migratable`` attribute may be set only on ports with
+``DEVLINK_PORT_FLAVOUR_PCI_VF``.
+
Users may also set the maximum IO event queues of the function
using `devlink port function set max_io_eqs` command.
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index 5885e21e2212..ac5bf9337198 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -516,9 +516,11 @@ Generic Packet Trap Groups
Generic packet trap groups are used to aggregate logically related packet
traps. These groups allow the user to batch operations such as setting the trap
-action of all member traps. In addition, ``devlink-trap`` can report aggregated
-per-group packets and bytes statistics, in case per-trap statistics are too
-narrow. The description of these groups must be added to the following table:
+action of all member drop traps whose action may legally change. Exception and
+control traps remain unchanged. In addition, ``devlink-trap`` can report
+aggregated per-group packets and bytes statistics, in case per-trap statistics
+are too narrow. The description of these groups must be added to the following
+table:
.. list-table:: List of Generic Packet Trap Groups
:widths: 10 90
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index f7ba7dcf477d..32f70879ddd0 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -13,8 +13,8 @@ new APIs prefixed by ``devl_*``. The older APIs handle all the locking
in devlink core, but don't allow registration of most sub-objects once
the main devlink object is itself registered. The newer ``devl_*`` APIs assume
the devlink instance lock is already held. Drivers can take the instance
-lock by calling ``devl_lock()``. It is also held all callbacks of devlink
-netlink commands.
+lock by calling ``devl_lock()``. It is also held across all callbacks of
+devlink netlink commands.
Drivers are encouraged to use the devlink instance lock for their own needs.
@@ -33,11 +33,11 @@ devlink instances created underneath. In that case, drivers should make
lock of both nested and parent instances at the same time, devlink
instance lock of the parent instance should be taken first, only then
instance lock of the nested instance could be taken.
- - Driver should use object-specific helpers to setup the
- nested relationship:
+ - Driver should use object-specific helpers to setup the nested relationship
+ before registering the nested devlink instance:
- ``devl_nested_devlink_set()`` - called to setup devlink -> nested
- devlink relationship (could be user for multiple nested instances.
+ devlink relationship (could be used for multiple nested instances).
- ``devl_port_fn_devlink_set()`` - called to setup port function ->
nested devlink relationship.
- ``devlink_linecard_nested_dl_set()`` - called to setup linecard ->
--
2.54.0
* [PATCH net-next v2 1/3] docs: net: tls-offload: document tls_dev_del, tls_dev_resync, and rekey
From: Jakub Kicinski @ 2026-06-13 16:58 UTC
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
john.fastabend, sd, jiri, Jakub Kicinski, skhan
In-Reply-To: <20260613165846.2913092-1-kuba@kernel.org>
Fill in some gaps in the TLS offload doc:
- describe the tls_dev_del and tls_dev_resync callbacks
- add a mention of rekeying being out of scope for now
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
v2:
- add mentions of the callback in resync text
- Stack -> The stack
v1: https://lore.kernel.org/20260609201224.1191391-1-kuba@kernel.org
CC: john.fastabend@gmail.com
CC: sd@queasysnail.net
CC: corbet@lwn.net
CC: skhan@linuxfoundation.org
CC: linux-doc@vger.kernel.org
---
Documentation/networking/tls-offload.rst | 45 ++++++++++++++++++++----
1 file changed, 38 insertions(+), 7 deletions(-)
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index 25ee8d9f12c9..e5802bcd4d22 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -99,6 +99,29 @@ at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.
+When the offloaded connection is destroyed the core calls
+the :c:member:`tls_dev_del` callback so the driver can release per-direction
+state:
+
+.. code-block:: c
+
+ void (*tls_dev_del)(struct net_device *netdev,
+ struct tls_context *ctx,
+ enum tls_offload_ctx_dir direction);
+
+``tls_dev_del`` is mandatory whenever ``tls_dev_add`` is provided.
+
+The third TLS device callback is :c:member:`tls_dev_resync`, called by the core
+to synchronize the TCP stream with the record boundaries:
+
+.. code-block:: c
+
+ int (*tls_dev_resync)(struct net_device *netdev,
+ struct sock *sk, u32 seq, u8 *rcd_sn,
+ enum tls_offload_ctx_dir direction);
+
+See the `Resync handling`_ section for details.
+
TX
--
@@ -250,9 +273,9 @@ sequence number (as it will be updated from a different context).
bool tls_offload_tx_resync_pending(struct sock *sk)
Next time ``ktls`` pushes a record it will first send its TCP sequence number
-and TLS record number to the driver. Stack will also make sure that
-the new record will start on a segment boundary (like it does when
-the connection is initially added).
+and TLS record number to the driver via the ``tls_dev_resync`` callback.
+The stack will also make sure that the new record will start on a segment
+boundary (like it does when the connection is initially added).
RX
--
@@ -344,9 +367,10 @@ all TLS record headers that have been logged since the resync request
started.
The kernel confirms the guessed location was correct and tells the device
-the record sequence number. Meanwhile, the device had been parsing
-and counting all records since the just-confirmed one, it adds the number
-of records it had seen to the record number provided by the kernel.
+the record sequence number via the ``tls_dev_resync`` callback. Meanwhile,
+the device had been parsing and counting all records since the just-confirmed
+one, it adds the number of records it had seen to the record number provided
+by the kernel.
At this point the device is in sync and can resume decryption at next
segment boundary.
@@ -370,12 +394,19 @@ schedules resynchronization after it has received two completely encrypted
records.
The stack waits for the socket to drain and informs the device about
-the next expected record number and its TCP sequence number. If the
+the next expected record number and its TCP sequence number via the
+``tls_dev_resync`` callback. If the
records continue to be received fully encrypted stack retries the
synchronization with an exponential back off (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).
+Rekey
+=====
+
+Offload does not currently support TLS 1.3, therefore key rotation
+is not a concern for offloaded connections at this point.
+
Error handling
==============
--
2.54.0
* [PATCH net-next v2 0/3] docs: net: more adjustments to docs
From: Jakub Kicinski @ 2026-06-13 16:58 UTC
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
john.fastabend, sd, jiri, Jakub Kicinski
A few small updates to the docs.
This is trying to prepare docs for getting fed directly
into AI reviews.
v2:
- fixes in the tls offload patch
- add the strparser patch in place of the already applied XDP md one
v1: https://lore.kernel.org/20260609201224.1191391-1-kuba@kernel.org
Jakub Kicinski (3):
docs: net: tls-offload: document tls_dev_del, tls_dev_resync, and
rekey
docs: net: fix minor issues with devlink docs
docs: net: fix minor issues with strparser docs
.../networking/devlink/devlink-health.rst | 12 +++--
.../networking/devlink/devlink-params.rst | 2 +-
.../networking/devlink/devlink-port.rst | 5 ++-
.../networking/devlink/devlink-trap.rst | 8 ++--
Documentation/networking/devlink/index.rst | 10 ++---
Documentation/networking/strparser.rst | 22 ++++-----
Documentation/networking/tls-offload.rst | 45 ++++++++++++++++---
7 files changed, 72 insertions(+), 32 deletions(-)
--
2.54.0
* Re: [PATCH net-next v6 3/5] net: dsa: tag_ks8995: Add the KS8995 tag handling
From: Linus Walleij @ 2026-06-13 16:56 UTC
To: Jakub Kicinski
Cc: woojung.huh, UNGLinuxDriver, andrew, olteanv, davem, edumazet,
pabeni, robh, krzk+dt, conor+dt, marex, horms, linux, netdev,
devicetree, nb
In-Reply-To: <20260610153952.1685895-1-kuba@kernel.org>
DSA maintainers, please look at the analysis further below.
The conclusion is that taggers can't reuse the vlan_* helpers
because they free the skb on error, so maybe I need
to create new derivative functions that taggers can reuse.
On Wed, Jun 10, 2026 at 5:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
Fair enough.
> > +static struct sk_buff *ks8995_xmit(struct sk_buff *skb, struct net_device *dev)
> > +{
> > + struct vlan_ethhdr *hdr = vlan_eth_hdr(skb);
> > + bool have_hwaccel_tag = false;
> > + u16 tci = 0, portmask;
> > +
> > + /* Prepare the special KS8995 tags */
> > + portmask = dsa_xmit_port_mask(skb, dev);
> > +
> > + if (skb_vlan_tag_present(skb) && skb->vlan_proto == htons(ETH_P_8021Q)) {
> > + tci = skb_vlan_tag_get(skb);
> > + __vlan_hwaccel_clear_tag(skb);
> > + have_hwaccel_tag = true;
> > + }
>
> [Low]
> What is intended to happen if the skb arrives with both a hwaccel VLAN
> tag and an in-head 0x8100 tag, or with a hwaccel tag whose vlan_proto
> is something other than 0x8100 (for example 0x88a8 service tag)?
>
> In those cases this branch leaves the hwaccel tag alone or only consumes
> it conditionally, and the second branch below either inserts a fresh
> in-head tag while keeping a pre-existing in-head 0x8100, or modifies the
> in-head 0x8100 while a non-8021Q hwaccel tag is still attached. Should
> this path either reject such frames or normalise them first?
I have no idea what "normalise" means in this context.
What happens to e.g. an 88a8 service tag is that it gets forwarded
to the target port using the modified vlan tag added by vlan_insert_tag()
below. I think part of the AI confusion here comes from the
fact that we are using modified "vlan tags" whose semantics
do not fit the common pattern, so the statistics in the
training data become irrelevant.
The whole point of the code is to modify pre-existing
ETH_P_8021Q tags in-place rather than adding a second one,
which means that everything else will get a second tag, the sole
function of which is to direct the packet to the right port, where it
will be stripped by the hardware.
> > + if (have_hwaccel_tag || hdr->h_vlan_proto != htons(ETH_P_8021Q)) {
> > + skb = vlan_insert_tag(skb, KS8995M_STPID(portmask), tci);
> > + if (!skb)
> > + return NULL;
>
> [High]
> Is returning NULL here safe?
>
> include/linux/if_vlan.h documents that vlan_insert_tag() calls
> dev_kfree_skb_any(skb) on failure (e.g. when skb_cow_head() fails under
> memory pressure) and returns NULL. The caller in net/dsa/user.c does:
>
> nskb = p->xmit(skb, dev);
> if (!nskb) {
> kfree_skb(skb);
> return NETDEV_TX_OK;
> }
>
> so on the NULL return dsa_user_xmit() runs kfree_skb() on its local
> pointer that vlan_insert_tag() has already freed.
vlan_insert_tag()
vlan_insert_inner_tag() <- returns NULL on error
__vlan_insert_inner_tag()
> Doesn't that produce a
> use-after-free on skb->users plus a double free?
>
> Should this path swallow the free done by vlan_insert_tag() (e.g. by
> returning a sentinel handled by the caller) so dsa_user_xmit() does not
> re-free the skb?
I looked into this.
The fact that several taggers do this in ->xmit() such as here in tag_8021q.c:
struct sk_buff *dsa_8021q_xmit(struct sk_buff *skb, struct net_device *netdev,
u16 tpid, u16 tci)
{
/* skb->data points at the MAC header, which is fine
* for vlan_insert_tag().
*/
return vlan_insert_tag(skb, htons(tpid), tci);
}
should be a sign that this is OK in xmit:
return vlan_insert_tag(skb,...);
which is equivalent to:
skb = vlan_insert_tag(skb,...);
if (!skb)
return skb;
which is equivalent to:
skb = vlan_insert_tag(skb,...);
if (!skb)
return NULL;
Which is what I do.
So yeah, the skb will be freed twice. The code in tag_8021q.c will
also do that. But what do you expect ->xmit() to return on error if
not NULL?
When user.c does this:
/* Transmit function may have to reallocate the original SKB,
* in which case it must have freed it. Only free it here on error.
*/
nskb = p->xmit(skb, dev);
if (!nskb) {
kfree_skb(skb);
return NETDEV_TX_OK;
}
return dsa_enqueue_skb(nskb, dev);
The only way to get cleanly out of this branch if you run
into an error in ->xmit() is to return NULL!
Now let's look at ->rcv():
> > + /* Move the custom DSA+VLAN tag into the hwaccel area and strip
> > + * it from the skb head
> > + */
> > + skb = skb_vlan_untag(skb);
> > + if (!skb) {
> > + netdev_err(dev, "%s: unable to untag skb\n", __func__);
> > + return NULL;
> > + }
>
> [High]
> Is the NULL return here compatible with the cpu_dp->rcv contract?
>
> skb_vlan_untag() can take its err_free path (e.g. when the
> pskb_may_pull(VLAN_HLEN + sizeof(unsigned short)) inside it fails, or
> skb_reorder_vlan_header() fails) which calls kfree_skb(skb) and returns
> NULL.
>
> The caller in net/dsa/tag.c is:
>
> skb = skb_unshare(skb, GFP_ATOMIC);
> ...
> nskb = cpu_dp->rcv(skb, dev);
> ...
> if (!nskb) {
> kfree_skb(skb);
> return 0;
> }
>
> so when ks8995_rcv() returns NULL after skb_vlan_untag() has already
> freed its argument, dsa_switch_rcv() ends up calling kfree_skb() on a
> freed skb. Wouldn't that be a use-after-free on skb->users plus a
> double free?
>
> Other cpu_dp->rcv handlers appear to assume that returning NULL means
> the skb was not freed. Should this handler do something equivalent so
> the contract still holds?
Yes, skb_vlan_untag() will free the skb on error and then the caller
will free the skb again.
But how do you signal to the caller in net/dsa/tag.c
that "things went sideways and the SKB is already freed"?
So these semantics around ->xmit() and ->rcv() freeing the skb on
a NULL return basically challenge Vladimir's request that I
reuse these functions in the first place. They are not made
for this kind of reuse.
What I *CAN* do is go and create wrappers in skbuff.h/c
that will not free the skb on error but just return NULL anyway,
intended for this one user (to begin with), such as
vlan_insert_tag_no_free_skb_on_error();
skb_vlan_untag_no_free_skb_on_error();
I honestly think these are good names because there is
no risk of misunderstanding them...
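To make the ownership question concrete, here is a toy userspace model in Python. It is only a sketch under stated assumptions: Skb, kfree_skb(), untag_frees_on_error(), untag_no_free_on_error() and switch_rcv() are hypothetical stand-ins written for illustration, not kernel code, and the caller is assumed to free its own pointer on a NULL (None) return as in the tag.c snippet quoted above.

```python
class Skb:
    """Toy stand-in for struct sk_buff; it only counts how often it is freed."""
    def __init__(self):
        self.free_calls = 0


def kfree_skb(skb):
    # A real double free corrupts memory; the model just counts the calls.
    skb.free_calls += 1


def untag_frees_on_error(skb, fail):
    """Models a helper that frees its argument on failure and returns None."""
    if fail:
        kfree_skb(skb)
        return None
    return skb


def untag_no_free_on_error(skb, fail):
    """Models the proposed wrapper: returns None on failure, skb left alive."""
    return None if fail else skb


def switch_rcv(skb, untag, fail):
    """Models a caller that frees its own pointer when the handler returns None."""
    nskb = untag(skb, fail)
    if nskb is None:
        kfree_skb(skb)
        return 0
    return 1


freeing = Skb()
switch_rcv(freeing, untag_frees_on_error, fail=True)

non_freeing = Skb()
switch_rcv(non_freeing, untag_no_free_on_error, fail=True)

success = Skb()
delivered = switch_rcv(success, untag_no_free_on_error, fail=False)
```

In this model the freeing helper leaves the skb freed twice on the error path, while the non-freeing variant leaves exactly one free to the caller. Whether such wrappers are acceptable is of course the maintainers' call.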
But then I want some buy-in from the maintainers that this is the
way to go.
Yours,
Linus Walleij
^ permalink raw reply
* [PATCH bpf-next v4 2/2] selftests/bpf: Add test to verify the fix for bpf_setsockopt() helper
From: Leon Hwang @ 2026-06-13 16:24 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis,
John Fastabend, Stanislav Fomichev, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Shuah Khan, Leon Hwang, Ihor Solodrai, netdev, linux-kernel,
linux-kselftest, kernel-patches-bot
In-Reply-To: <20260613162443.60515-1-leon.hwang@linux.dev>
Verify the fix by:
1. Attach a cgroup sockops prog.
2. Build a TCP connection to an IPv4 address through an IPv6 socket.
3. Verify the return value of the bpf_setsockopt() helper.
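
For readers unfamiliar with step 2: an IPv4-mapped IPv6 address is the 16-byte form ::ffff:a.b.c.d, i.e. bytes 10 and 11 set to 0xff and the IPv4 address in the last four bytes. A minimal Python sketch of that construction follows; v4mapped_sockaddr() is a hypothetical helper name used only for illustration and is not part of the selftest.

```python
import ipaddress


def v4mapped_sockaddr(ipv4, port):
    """Build an AF_INET6-style sockaddr tuple for an IPv4-mapped address."""
    raw = bytearray(16)                      # all-zero IPv6 address
    raw[10] = 0xff                           # ::ffff:0:0/96 prefix
    raw[11] = 0xff
    raw[12:16] = ipaddress.IPv4Address(ipv4).packed
    addr = ipaddress.IPv6Address(bytes(raw))
    # (host, port, flowinfo, scope_id) as expected by socket.connect()
    return (str(addr), port, 0, 0)


sockaddr = v4mapped_sockaddr("127.0.0.1", 8080)
mapped = ipaddress.IPv6Address(sockaddr[0]).ipv4_mapped
```

Connecting an AF_INET6 socket with IPV6_V6ONLY cleared to such an address yields a socket whose family is AF_INET6 but which carries IPv4 traffic, which is the case the test exercises.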
Assisted-by: Codex:gpt-5.5-xhigh
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
.../selftests/bpf/prog_tests/setget_sockopt.c | 78 +++++++++++++++++++
.../selftests/bpf/progs/setget_sockopt.c | 23 ++++++
2 files changed, 101 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
index 77fe1bfb7504..4e91d9b615ce 100644
--- a/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
+++ b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
@@ -199,6 +199,83 @@ static void test_nonstandard_opt(int family)
bpf_link__destroy(getsockopt_link);
}
+static int connect_to_v4mapped_v6_fd(int server_fd)
+{
+ struct sockaddr_storage addr;
+ struct sockaddr_in *addr4 = (void *)&addr;
+ socklen_t addrlen = sizeof(addr);
+ struct sockaddr_in6 addr6 = {};
+ int fd = -1, v6only = 0, err;
+
+ err = getsockname(server_fd, (struct sockaddr *)&addr, &addrlen);
+ if (!ASSERT_OK(err, "getsockname"))
+ return -1;
+
+ fd = socket(AF_INET6, SOCK_STREAM, 0);
+ if (!ASSERT_GE(fd, 0, "socket"))
+ return -1;
+
+ err = settimeo(fd, 0);
+ if (!ASSERT_OK(err, "settimeo"))
+ goto err_out;
+
+ err = setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, sizeof(v6only));
+ if (!ASSERT_OK(err, "clear_v6only"))
+ goto err_out;
+
+ addr6.sin6_family = AF_INET6;
+ addr6.sin6_port = addr4->sin_port;
+ addr6.sin6_addr.s6_addr[10] = 0xff;
+ addr6.sin6_addr.s6_addr[11] = 0xff;
+ memcpy(&addr6.sin6_addr.s6_addr[12], &addr4->sin_addr, sizeof(addr4->sin_addr));
+
+ err = connect(fd, (struct sockaddr *)&addr6, sizeof(addr6));
+ if (!ASSERT_OK(err, "connect"))
+ goto err_out;
+
+ return fd;
+
+err_out:
+ close(fd);
+ return -1;
+}
+
+static void test_v4mapped_v6_ip_tos(void)
+{
+ struct setget_sockopt__bss *bss = skel->bss;
+ int sfd = -1, fd = -1, got = 0, exp = 0x1c;
+ socklen_t optlen;
+
+ memset(bss, 0, sizeof(*bss));
+ bss->v4mapped_v6_ip_tos_enable = 1;
+ bss->v4mapped_v6_ip_tos_ret = -1;
+ bss->v4mapped_v6_ip_tos_val = exp;
+
+ sfd = start_server(AF_INET, SOCK_STREAM, addr4_str, 0, 0);
+ if (!ASSERT_GE(sfd, 0, "start_server"))
+ goto err_out;
+
+ fd = connect_to_v4mapped_v6_fd(sfd);
+ if (!ASSERT_GE(fd, 0, "connect_to_v4mapped_v6_fd"))
+ goto err_out;
+
+ ASSERT_GT(bss->v4mapped_v6_ip_tos_cnt, 0, "v4mapped_v6_ip_tos_cnt");
+ ASSERT_EQ(bss->v4mapped_v6_ip_tos_ret, 0, "v4mapped_v6_ip_tos_ret");
+
+ optlen = sizeof(got);
+ if (!ASSERT_OK(getsockopt(fd, SOL_IP, IP_TOS, &got, &optlen), "getsockopt_ip_tos"))
+ goto err_out;
+
+ ASSERT_EQ(got, exp, "ip_tos");
+
+err_out:
+ bss->v4mapped_v6_ip_tos_enable = 0;
+ if (fd >= 0)
+ close(fd);
+ if (sfd >= 0)
+ close(sfd);
+}
+
void test_setget_sockopt(void)
{
cg_fd = test__join_cgroup(CG_NAME);
@@ -238,6 +315,7 @@ void test_setget_sockopt(void)
test_ktls(AF_INET);
test_nonstandard_opt(AF_INET);
test_nonstandard_opt(AF_INET6);
+ test_v4mapped_v6_ip_tos();
done:
setget_sockopt__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/setget_sockopt.c b/tools/testing/selftests/bpf/progs/setget_sockopt.c
index d330b1511979..636a7cd8e2fa 100644
--- a/tools/testing/selftests/bpf/progs/setget_sockopt.c
+++ b/tools/testing/selftests/bpf/progs/setget_sockopt.c
@@ -387,6 +387,24 @@ int _getsockopt(struct bpf_sockopt *ctx)
return 1;
}
+int v4mapped_v6_ip_tos_enable;
+int v4mapped_v6_ip_tos_ret;
+int v4mapped_v6_ip_tos_cnt;
+int v4mapped_v6_ip_tos_val;
+
+static void test_v4mapped_v6_ip_tos(struct bpf_sock_ops *skops)
+{
+ int tos = v4mapped_v6_ip_tos_val;
+
+ if (!v4mapped_v6_ip_tos_enable || skops->op != BPF_SOCK_OPS_TCP_CONNECT_CB)
+ return;
+ if (skops->family != AF_INET6)
+ return;
+
+ v4mapped_v6_ip_tos_cnt++;
+ v4mapped_v6_ip_tos_ret = bpf_setsockopt(skops, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
+}
+
SEC("sockops")
int skops_sockopt(struct bpf_sock_ops *skops)
{
@@ -401,6 +419,11 @@ int skops_sockopt(struct bpf_sock_ops *skops)
if (!sk)
return 1;
+ if (v4mapped_v6_ip_tos_enable) {
+ test_v4mapped_v6_ip_tos(skops);
+ return 1;
+ }
+
switch (skops->op) {
case BPF_SOCK_OPS_TCP_LISTEN_CB:
nr_listen += !(bpf_test_sockopt(skops, sk) ||
--
2.54.0
^ permalink raw reply related
* [PATCH bpf-next v4 0/2] bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket
From: Leon Hwang @ 2026-06-13 16:24 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis,
John Fastabend, Stanislav Fomichev, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Shuah Khan, Leon Hwang, Ihor Solodrai, netdev, linux-kernel,
linux-kselftest, kernel-patches-bot
When TCP runs over IPv4 via the INET6 API, sk->sk_family is AF_INET6, but
the packets are IPv4. inet_csk(sk)->icsk_af_ops is ipv6_mapped and uses
ip_queue_xmit. The tos sockopt does not work for the bpf [get,set]sockopt()
helpers in this case.
Changelog:
v3 -> v4:
* Add 'sk->sk_type != SOCK_RAW && !ipv6_only_sock(sk)' check.
* Re-implement test with LLM assistance.
* v3: https://lore.kernel.org/all/20240914103226.71109-1-zhoufeng.zf@bytedance.com/
v2->v3:
* Use sk_is_inet() helper. (Eric Dumazet)
* https://lore.kernel.org/bpf/CANn89i+9GmBLCdgsfH=WWe-tyFYpiO27wONyxaxiU6aOBC6G8g@mail.gmail.com/T/
v1->v2:
* Fix compilation error. (kernel test robot)
* https://lore.kernel.org/bpf/202408152058.YXAnhLgZ-lkp@intel.com/T/
Leon Hwang (2):
bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket
selftests/bpf: Add test to verify the fix for bpf_setsockopt() helper
net/core/filter.c | 15 +++-
.../selftests/bpf/prog_tests/setget_sockopt.c | 78 +++++++++++++++++++
.../selftests/bpf/progs/setget_sockopt.c | 23 ++++++
3 files changed, 115 insertions(+), 1 deletion(-)
--
2.54.0
^ permalink raw reply
* [PATCH bpf-next v4 1/2] bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket
From: Leon Hwang @ 2026-06-13 16:24 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis,
John Fastabend, Stanislav Fomichev, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Shuah Khan, Leon Hwang, Ihor Solodrai, netdev, linux-kernel,
linux-kselftest, kernel-patches-bot, Feng Zhou
In-Reply-To: <20260613162443.60515-1-leon.hwang@linux.dev>
When TCP runs over IPv4 via the INET6 API, bpf_get/setsockopt at the ipv4
level fails, because sk->sk_family is AF_INET6. At the ipv6 level it
succeeds but does not take effect, because inet_csk(sk)->icsk_af_ops is
ipv6_mapped and uses ip_queue_xmit, which takes inet_sk(sk)->tos.
To relax this restriction, allow getting/setting tos for those possible
ipv4-mapped ipv6 sockets.
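
The relaxed check can be summarised as a small decision function. The Python below is only an illustrative model of the rule described here, not the kernel code; allows_sol_ip_sockopt() and its ipv6_only parameter are hypothetical names chosen for the sketch.

```python
import socket


def allows_sol_ip_sockopt(family, sock_type, ipv6_only):
    """Model of the rule: AF_INET always, AF_INET6 only if it may carry IPv4."""
    if family == socket.AF_INET:
        return True
    if family == socket.AF_INET6:
        # A raw or v6-only socket can never be IPv4-mapped.
        return sock_type != socket.SOCK_RAW and not ipv6_only
    return False


cases = {
    "inet_tcp": allows_sol_ip_sockopt(socket.AF_INET, socket.SOCK_STREAM, False),
    "inet6_dual": allows_sol_ip_sockopt(socket.AF_INET6, socket.SOCK_STREAM, False),
    "inet6_v6only": allows_sol_ip_sockopt(socket.AF_INET6, socket.SOCK_STREAM, True),
    "inet6_raw": allows_sol_ip_sockopt(socket.AF_INET6, socket.SOCK_RAW, False),
    "other": allows_sol_ip_sockopt(socket.AF_UNSPEC, socket.SOCK_STREAM, False),
}
```

Note that the check is deliberately permissive: a dual-stack AF_INET6 socket passes even if it currently talks native IPv6, hence "possible" ipv4-mapped sockets.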
Fixes: ee7f1e1302f5 ("bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()")
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
net/core/filter.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 9590877b0714..57b00c6cc8cc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5544,11 +5544,24 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
KERNEL_SOCKPTR(optval), *optlen);
}
+static bool sk_allows_sol_ip_sockopt(struct sock *sk)
+{
+ switch (sk->sk_family) {
+ case AF_INET:
+ return true;
+ case AF_INET6:
+ /* Allow getting/setting sockopt for possible ipv4-mapped ipv6 socket. */
+ return sk->sk_type != SOCK_RAW && !ipv6_only_sock(sk);
+ default:
+ return false;
+ }
+}
+
static int sol_ip_sockopt(struct sock *sk, int optname,
char *optval, int *optlen,
bool getopt)
{
- if (sk->sk_family != AF_INET)
+ if (!sk_allows_sol_ip_sockopt(sk))
return -EINVAL;
switch (optname) {
--
2.54.0
^ permalink raw reply related
* Re: [PATCH net v2 0/2] ip_tunnel: fix PMTU ICMP reply routing
From: Jakub Kicinski @ 2026-06-13 16:23 UTC (permalink / raw)
To: Laika Price
Cc: David Ahern, Ido Schimmel, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Shuah Khan, netdev, linux-kernel,
linux-kselftest
In-Reply-To: <CAL=tPgjhj0+8voK40ZPdsKyQ0Pn4vwnSg-JVqRK3qRSXLLB4Kw@mail.gmail.com>
On Sat, 13 Jun 2026 16:38:27 +0100 Laika Price wrote:
> Disregard v2 of this series.
>
> Apologies, I'm new to kernel development and did not realise that I should
> squash commits that would cause the kernel to not build / fail tests. I am
> sending in a v3 with these squashed.
>
> Sorry for the noise.
I'm not sure what build failure you're talking about.
Please observe the 24h cooldown between submitting new versions
of a patch.
^ permalink raw reply
* Re: [PATCH net-next V2] selftests: drv-net: Test queue stall upon reconfig
From: Jakub Kicinski @ 2026-06-13 16:20 UTC (permalink / raw)
To: Mohsin Bashir
Cc: netdev, andrew+netdev, davem, edumazet, pabeni, shuah,
linux-kselftest
In-Reply-To: <20260613014855.1717712-1-mohsin.bashr@gmail.com>
On Fri, 12 Jun 2026 18:48:54 -0700 Mohsin Bashir wrote:
> From: Mohsin Bashir <hmohsin@meta.com>
>
> Add a reconfig_tx_stall test that detects the possibility of a TX stall
> after ring reconfiguration. The key observation is that drivers using
> netif_tx_start_all_queues() are prone to experiencing a stall when
> reconfiguration completes compared to drivers using
> netif_tx_wake_all_queues(). start_all_queues only clears DRV_XOFF, while
> wake_all_queues also calls __netif_schedule() to kick the qdisc. Without
> the kick, qdisc backlog present at reconfig time can stay stuck until a
> new trigger is issued.
>
> The test caps the TX ring at 64 entries so it fills quickly, then
> installs FQ on a target TX queue and sends UDP packets with SO_TXTIME
> scheduled in the future. With napi_defer_hard_irqs slowing completions,
> the small ring can fill when FQ releases the burst, leaving requeued
> qdisc backlog with no FQ timer to rescue it. A subsequent ring reconfig
> must wake the queues to drain the backlog. Simply starting the queues can
> leave it stuck.
>
> On host with problematic driver:
> Sent 128 SO_TXTIME packets (+100ms)
> Sent 128 SO_TXTIME packets (+200ms)
> Backlog before reconfig: 52632 bytes
> Check| At /root/ksft-net-drv/./drivers/net/ring_reconfig.py, ...
> Check| ksft_eq(0, backlog,
> Check failed 0 != 52632 qdisc backlog stuck on queue 1 after ring reconfig
> not ok 3 ring_reconfig.reconfig_tx_stall
>
> On host with fixed driver:
> Sent 128 SO_TXTIME packets (+100ms)
> Sent 128 SO_TXTIME packets (+200ms)
> Backlog before reconfig: 76024 bytes
> ok 3 ring_reconfig.reconfig_tx_stall
>
> Signed-off-by: Mohsin Bashir <hmohsin@meta.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pylint is not on board:
+tools/testing/selftests/drivers/net/ring_reconfig.py:169:37: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
+tools/testing/selftests/drivers/net/ring_reconfig.py:162:17: W0613: Unused argument 'cfg' (unused-argument)
+tools/testing/selftests/drivers/net/ring_reconfig.py:253:0: C0116: Missing function or method docstring (missing-function-docstring)
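One possible shape for a helper that avoids the first two warnings, sketched outside the selftest framework so it can run standalone. write_with_restore() is a hypothetical name; in the selftest the returned callable would presumably be handed to defer(), which is an assumption on my side rather than something tested here.

```python
import os
import tempfile


def write_with_restore(path, val):
    """Write val to path; return a callable restoring the old value, or None."""
    with open(path, "r", encoding="utf-8") as fp:
        orig_val = fp.read().strip()
    if str(val) == orig_val:
        return None                      # nothing changed, nothing to restore
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(str(val))

    def restore():
        # Explicit encoding here too, which is what W1514 asks for.
        with open(path, "w", encoding="utf-8") as fp:
            fp.write(orig_val)
    return restore


def read_value(path):
    with open(path, "r", encoding="utf-8") as fp:
        return fp.read().strip()


# Demo on a temporary file standing in for a sysfs attribute.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "w", encoding="utf-8") as demo_fp:
    demo_fp.write("0")

restore_cb = write_with_restore(demo_path, 100)
after_write = read_value(demo_path)
restore_cb()
after_restore = read_value(demo_path)
unchanged_cb = write_with_restore(demo_path, 0)
os.unlink(demo_path)
```

The helper takes no cfg argument at all, which sidesteps W0613, and a nested function replaces the lambda so the file handle is closed deterministically.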
> diff --git a/tools/testing/selftests/drivers/net/config b/tools/testing/selftests/drivers/net/config
> index 617de8aaf551..1ef07fae74c1 100644
> --- a/tools/testing/selftests/drivers/net/config
> +++ b/tools/testing/selftests/drivers/net/config
> @@ -4,6 +4,10 @@ CONFIG_DEBUG_INFO_BTF_MODULES=n
> CONFIG_INET_PSP=y
> CONFIG_IPV6=y
> CONFIG_MACSEC=m
> +CONFIG_NET_ACT_SKBEDIT=m
> +CONFIG_NET_CLS_ACT=y
> +CONFIG_NET_CLS_FLOWER=m
> +CONFIG_NET_CLS_MATCHALL=m
> CONFIG_NETCONSOLE=m
> CONFIG_NETCONSOLE_DYNAMIC=y
> CONFIG_NETCONSOLE_EXTENDED_LOG=y
> diff --git a/tools/testing/selftests/drivers/net/ring_reconfig.py b/tools/testing/selftests/drivers/net/ring_reconfig.py
> index f9530a8b0856..11491a0b7013 100755
> --- a/tools/testing/selftests/drivers/net/ring_reconfig.py
> +++ b/tools/testing/selftests/drivers/net/ring_reconfig.py
> @@ -5,10 +5,18 @@
> Test channel and ring size configuration via ethtool (-L / -G).
> """
>
> +import socket
> +import struct
> +import time
> +
> from lib.py import ksft_run, ksft_exit, ksft_pr
> from lib.py import ksft_eq
> +from lib.py import KsftSkipEx
> from lib.py import NetDrvEpEnv, EthtoolFamily, GenerateTraffic
> -from lib.py import defer, NlError
> +from lib.py import cmd, defer, rand_port, tc, NlError
> +
> +# Added in Python 3.13; fallback to 61 for x86/ARM/MIPS
> +SO_TXTIME = getattr(socket, "SO_TXTIME", 61)
>
>
> def channels(cfg) -> None:
> @@ -151,6 +159,169 @@ def ringparam(cfg) -> None:
> GenerateTraffic(cfg).wait_pkts_and_stop(10000)
>
>
> +def _write_sysfs(cfg, path, val):
> + with open(path, "r", encoding="utf-8") as fp:
> + orig_val = fp.read().strip()
> + if str(val) == orig_val:
> + return
> + with open(path, "w", encoding="utf-8") as fp:
> + fp.write(str(val))
> + defer(lambda p=path, v=orig_val: open(p, "w").write(v))
> +
> +
> +def _get_mq_handle(cfg):
> + qdiscs = tc(f"qdisc show dev {cfg.ifname}", json=True)
> + for q in qdiscs:
> + if q.get("kind") == "mq":
> + return q["handle"]
> + raise KsftSkipEx(f"no mq qdisc found on {cfg.ifname}")
> +
> +
> +def _get_qdisc_backlog(cfg, queue, mq_handle):
> + qdiscs = tc(f"-s qdisc show dev {cfg.ifname}", json=True)
> + target_parent = f"{mq_handle}{queue + 1:x}"
> + for q in qdiscs:
> + if q.get("parent", "") == target_parent:
> + return q.get("backlog")
> + return None
> +
> +
> +def _setup_fq_qdisc(cfg, mq_handle, port, target_queue, other_queue):
> + mq_child_parent = f"{mq_handle}{target_queue + 1:x}"
> +
> + # Save the original child qdisc to restore after test
> + qdiscs = tc(f"qdisc show dev {cfg.ifname}", json=True)
> + default_qdisc = cmd("sysctl -n net.core.default_qdisc").stdout.strip()
> + orig_kind = default_qdisc
> + for q in qdiscs:
> + if q.get("parent", "") == mq_child_parent:
> + orig_kind = q.get("kind", default_qdisc)
> + break
> + try:
> + tc(f"qdisc replace dev {cfg.ifname} parent {mq_child_parent} fq")
> + except Exception as exc:
> + raise KsftSkipEx("fq not available (CONFIG_NET_SCH_FQ)") from exc
> + defer(tc,
> + f"qdisc replace dev {cfg.ifname} parent {mq_child_parent} {orig_kind}")
> +
> + qdisc_j = tc(f"qdisc show dev {cfg.ifname}", json=True)
> + has_clsact = any(q['kind'] == 'clsact' for q in qdisc_j)
> + if not has_clsact:
> + tc(f"qdisc add dev {cfg.ifname} clsact")
> + defer(tc, f"qdisc del dev {cfg.ifname} clsact")
> +
> + proto = "ipv6" if int(cfg.addr_ipver) == 6 else "ip"
> + try:
> + tc(f"filter add dev {cfg.ifname} egress protocol {proto} "
> + f"pref 1 flower ip_proto udp dst_port {port} "
> + f"action skbedit queue_mapping {target_queue}")
> + except Exception as exc:
> + raise KsftSkipEx("tc flower/act_skbedit not available") from exc
> + defer(tc, f"filter del dev {cfg.ifname} egress pref 1")
> +
> + tc(f"filter add dev {cfg.ifname} egress pref 100 "
> + f"matchall action skbedit queue_mapping {other_queue}")
> + defer(tc, f"filter del dev {cfg.ifname} egress pref 100")
> +
> +
> +def _create_sotxtime_socket(cfg):
> + sock = socket.socket(socket.AF_INET6 if cfg.addr_ipver == "6"
> + else socket.AF_INET, socket.SOCK_DGRAM)
> + try:
> + sock.setsockopt(socket.SOL_SOCKET, SO_TXTIME, struct.pack("Ii", 1, 0))
> + except OSError as exc:
> + sock.close()
> + raise KsftSkipEx("SO_TXTIME not supported") from exc
> + sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
> + cfg.ifname.encode())
> + return sock
> +
> +
> +def _send_sotxtime_burst(sock, addr, port, count, delay_ns, ipver):
> + payload = b'\x00' * 1400
> + txtime_ns = time.clock_gettime_ns(time.CLOCK_MONOTONIC) + delay_ns
> +
> + ancdata = [(socket.SOL_SOCKET, SO_TXTIME, struct.pack("Q", txtime_ns))]
> + if int(ipver) == 6:
> + dest = (addr, port, 0, 0)
> + else:
> + dest = (addr, port)
> + for _ in range(count):
> + sock.sendmsg([payload], ancdata, 0, dest)
> +
> +
> +def reconfig_tx_stall(cfg) -> None:
> + target_queue = 1
> + other_queue = 0
> +
> + ehdr = {'header': {'dev-index': cfg.ifindex}}
> + chans = cfg.eth.channels_get(ehdr)
> +
> + if 'combined-max' not in chans:
> + raise KsftSkipEx("device does not support combined channels")
> + if chans['combined-count'] < 2:
> + raise KsftSkipEx("need at least 2 combined channels")
> +
> + rings = cfg.eth.rings_get(ehdr)
> + if 'rx' not in rings or 'tx' not in rings:
> + raise KsftSkipEx("device does not expose rx/tx ring params")
> + tx_cur = rings['tx']
> + if tx_cur <= 64:
> + raise KsftSkipEx("tx ring size already at minimum")
> + defer(cfg.eth.rings_set, ehdr | {'tx': tx_cur})
> +
> + tx_min = 64
> + cfg.eth.rings_set(ehdr | {'tx': tx_min})
> +
> + # Slow completions so the ring stays full after FQ releases packets
> + napi_defer = f"/sys/class/net/{cfg.ifname}/napi_defer_hard_irqs"
> + gro_timeout = f"/sys/class/net/{cfg.ifname}/gro_flush_timeout"
> + _write_sysfs(cfg, napi_defer, 100)
> + _write_sysfs(cfg, gro_timeout, 1000000000)
> +
> + mq_handle = _get_mq_handle(cfg)
> + port = rand_port()
> + _setup_fq_qdisc(cfg, mq_handle, port, target_queue, other_queue)
> +
> + sock = _create_sotxtime_socket(cfg)
> + defer(sock.close)
> +
> + pkt_count = tx_min * 2
> +
> + for delay_ms in [100, 200, 500]:
> + delay_ns = delay_ms * 1_000_000
> + _send_sotxtime_burst(sock, cfg.remote_addr, port, pkt_count,
> + delay_ns, cfg.addr_ipver)
> + ksft_pr(f"Sent {pkt_count} SO_TXTIME packets (+{delay_ms}ms)")
> + time.sleep(delay_ms / 1000 + 0.3)
> +
> + backlog = _get_qdisc_backlog(cfg, target_queue, mq_handle)
> + if backlog:
> + break
> + else:
> + raise KsftSkipEx("failed to build qdisc backlog")
> +
> + ksft_pr(f"Backlog before reconfig: {backlog} bytes")
> +
> + # Trigger ring reconfig — driver should call wake, not just start
> + cfg.eth.rings_set(ehdr | {'tx': tx_cur})
> +
> + # Let completions proceed normally
> + _write_sysfs(cfg, napi_defer, 0)
> + _write_sysfs(cfg, gro_timeout, 0)
> +
> + # Poll for backlog to drain
> + for _ in range(100):
> + backlog = _get_qdisc_backlog(cfg, target_queue, mq_handle)
> + if not backlog:
> + break
> + time.sleep(0.1)
> +
> + ksft_eq(0, backlog,
> + comment=f"qdisc backlog stuck on queue {target_queue} "
> + f"after ring reconfig")
> +
> +
> def main() -> None:
> """ Ksft boiler plate main """
>
> @@ -158,7 +329,8 @@ def main() -> None:
The NetDrvEpEnv() setup needs to ask for 2+ queues, otherwise this
gets skipped in netdevsim mode:
# ok 3 ring_reconfig.reconfig_tx_stall # SKIP need at least 2 combined channels
> cfg.eth = EthtoolFamily()
>
> ksft_run([channels,
> - ringparam],
> + ringparam,
> + reconfig_tx_stall],
> args=(cfg, ))
> ksft_exit()
>
^ permalink raw reply
* Re: [PATCH net-next v2 8/8] net: dsa: mt7530: implement port_change_conduit op
From: Daniel Golle @ 2026-06-13 16:09 UTC (permalink / raw)
To: Chester A. Unal, Andrew Lunn, Vladimir Oltean, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Matthias Brugger,
AngeloGioacchino Del Regno, Russell King, netdev, linux-kernel,
linux-arm-kernel, linux-mediatek
In-Reply-To: <8dd8cfe32bc8e38b92c49e30a6255090fb0998fb.1781312667.git.daniel@makrotopia.org>
On Sat, Jun 13, 2026 at 02:11:45AM +0100, Daniel Golle wrote:
> Allow changing the CPU port affinity of user ports at runtime via the
> IFLA_DSA_CONDUIT netlink attribute. This updates the port matrix to
> forward to the new CPU port instead of the old one.
>
> Limit the operation to MT7531. There, trapped link-local frames follow
> the per-port affinity, as the MT7531_CPU_PMAP destination mask is
> further restricted by the port matrix. A conduit change is hence fully
> honoured by the hardware, for regular traffic as well as for trapped
> frames.
>
> The MT7530 switch, including the variant embedded in the MT7621 SoC,
> instead traps frames to the single CPU port set in the CPU_PORT field
> of the MFC register, regardless of the affinity of the inbound user
> port. With user ports affine to different CPU ports there is no
> correct value for that field, so per-port CPU affinity cannot be fully
> implemented for trapped frames. Routing a WAN port via the second SoC
> GMAC is conventionally covered by the PHY muxing feature on these
> switches, which bypasses the switch fabric and does not involve a CPU
> port at all.
>
> The switches on the MT7988, EN7581 and AN7583 SoCs only have a
> single CPU port, leaving no other conduit to change to.
>
> Signed-off-by: Daniel Golle <daniel@makrotopia.org>
I forgot to include the previously received
Acked-by: Chester A. Unal <chester.a.unal@arinc9.com>
See also:
https://patchwork.kernel.org/comment/27003848/
https://lore.kernel.org/all/02ad5de0-ea6a-4267-8686-72e3f98fce4e@arinc9.com/
^ permalink raw reply
* [PATCH] net/mlx5: Fix wrong register access in mlx5_query_mtppse()
From: lirongqing @ 2026-06-13 15:36 UTC (permalink / raw)
To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, netdev, linux-rdma, linux-kernel
Cc: Li RongQing
From: Li RongQing <lirongqing@baidu.com>
In mlx5_query_mtppse(), the result of the mtppse_reg query should be read
from the output buffer 'out', not the input buffer 'in'. The function
currently reads event_arm and event_generation_mode from 'in', which
contains the uninitialized query parameters rather than the actual
register values.
Fix by reading from the correct buffer 'out'.
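
The effect of reading the wrong buffer can be shown with a toy model. This Python sketch is not driver code: access_reg(), query_buggy() and query_fixed() are hypothetical stand-ins, and it assumes the input buffer carries only the pin number with the remaining fields zero.

```python
def access_reg(device_regs, inbuf):
    """Toy register query: fill a fresh output buffer from device state."""
    out = dict(inbuf)
    out.update(device_regs[inbuf["pin"]])
    return out


def query_buggy(device_regs, pin):
    inbuf = {"pin": pin, "event_arm": 0, "event_generation_mode": 0}
    access_reg(device_regs, inbuf)
    # Bug: results are taken from the request, not from the response.
    return inbuf["event_arm"], inbuf["event_generation_mode"]


def query_fixed(device_regs, pin):
    inbuf = {"pin": pin, "event_arm": 0, "event_generation_mode": 0}
    out = access_reg(device_regs, inbuf)
    return out["event_arm"], out["event_generation_mode"]


# Pretend pin 2 is armed with generation mode 2 in hardware.
regs = {2: {"event_arm": 1, "event_generation_mode": 2}}
buggy = query_buggy(regs, 2)
fixed = query_fixed(regs, 2)
```

In the model the buggy variant reports whatever the caller put into the request, regardless of the device state.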
Fixes: f9a1ef720e9e ("net/mlx5: Add MTPPS and MTPPSE registers infrastructure")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
drivers/net/ethernet/mellanox/mlx5/core/port.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/port.c b/drivers/net/ethernet/mellanox/mlx5/core/port.c
index ee8b976..2ab6a6a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/port.c
@@ -921,8 +921,8 @@ int mlx5_query_mtppse(struct mlx5_core_dev *mdev, u8 pin, u8 *arm, u8 *mode)
if (err)
return err;
- *arm = MLX5_GET(mtppse_reg, in, event_arm);
- *mode = MLX5_GET(mtppse_reg, in, event_generation_mode);
+ *arm = MLX5_GET(mtppse_reg, out, event_arm);
+ *mode = MLX5_GET(mtppse_reg, out, event_generation_mode);
return err;
}
--
2.9.4
^ permalink raw reply related
* [PATCH] net/mlx5: Free steering tag data on release
From: lirongqing @ 2026-06-13 15:37 UTC (permalink / raw)
To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, netdev, linux-rdma, linux-kernel
Cc: Li RongQing
From: Li RongQing <lirongqing@baidu.com>
mlx5_st_alloc_index() allocates an mlx5_st_idx_data object for
each new steering tag table index and stores it in the xarray.
When the last user releases the index, mlx5_st_dealloc_index()
removes the entry from the xarray but does not free the backing
object, leaking memory.
Free idx_data after erasing the xarray entry once the refcount
reaches zero.
Fixes: 888a7776f4fb0 ("net/mlx5: Add support for device steering tag")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
drivers/net/ethernet/mellanox/mlx5/core/lib/st.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
index 997be91..7cedc34 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/st.c
@@ -175,6 +175,7 @@ int mlx5_st_dealloc_index(struct mlx5_core_dev *dev, u16 st_index)
if (refcount_dec_and_test(&idx_data->usecount)) {
xa_erase(&st->idx_xa, st_index);
+ kfree(idx_data);
/* We leave PCI config space as was before, no mkey will refer to it */
}
--
2.9.4
^ permalink raw reply related
* [PATCH] net/mlx5: Fix L3 tunnel entropy refcount leak
From: lirongqing @ 2026-06-13 15:36 UTC (permalink / raw)
To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, netdev, linux-rdma, linux-kernel
Cc: Li RongQing
From: Li RongQing <lirongqing@baidu.com>
mlx5_tun_entropy_refcount_inc() counts both VXLAN and L2-to-L3
tunnel reformat entries as entropy-enabling users. The matching
decrement path only handled VXLAN, leaving L2-to-L3 tunnel entries
counted after release.
Handle MLX5_REFORMAT_TYPE_L2_TO_L3_TUNNEL in
mlx5_tun_entropy_refcount_dec() as well so the enabling entry
refcount remains balanced.
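
A simplified model of the imbalance, for illustration only. The Python below keeps just the counter and ignores the locking and the entropy enable/disable logic of the real functions; the class and constant names are hypothetical.

```python
L2_TO_VXLAN = "l2_to_vxlan"
L2_TO_L3_TUNNEL = "l2_to_l3_tunnel"


class TunEntropyModel:
    """Counts entropy-enabling users only; everything else is left out."""
    def __init__(self):
        self.num_enabling_entries = 0

    def refcount_inc(self, reformat_type):
        if reformat_type in (L2_TO_VXLAN, L2_TO_L3_TUNNEL):
            self.num_enabling_entries += 1

    def refcount_dec_old(self, reformat_type):
        # Old behaviour: only VXLAN is matched on release.
        if reformat_type == L2_TO_VXLAN:
            self.num_enabling_entries -= 1

    def refcount_dec_new(self, reformat_type):
        # Patched behaviour: mirror the types counted in refcount_inc().
        if reformat_type in (L2_TO_VXLAN, L2_TO_L3_TUNNEL):
            self.num_enabling_entries -= 1


old = TunEntropyModel()
old.refcount_inc(L2_TO_L3_TUNNEL)
old.refcount_dec_old(L2_TO_L3_TUNNEL)

new = TunEntropyModel()
new.refcount_inc(L2_TO_L3_TUNNEL)
new.refcount_dec_new(L2_TO_L3_TUNNEL)
```

With the old decrement the counter stays at one after the only L2-to-L3 user is gone, so it can never return to zero.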
Fixes: f828ca6a2fb6 ("net/mlx5e: Add support for hw encapsulation of MPLS over UDP")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c
index 4571c56..97f6097 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c
@@ -176,7 +176,8 @@ void mlx5_tun_entropy_refcount_dec(struct mlx5_tun_entropy *tun_entropy,
int reformat_type)
{
mutex_lock(&tun_entropy->lock);
- if (reformat_type == MLX5_REFORMAT_TYPE_L2_TO_VXLAN)
+ if (reformat_type == MLX5_REFORMAT_TYPE_L2_TO_VXLAN ||
+ reformat_type == MLX5_REFORMAT_TYPE_L2_TO_L3_TUNNEL)
tun_entropy->num_enabling_entries--;
else if (reformat_type == MLX5_REFORMAT_TYPE_L2_TO_NVGRE &&
--tun_entropy->num_disabling_entries == 0)
--
2.9.4
^ permalink raw reply related
* [PATCH iproute2-next v2] ipaddress: add support for showing IPv4 devconf attributes
From: Fernando Fernandez Mancera @ 2026-06-13 6:57 UTC (permalink / raw)
To: netdev
Cc: dsahern, stephen, davem, edumazet, kuba, pabeni, horms,
Fernando Fernandez Mancera
This patch introduces support for showing IPv4 devconf attributes in the
detailed output of an interface, e.g. "ip -d link show dev enp1s0".
Additionally, this refactors 'print_af_spec()' to sequentially process
both AF_INET and AF_INET6 attributes rather than returning early if
AF_INET6 is missing.
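
The IPV4_DEVCONF_* constants are 1-based, so the patch reads entry X of the IFLA_INET_CONF array at index X - 1 and only after checking that the payload holds at least X integers. A small Python sketch of that lookup follows; devconf_get() is a hypothetical helper, and the two constant values are quoted from my reading of linux/ip.h, so treat them as assumptions.

```python
import struct

# Assumed values from the uapi enum in linux/ip.h (1-based).
IPV4_DEVCONF_FORWARDING = 1
IPV4_DEVCONF_RP_FILTER = 8


def devconf_get(payload, attr):
    """Return devconf entry 'attr' from a packed int array, or None if absent."""
    max_elements = len(payload) // struct.calcsize("=i")
    if max_elements < attr:
        return None                      # kernel did not report this entry
    offset = (attr - 1) * struct.calcsize("=i")
    return struct.unpack_from("=i", payload, offset)[0]


# Eight entries as a kernel might report them: forwarding=1, rp_filter=2.
full = struct.pack("=8i", 1, 0, 0, 1, 1, 1, 1, 2)
short = struct.pack("=2i", 1, 0)

forwarding = devconf_get(full, IPV4_DEVCONF_FORWARDING)
rp_filter = devconf_get(full, IPV4_DEVCONF_RP_FILTER)
missing = devconf_get(short, IPV4_DEVCONF_RP_FILTER)
```

The bounds check is what keeps the output correct against kernels that report fewer entries than the headers iproute2 was built with.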
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
v2: changed print_string to print_bool for boolean attributes
---
ip/ipaddress.c | 239 ++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 199 insertions(+), 40 deletions(-)
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 6017bc83..0dd2aa87 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -23,6 +23,7 @@
#include <linux/netdevice.h>
#include <linux/if_arp.h>
#include <linux/if_infiniband.h>
+#include <linux/ip.h>
#include <linux/sockios.h>
#include <linux/net_namespace.h>
@@ -294,53 +295,211 @@ static void print_linktype(FILE *fp, struct rtattr *tb)
close_json_object();
}
+static void print_inet(FILE *fp, struct rtattr *inet_attr)
+{
+ struct rtattr *tb[IFLA_INET_MAX + 1];
+
+ parse_rtattr_nested(tb, IFLA_INET_MAX, inet_attr);
+
+ if (tb[IFLA_INET_CONF]) {
+ int *conf = RTA_DATA(tb[IFLA_INET_CONF]);
+ int max_elements = RTA_PAYLOAD(tb[IFLA_INET_CONF]) / sizeof(int);
+
+ if (max_elements >= IPV4_DEVCONF_FORWARDING)
+ print_bool(PRINT_ANY, "forwarding", "forwarding %s ",
+ conf[IPV4_DEVCONF_FORWARDING - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_MC_FORWARDING)
+ print_bool(PRINT_ANY, "mc_forwarding", "mc_forwarding %s ",
+ conf[IPV4_DEVCONF_MC_FORWARDING - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_PROXY_ARP)
+ print_bool(PRINT_ANY, "proxy_arp", "proxy_arp %s ",
+ conf[IPV4_DEVCONF_PROXY_ARP - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_ACCEPT_REDIRECTS)
+ print_bool(PRINT_ANY, "accept_redirects",
+ "accept_redirects %s ",
+ conf[IPV4_DEVCONF_ACCEPT_REDIRECTS - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_SECURE_REDIRECTS)
+ print_bool(PRINT_ANY, "secure_redirects",
+ "secure_redirects %s ",
+ conf[IPV4_DEVCONF_SECURE_REDIRECTS - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_SEND_REDIRECTS)
+ print_bool(PRINT_ANY, "send_redirects", "send_redirects %s ",
+ conf[IPV4_DEVCONF_SEND_REDIRECTS - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_SHARED_MEDIA)
+ print_bool(PRINT_ANY, "shared_media", "shared_media %s ",
+ conf[IPV4_DEVCONF_SHARED_MEDIA - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_RP_FILTER)
+ print_int(PRINT_ANY, "rp_filter", "rp_filter %d ",
+ conf[IPV4_DEVCONF_RP_FILTER - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_ACCEPT_SOURCE_ROUTE)
+ print_bool(PRINT_ANY, "accept_source_route",
+ "accept_source_route %s ",
+ conf[IPV4_DEVCONF_ACCEPT_SOURCE_ROUTE - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_BOOTP_RELAY)
+ print_bool(PRINT_ANY, "bootp_relay", "bootp_relay %s ",
+ conf[IPV4_DEVCONF_BOOTP_RELAY - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_LOG_MARTIANS)
+ print_bool(PRINT_ANY, "log_martians", "log_martians %s ",
+ conf[IPV4_DEVCONF_LOG_MARTIANS - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_TAG)
+ print_int(PRINT_ANY, "tag", "tag %d ",
+ conf[IPV4_DEVCONF_TAG - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_ARPFILTER)
+ print_bool(PRINT_ANY, "arpfilter", "arpfilter %s ",
+ conf[IPV4_DEVCONF_ARPFILTER - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_MEDIUM_ID)
+ print_int(PRINT_ANY, "medium_id", "medium_id %d ",
+ conf[IPV4_DEVCONF_MEDIUM_ID - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_NOXFRM)
+ print_bool(PRINT_ANY, "noxfrm", "noxfrm %s ",
+ conf[IPV4_DEVCONF_NOXFRM - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_NOPOLICY)
+ print_bool(PRINT_ANY, "nopolicy", "nopolicy %s ",
+ conf[IPV4_DEVCONF_NOPOLICY - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_FORCE_IGMP_VERSION)
+ print_int(PRINT_ANY, "force_igmp_version", "force_igmp_version %d ",
+ conf[IPV4_DEVCONF_FORCE_IGMP_VERSION - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_ARP_ANNOUNCE)
+ print_int(PRINT_ANY, "arp_announce", "arp_announce %d ",
+ conf[IPV4_DEVCONF_ARP_ANNOUNCE - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_ARP_IGNORE)
+ print_int(PRINT_ANY, "arp_ignore", "arp_ignore %d ",
+ conf[IPV4_DEVCONF_ARP_IGNORE - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_PROMOTE_SECONDARIES)
+ print_bool(PRINT_ANY, "promote_secondaries",
+ "promote_secondaries %s ",
+ conf[IPV4_DEVCONF_PROMOTE_SECONDARIES - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_ARP_ACCEPT)
+ print_int(PRINT_ANY, "arp_accept", "arp_accept %d ",
+ conf[IPV4_DEVCONF_ARP_ACCEPT - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_ARP_NOTIFY)
+ print_bool(PRINT_ANY, "arp_notify", "arp_notify %s ",
+ conf[IPV4_DEVCONF_ARP_NOTIFY - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_ACCEPT_LOCAL)
+ print_bool(PRINT_ANY, "accept_local", "accept_local %s ",
+ conf[IPV4_DEVCONF_ACCEPT_LOCAL - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_SRC_VMARK)
+ print_bool(PRINT_ANY, "src_vmark", "src_vmark %s ",
+ conf[IPV4_DEVCONF_SRC_VMARK - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_PROXY_ARP_PVLAN)
+ print_bool(PRINT_ANY, "proxy_arp_pvlan", "proxy_arp_pvlan %s ",
+ conf[IPV4_DEVCONF_PROXY_ARP_PVLAN - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_ROUTE_LOCALNET)
+ print_bool(PRINT_ANY, "route_localnet", "route_localnet %s ",
+ conf[IPV4_DEVCONF_ROUTE_LOCALNET - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_BC_FORWARDING)
+ print_bool(PRINT_ANY, "bc_forwarding", "bc_forwarding %s ",
+ conf[IPV4_DEVCONF_BC_FORWARDING - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL)
+ print_int(PRINT_ANY, "igmpv2_unsolicited_report_interval",
+ "igmpv2_unsolicited_report_interval %d ",
+ conf[IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL)
+ print_int(PRINT_ANY, "igmpv3_unsolicited_report_interval",
+ "igmpv3_unsolicited_report_interval %d ",
+ conf[IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN)
+ print_bool(PRINT_ANY, "ignore_routes_with_linkdown",
+ "ignore_routes_with_linkdown %s ",
+ conf[IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_DROP_UNICAST_IN_L2_MULTICAST)
+ print_bool(PRINT_ANY, "drop_unicast_in_l2_multicast",
+ "drop_unicast_in_l2_multicast %s ",
+ conf[IPV4_DEVCONF_DROP_UNICAST_IN_L2_MULTICAST - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_DROP_GRATUITOUS_ARP)
+ print_bool(PRINT_ANY, "drop_gratuitous_arp",
+ "drop_gratuitous_arp %s ",
+ conf[IPV4_DEVCONF_DROP_GRATUITOUS_ARP - 1]);
+
+ if (max_elements >= IPV4_DEVCONF_ARP_EVICT_NOCARRIER)
+ print_bool(PRINT_ANY, "arp_evict_nocarrier",
+ "arp_evict_nocarrier %s ",
+ conf[IPV4_DEVCONF_ARP_EVICT_NOCARRIER - 1]);
+ }
+}
+
static void print_af_spec(FILE *fp, struct rtattr *af_spec_attr)
{
- struct rtattr *inet6_attr;
struct rtattr *tb[IFLA_INET6_MAX + 1];
+ struct rtattr *inet6_attr;
+ struct rtattr *inet_attr;
- inet6_attr = parse_rtattr_one_nested(AF_INET6, af_spec_attr);
- if (!inet6_attr)
- return;
+ inet_attr = parse_rtattr_one_nested(AF_INET, af_spec_attr);
+ if (inet_attr)
+ print_inet(fp, inet_attr);
- parse_rtattr_nested(tb, IFLA_INET6_MAX, inet6_attr);
+ inet6_attr = parse_rtattr_one_nested(AF_INET6, af_spec_attr);
+ if (inet6_attr) {
+ parse_rtattr_nested(tb, IFLA_INET6_MAX, inet6_attr);
- if (tb[IFLA_INET6_ADDR_GEN_MODE]) {
- __u8 mode = rta_getattr_u8(tb[IFLA_INET6_ADDR_GEN_MODE]);
- SPRINT_BUF(b1);
+ if (tb[IFLA_INET6_ADDR_GEN_MODE]) {
+ __u8 mode = rta_getattr_u8(tb[IFLA_INET6_ADDR_GEN_MODE]);
- switch (mode) {
- case IN6_ADDR_GEN_MODE_EUI64:
- print_string(PRINT_ANY,
- "inet6_addr_gen_mode",
- "addrgenmode %s ",
- "eui64");
- break;
- case IN6_ADDR_GEN_MODE_NONE:
- print_string(PRINT_ANY,
- "inet6_addr_gen_mode",
- "addrgenmode %s ",
- "none");
- break;
- case IN6_ADDR_GEN_MODE_STABLE_PRIVACY:
- print_string(PRINT_ANY,
- "inet6_addr_gen_mode",
- "addrgenmode %s ",
- "stable_secret");
- break;
- case IN6_ADDR_GEN_MODE_RANDOM:
- print_string(PRINT_ANY,
- "inet6_addr_gen_mode",
- "addrgenmode %s ",
- "random");
- break;
- default:
- snprintf(b1, sizeof(b1), "%#.2hhx", mode);
- print_string(PRINT_ANY,
- "inet6_addr_gen_mode",
- "addrgenmode %s ",
- b1);
- break;
+ SPRINT_BUF(b1);
+ switch (mode) {
+ case IN6_ADDR_GEN_MODE_EUI64:
+ print_string(PRINT_ANY,
+ "inet6_addr_gen_mode",
+ "addrgenmode %s ",
+ "eui64");
+ break;
+ case IN6_ADDR_GEN_MODE_NONE:
+ print_string(PRINT_ANY,
+ "inet6_addr_gen_mode",
+ "addrgenmode %s ",
+ "none");
+ break;
+ case IN6_ADDR_GEN_MODE_STABLE_PRIVACY:
+ print_string(PRINT_ANY,
+ "inet6_addr_gen_mode",
+ "addrgenmode %s ",
+ "stable_secret");
+ break;
+ case IN6_ADDR_GEN_MODE_RANDOM:
+ print_string(PRINT_ANY,
+ "inet6_addr_gen_mode",
+ "addrgenmode %s ",
+ "random");
+ break;
+ default:
+ snprintf(b1, sizeof(b1), "%#.2hhx", mode);
+ print_string(PRINT_ANY,
+ "inet6_addr_gen_mode",
+ "addrgenmode %s ",
+ b1);
+ break;
+ }
}
}
}
--
2.54.0
^ permalink raw reply related
* [PATCH net v2 2/2] selftests: pmtu: fix incorrect PMTU exception generation
From: Laika Price via B4 Relay @ 2026-06-13 15:12 UTC (permalink / raw)
To: David Ahern, Ido Schimmel, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Shuah Khan
Cc: netdev, linux-kernel, linux-kselftest, Laika Price
In-Reply-To: <20260613-master-v2-0-061b70fd45dd@gmail.com>
From: Laika Price <laikabcprice@gmail.com>
pmtu_ipv4_br_vxlan4_exception generates PMTU exceptions by pinging an IP
on the other side of a tunnel. This was incorrect as it would return upon
the first ICMP Fragmentation Needed due to the -w flag being used in
conjunction with || return 1.
This patch updates pmtu_ipv4_br_vxlan4_exception to be in line with how
PMTU exceptions are generated in other tests, such as test_pmtu_ipvX:
run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s 1800 ${dst1}
run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s 1800 ${dst2}
Signed-off-by: Laika Price <laikabcprice@gmail.com>
---
tools/testing/selftests/net/pmtu.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index a3323c21f..9498d9f53 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -1456,8 +1456,8 @@ test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception() {
mtu "${ns_a}" ${type}_a $((${ll_mtu} + 1000))
mtu "${ns_b}" ${type}_b $((${ll_mtu} + 1000))
- run_cmd ${ns_c} ${ping} -q -M want -i 0.1 -c 10 -s $((${ll_mtu} + 500)) ${dst} || return 1
- run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s $((${ll_mtu} + 500)) ${dst} || return 1
+ run_cmd ${ns_c} ${ping} -q -M want -i 0.1 -w 1 -s $((${ll_mtu} + 500)) ${dst}
+ run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s $((${ll_mtu} + 500)) ${dst}
# Check that exceptions were created
pmtu="$(route_get_dst_pmtu_from_exception "${ns_c}" ${dst})"
--
2.54.0
^ permalink raw reply related
* [PATCH net v2 1/2] ip_tunnel: drop stale dst from generated PMTU ICMP replies
From: Laika Price via B4 Relay @ 2026-06-13 15:12 UTC (permalink / raw)
To: David Ahern, Ido Schimmel, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Shuah Khan
Cc: netdev, linux-kernel, linux-kselftest, Laika Price
In-Reply-To: <20260613-master-v2-0-061b70fd45dd@gmail.com>
From: Laika Price <laikabcprice@gmail.com>
iptunnel_pmtud_build_icmp(...) and iptunnel_pmtud_build_icmpv6(...) take
in an sk_buff, modify it to create a PMTU ICMP error reply, and return it.
As part of these modifications, the source/destination ethernet and IP
addresses are swapped around which makes the sk_buff's current dst invalid.
If the stale dst is left, the packet can skip input routing and be
forwarded using the original output device. This was observed when sending
packets to a VXLAN over a WireGuard tunnel - the ICMP reply was generated
but it was sent over the VXLAN instead of to the WireGuard tunnel.
Drop the stale dst after building the PMTU reply so that the packet is
routed using its new headers when it is reinjected.
Signed-off-by: Laika Price <laikabcprice@gmail.com>
---
net/ipv4/ip_tunnel_core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index d3c677e9b..949150e43 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -267,6 +267,7 @@ static int iptunnel_pmtud_build_icmp(struct sk_buff *skb, int mtu)
eth_header(skb, skb->dev, ntohs(eh.h_proto), eh.h_source, eh.h_dest, 0);
skb_reset_mac_header(skb);
+ skb_dst_drop(skb);
return skb->len;
}
@@ -370,6 +371,7 @@ static int iptunnel_pmtud_build_icmpv6(struct sk_buff *skb, int mtu)
eth_header(skb, skb->dev, ntohs(eh.h_proto), eh.h_source, eh.h_dest, 0);
skb_reset_mac_header(skb);
+ skb_dst_drop(skb);
return skb->len;
}
--
2.54.0
^ permalink raw reply related
* [PATCH net v2 0/2] ip_tunnel: fix PMTU ICMP reply routing
From: Laika Price via B4 Relay @ 2026-06-13 15:12 UTC (permalink / raw)
To: David Ahern, Ido Schimmel, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Shuah Khan
Cc: netdev, linux-kernel, linux-kselftest, Laika Price
---
Changes in v2:
- Fix incorrect PMTU exceptions test
- Link to v1: https://patch.msgid.link/20260613-master-v1-1-df796e8e2d74@gmail.com
To: David Ahern <dsahern@kernel.org>
To: Ido Schimmel <idosch@nvidia.com>
To: "David S. Miller" <davem@davemloft.net>
To: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
To: Simon Horman <horms@kernel.org>
To: Shuah Khan <shuah@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
Laika Price (2):
[net] ip_tunnel: drop stale dst from generated PMTU ICMP replies
[net] selftests: pmtu: fix incorrect PMTU exception generation
net/ipv4/ip_tunnel_core.c | 2 ++
tools/testing/selftests/net/pmtu.sh | 4 ++--
2 files changed, 4 insertions(+), 2 deletions(-)
---
base-commit: 2a2974b5145cdf2f4db134be1a2157e9ca4a1cf0
change-id: 20260613-master-a299166b9069
Best regards,
--
Laika Price <laikabcprice@gmail.com>
^ permalink raw reply
* [PATCH net] appletalk: aarp: fix proxy probe conflict lookup
From: Yizhou Zhao @ 2026-06-13 15:00 UTC (permalink / raw)
To: netdev
Cc: Yizhou Zhao, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Kito Xu (veritas501), Kees Cook,
linux-kernel, Yuxiang Yang, Ao Wang, Xuewei Feng, Qi Li, Ke Xu,
stable
aarp_rcv() computes hash from the packet source node and later uses it
for the normal AARP reply lookup against the unresolved table. The same
hash is also reused earlier for the proxy probe conflict check, but that
check builds its lookup key from the packet destination address.
Proxy AARP entries are inserted into the proxy table using the proxied
address node as the hash key. AARP packets are not required to have the
same source and destination node numbers, so the proxy probe conflict
check can search the wrong bucket and miss an entry that is still in
ATIF_PROBE state.
If that happens, SIOCSARP can accept a proxy address even though a
conflicting AARP packet was observed on the wire. This can create
duplicate AppleTalk address ownership. Depending on the network setup,
traffic for that address may then be misdirected, or the address may
become intermittently unreachable.
Look up the proxy probe entry using a hash derived from da.s_node, which
matches how proxy entries are inserted and removed. Leave the source-node
hash unchanged for the later unresolved-entry reply handling.
In a veth/SNAP/AARP reproducer on a KASAN-enabled kernel, a conflicting
AARP packet with different source and destination nodes allowed SIOCSARP
to succeed before this change. With this change, the same conflict
returns EADDRINUSE, while a no-conflict proxy add still succeeds.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
Reported-by: Ao Wang <wangao@seu.edu.cn>
Reported-by: Xuewei Feng <fengxw06@126.com>
Reported-by: Qi Li <qli01@tsinghua.edu.cn>
Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
Assisted-by: GLM:GLM-5.1
Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
---
net/appletalk/aarp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/appletalk/aarp.c b/net/appletalk/aarp.c
index 078fb7a6efa5..1352ede79668 100644
--- a/net/appletalk/aarp.c
+++ b/net/appletalk/aarp.c
@@ -755,7 +755,8 @@ static int aarp_rcv(struct sk_buff *skb, struct net_device *dev,
da.s_net = ea->pa_dst_net;
write_lock_bh(&aarp_lock);
- a = __aarp_find_entry(proxies[hash], dev, &da);
+ a = __aarp_find_entry(proxies[da.s_node % (AARP_HASH_SIZE - 1)],
+ dev, &da);
if (a && a->status & ATIF_PROBE) {
a->status |= ATIF_PROBE_FAIL;
--
2.43.0
^ permalink raw reply related
* Re: [PATCH] r8152: add vendor/device ID for CoreChips SR9900
From: Nicolai Buchwitz @ 2026-06-13 14:57 UTC (permalink / raw)
To: zjzhao; +Cc: hayeswang, andrew+netdev, linux-usb, netdev, linux-kernel
In-Reply-To: <20260613090154.1975753-1-zjzhao@edatec.cn>
On 13.6.2026 11:01, zjzhao@edatec.cn wrote:
> From: zjzhao-eda <zjzhao@edatec.cn>
>
> The CoreChips SR9900 (0x0fe6:0x9900) is a USB 2.0 10/100
> Ethernet adapter. Testing shows it works correctly with the
> r8152 driver, reaching wire speed (94 Mbps) with zero packet
> loss on both TCP and UDP.
Do you know how they differ from the other CoreChips devices (e.g.
drivers/net/usb/sr9800.c and others)?
>
> Tested on Raspberry Pi, including hotplug and extended data
> transfer.
>
> Signed-off-by: zjzhao-eda <zjzhao@edatec.cn>
> ---
> drivers/net/usb/r8152.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
> index d61074178279..ea1733e3619c 100644
> --- a/drivers/net/usb/r8152.c
> +++ b/drivers/net/usb/r8152.c
> @@ -10062,6 +10062,7 @@ static const struct usb_device_id
> rtl8152_table[] = {
> { USB_DEVICE(VENDOR_ID_DELL, 0xb097) },
> { USB_DEVICE(VENDOR_ID_ASUS, 0x1976) },
> { USB_DEVICE(VENDOR_ID_TRENDNET, 0xe02b) },
> + { USB_DEVICE(0x0fe6, 0x9900) },
Instead of hardcoded 0x0fe6, please add a proper VENDOR_ID define in
include/linux/usb/r8152.h
> {}
> };
Thanks
Nicolai
^ permalink raw reply
* Re: [PATCH 6.12.y v3 0/2] xfrm: hold dev ref until after transport_finish NF_HOOK
From: Sasha Levin @ 2026-06-13 14:51 UTC (permalink / raw)
To: Steffen Klassert, Herbert Xu, David S . Miller, David Ahern,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev,
linux-kernel, stable, Simon Liebold
Cc: Sasha Levin, Simon Liebold
In-Reply-To: <20260612111327.1613710-1-simonlie@amazon.de>
On Fri, Jun 12, 2026 at 11:13:25AM +0000, Simon Liebold wrote:
> Thanks for the detailed analysis on v2, Sasha. Here's v3.
>
> v3: Backport b05d42eefac7 ("xfrm: hold device only for the asynchronous
> decryption") as a prerequisite, making the tree structurally match mainline so
> the fix applies without the lifetime gap Sasha identified in v2, where the
> dev_put at resume: dropped the ref before the re-hold could cover it.
Whole series queued for 6.12.y, thanks.
--
Thanks,
Sasha
^ permalink raw reply
* Re: [PATCH net-next v7 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs
From: Simon Schippers @ 2026-06-13 14:14 UTC (permalink / raw)
To: hawk, netdev
Cc: kernel-team, Jonas Köppeler, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Stanislav Fomichev, linux-kernel,
bpf
In-Reply-To: <20260612083530.1650245-6-hawk@kernel.org>
On 6/12/26 10:35, hawk@kernel.org wrote:
> From: Simon Schippers <simon.schippers@tu-dortmund.de>
>
> Per-packet BQL completion forces DQL to converge on limit=2, causing
> excessive NAPI scheduling overhead and qdisc requeues.
>
> Accumulate BQL completions and flush them when a configurable time
> threshold (tx-usecs) is exceeded, letting DQL discover a limit that
> bounds actual queuing delay to the configured interval. Coalescing
> state persists across NAPI polls in struct veth_rq so completions can
> accumulate beyond a single budget=64 cycle.
>
> The flush condition is:
>
> state->time + bql_flush_ns <= current_time || state->n_bql > dql.limit
>
> Flushing when n_bql exceeds dql.limit handles BQL starvation.
>
> The comparison is strictly greater-than because netdev_tx_sent_queue()
> always lets the producer exceed the limit by one before it stops, so
> n_bql == dql.limit is a normal in-flight state. dql.limit lives in
> the same cacheline as the completion path, so the check is cheap.
>
> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
> setting tx-usecs to 0 disables coalescing and falls back to per-packet
> completion.
>
> ethtool -C <veth-dev> tx-usecs 500 # 500us coalescing
> ethtool -C <veth-dev> tx-usecs 0 # per-packet (no coalescing)
>
> Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Co-developed-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
> Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
> ---
> drivers/net/veth.c | 123 ++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 117 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 2473f730734b..c62d87a8402c 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -28,6 +28,7 @@
> #include <linux/bpf_trace.h>
> #include <linux/net_tstamp.h>
> #include <linux/skbuff_ref.h>
> +#include <linux/sched/clock.h>
> #include <net/page_pool/helpers.h>
>
> #define DRV_NAME "veth"
> @@ -50,6 +51,7 @@
> * delay => 64 * 250 ms = 16 s.
> */
> #define VETH_WATCHDOG_TIMEOUT_MS (64 * 250)
> +#define VETH_BQL_COAL_TX_USECS 100 /* default tx-usecs for BQL batching */
>
> struct veth_stats {
> u64 rx_drops;
> @@ -69,6 +71,11 @@ struct veth_rq_stats {
> struct u64_stats_sync syncp;
> };
>
> +struct veth_bql_state {
> + u64 time; /* sched_clock() when current coalescing window started */
> + uint n_bql; /* BQL completions batched in the current window */
> +};
> +
> struct veth_rq {
> struct napi_struct xdp_napi;
> struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
> @@ -76,6 +83,7 @@ struct veth_rq {
> struct bpf_prog __rcu *xdp_prog;
> struct xdp_mem_info xdp_mem;
> struct veth_rq_stats stats;
> + struct veth_bql_state bql_state;
> bool rx_notify_masked;
> struct ptr_ring xdp_ring;
> struct xdp_rxq_info xdp_rxq;
> @@ -88,6 +96,7 @@ struct veth_priv {
> struct bpf_prog *_xdp_prog;
> struct veth_rq *rq;
> unsigned int requested_headroom;
> + unsigned int tx_coal_usecs; /* BQL completion coalescing */
> };
>
> struct veth_xdp_tx_bq {
> @@ -272,7 +281,56 @@ static void veth_get_channels(struct net_device *dev,
> static int veth_set_channels(struct net_device *dev,
> struct ethtool_channels *ch);
>
> +static int veth_get_coalesce(struct net_device *dev,
> + struct ethtool_coalesce *ec,
> + struct kernel_ethtool_coalesce *kernel_coal,
> + struct netlink_ext_ack *extack)
> +{
> + struct veth_priv *priv = netdev_priv(dev);
> +
> + ec->tx_coalesce_usecs = priv->tx_coal_usecs;
> + return 0;
> +}
> +
> +static int veth_set_coalesce(struct net_device *dev,
> + struct ethtool_coalesce *ec,
> + struct kernel_ethtool_coalesce *kernel_coal,
> + struct netlink_ext_ack *extack)
> +{
> + struct veth_priv *priv = netdev_priv(dev);
> + struct net_device *peer;
> +
> + /* The coalescing window delays BQL completions, so keep tx-usecs well
> + * below the tx_timeout watchdog; otherwise a large value could stall a
> + * stopped queue long enough to trip a false watchdog timeout. Cap at
> + * half the watchdog to leave a generous safety margin. tx-usecs is
> + * microseconds, the watchdog is milliseconds.
> + */
> + if (ec->tx_coalesce_usecs > VETH_WATCHDOG_TIMEOUT_MS / 2 * USEC_PER_MSEC) {
> + NL_SET_ERR_MSG_MOD(extack,
> + "tx-usecs must stay below half the tx_timeout watchdog");
> + return -ERANGE;
> + }
> +
> + /* Paired with READ_ONCE in veth_xdp_rcv(). */
> + WRITE_ONCE(priv->tx_coal_usecs, ec->tx_coalesce_usecs);
> +
> + /* veth_xdp_rcv() reads each device's own value, so mirror it onto
> + * the peer to keep the pair symmetric: both directions coalesce
> + * with the same tx-usecs. Called under RTNL, rtnl_dereference() is safe.
> + */
> + peer = rtnl_dereference(priv->peer);
> + if (peer) {
> + struct veth_priv *peer_priv = netdev_priv(peer);
> +
> + WRITE_ONCE(peer_priv->tx_coal_usecs, ec->tx_coalesce_usecs);
> + }
> +
> + return 0;
> +}
> +
> static const struct ethtool_ops veth_ethtool_ops = {
> + .supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
> .get_drvinfo = veth_get_drvinfo,
> .get_link = ethtool_op_get_link,
> .get_strings = veth_get_strings,
> @@ -282,6 +340,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
> .get_ts_info = ethtool_op_get_ts_info,
> .get_channels = veth_get_channels,
> .set_channels = veth_set_channels,
> + .get_coalesce = veth_get_coalesce,
> + .set_coalesce = veth_set_coalesce,
> };
>
> /* general routines */
> @@ -969,13 +1029,54 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
> return NULL;
> }
>
> +static void veth_bql_maybe_complete(struct veth_bql_state *state,
> + struct netdev_queue *peer_txq,
> + u64 bql_flush_ns)
> +{
> + u64 current_time;
> +
> + /* There is no reason to complete with 0 and
> + * peer_txq could go away.
> + */
> + if (!state->n_bql || !peer_txq)
> + return;
> +
> + current_time = sched_clock();
> +
> + /* We complete if:
> + * 1. We reach bql_flush_ns.
> + * 2. We potentially have BQL starvation.
> + */
> + if (state->time + bql_flush_ns <= current_time ||
> + state->n_bql > peer_txq->dql.limit) {
Both Sashiko-Nipa and Sashiko-Gemini are right: this is missing an
#ifdef CONFIG_BQL. I am not sure what the best way to add it is.
For the struct, we could maybe do:
#ifdef CONFIG_BQL
struct veth_bql_state {
u64 time; /* sched_clock() when current coalescing window started */
uint n_bql; /* BQL completions batched in the current window */
};
#else
struct veth_bql_state {};
#endif
> + netdev_tx_completed_queue(peer_txq, state->n_bql,
> + state->n_bql * VETH_BQL_UNIT);
> + state->time = current_time;
> + state->n_bql = 0;
> + }
> +}
> +
> static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> struct veth_xdp_tx_bq *bq,
> struct veth_stats *stats,
> struct netdev_queue *peer_txq)
> {
> + struct veth_priv *priv = netdev_priv(rq->dev);
> + struct veth_bql_state *state = &rq->bql_state;
> int i, done = 0, n_xdpf = 0;
> void *xdpf[VETH_XDP_BATCH];
> + u64 bql_flush_ns;
> +
> + /* Mirrored to both peers; paired with WRITE_ONCE() in veth_set_coalesce */
> + bql_flush_ns = (u64)READ_ONCE(priv->tx_coal_usecs) * 1000;
> +
> + /* Clamp stored timestamp in case we migrated to a CPU with a behind
> + * sched_clock(); tries to reduce late BQL flushes.
> + */
> + state->time = min(state->time, sched_clock());
> +
> + /* Flush completions that timed out since the previous NAPI poll. */
> + veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);
>
> for (i = 0; i < budget; i++) {
> void *ptr = __ptr_ring_consume(&rq->xdp_ring);
> @@ -1000,12 +1101,11 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> }
> } else {
> /* ndo_start_xmit */
> - bool bql_charged = veth_ptr_is_bql(ptr);
> struct sk_buff *skb = veth_ptr_to_skb(ptr);
>
> + if (veth_ptr_is_bql(ptr))
> + state->n_bql++;
> stats->xdp_bytes += skb->len;
> - if (peer_txq && bql_charged)
> - netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>
> skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
> if (skb) {
> @@ -1015,6 +1115,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> napi_gro_receive(&rq->xdp_napi, skb);
> }
> }
> + veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);
> done++;
Sashiko-Nipa reports:
"If veth_xdp_rcv() finishes and returns a done count less than the budget,
NAPI will go to sleep in veth_poll(). Do we need to unconditionally flush
any stranded BQL completions in veth_poll() before sleeping?
If completions are left in rq->bql_state indefinitely across NAPI idle
periods, it might present an artificially massive delay to DQL. This could
cause DQL to mistakenly conclude the hardware is extremely slow and
aggressively shrink dql.limit to its minimum, crippling throughput on
subsequent bursts."
This is again the issue that I found to be non-problematic in [1]; it can
be seen as a BQL inflight > 0 when, for example, pktgen suddenly stops.
If we were to "unconditionally flush any stranded BQL completions in
veth_poll() before sleeping", we would *not* accumulate BQL completions
across NAPI polls, which is what we want to do.
Do you agree?
[1] https://lore.kernel.org/netdev/c8650d3a-e488-4279-b28f-549d766c23a1@tu-dortmund.de/
^ permalink raw reply
* [PATCH net-next] selftests/net/openvswitch: add ICMPv6 echo type match test
From: Minxi Hou @ 2026-06-13 14:14 UTC (permalink / raw)
To: netdev
Cc: aconole, echaudro, i.maximets, davem, edumazet, kuba, pabeni,
horms, shuah, dev, linux-kselftest, Minxi Hou
Register OVS_KEY_ATTR_ICMPV6 in the flow key parser so that
icmpv6(type=...) can be used in flow specifications. Without this
registration the parser silently drops the token and the kernel
rejects the flow with EINVAL because the expected ICMPv6 key
attribute is missing.
While here, add convert_int() to the ovs_key_ipv6 and ovs_key_icmp
fields_map entries so that specifying a field value produces the
correct wildcard mask (0xff for bytes, 0xffffffff for the label)
instead of using the value itself as the mask. The ipv4 counterpart
already does this via convert_int(); the ipv6 and icmp classes were
simply missing the fifth tuple element. Existing callers that pass
empty parentheses are unaffected because convert_int("") returns
(0, 0).
Add test_icmpv6 exercising the ICMPv6 echo flow key. The test uses
static neighbour entries to bypass NDP, then verifies in three steps:
install icmpv6(type=128) and icmpv6(type=129) flows and confirm ping
works, remove the flows and confirm ping fails, reinstall and confirm
recovery.
Signed-off-by: Minxi Hou <houminxi@gmail.com>
---
.../selftests/net/openvswitch/openvswitch.sh | 63 +++++++++++++++++++
.../selftests/net/openvswitch/ovs-dpctl.py | 26 +++++---
2 files changed, 82 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/net/openvswitch/openvswitch.sh b/tools/testing/selftests/net/openvswitch/openvswitch.sh
index d533decca5c1..8923224fa88e 100755
--- a/tools/testing/selftests/net/openvswitch/openvswitch.sh
+++ b/tools/testing/selftests/net/openvswitch/openvswitch.sh
@@ -31,6 +31,7 @@ tests="
pop_vlan vlan: POP_VLAN action strips tag
dec_ttl ttl: dec_ttl decrements IP TTL
flow_set flow-set: Flow modify
+ icmpv6 icmpv6: ICMPv6 echo type match
psample psample: Sampling packets with psample"
info() {
@@ -377,6 +378,68 @@ test_flow_set() {
return 0
}
+test_icmpv6() {
+ sbx_add "test_icmpv6" || return $?
+ ovs_add_dp "test_icmpv6" icmpv6 || return 1
+
+ info "create namespaces"
+ for ns in client server; do
+ ovs_add_netns_and_veths "test_icmpv6" "icmpv6" \
+ "$ns" "${ns:0:1}0" "${ns:0:1}1" || return 1
+ done
+
+ ip netns exec client ip addr add fd00::1/64 dev c1 nodad
+ ip netns exec client ip link set c1 up
+ ip netns exec server ip addr add fd00::2/64 dev s1 nodad
+ ip netns exec server ip link set s1 up
+
+ local cl_mac sl_mac
+ cl_mac=$(ip netns exec client \
+ ip link show c1 | awk '/link\/ether/ {print $2}')
+ [ -z "$cl_mac" ] && \
+ { info "failed to get c1 hwaddr"; return 1; }
+ sl_mac=$(ip netns exec server \
+ ip link show s1 | awk '/link\/ether/ {print $2}')
+ [ -z "$sl_mac" ] && \
+ { info "failed to get s1 hwaddr"; return 1; }
+ ip netns exec client \
+ ip -6 neigh add fd00::2 lladdr "$sl_mac" dev c1
+ ip netns exec server \
+ ip -6 neigh add fd00::1 lladdr "$cl_mac" dev s1
+
+ ovs_add_flow "test_icmpv6" icmpv6 \
+ 'in_port(1),eth(),eth_type(0x86dd),ipv6(proto=58),icmpv6(type=128)' \
+ '2' || return 1
+ ovs_add_flow "test_icmpv6" icmpv6 \
+ 'in_port(2),eth(),eth_type(0x86dd),ipv6(proto=58),icmpv6(type=129)' \
+ '1' || return 1
+
+ info "verify ICMPv6 echo with type-specific flows"
+ ovs_sbx "test_icmpv6" ip netns exec client \
+ ping -6 -c 1 -W 2 fd00::2 || return 1
+
+ ovs_del_flows "test_icmpv6" icmpv6
+
+ info "verify ping fails without echo flows"
+ ovs_sbx "test_icmpv6" ip netns exec client \
+ ping -6 -c 1 -W 2 fd00::2 >/dev/null 2>&1 \
+ && { info "FAIL: ping should fail without flows"
+ return 1; }
+
+ ovs_add_flow "test_icmpv6" icmpv6 \
+ 'in_port(1),eth(),eth_type(0x86dd),ipv6(proto=58),icmpv6(type=128)' \
+ '2' || return 1
+ ovs_add_flow "test_icmpv6" icmpv6 \
+ 'in_port(2),eth(),eth_type(0x86dd),ipv6(proto=58),icmpv6(type=129)' \
+ '1' || return 1
+
+ info "verify connectivity restored"
+ ovs_sbx "test_icmpv6" ip netns exec client \
+ ping -6 -c 1 -W 2 fd00::2 || return 1
+
+ return 0
+}
+
# psample test
# - use psample to observe packets
test_psample() {
diff --git a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
index e1ecfad2c03e..049791b2573b 100644
--- a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
+++ b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
@@ -1255,11 +1255,16 @@ class ovskey(nla):
lambda x: ipaddress.IPv6Address(x).packed if x else 0,
convert_ipv6,
),
- ("label", "label", "%d", lambda x: int(x) if x else 0),
- ("proto", "proto", "%d", lambda x: int(x) if x else 0),
- ("tclass", "tclass", "%d", lambda x: int(x) if x else 0),
- ("hlimit", "hlimit", "%d", lambda x: int(x) if x else 0),
- ("frag", "frag", "%d", lambda x: int(x) if x else 0),
+ ("label", "label", "%d", lambda x: int(x) if x else 0,
+ convert_int(32)),
+ ("proto", "proto", "%d", lambda x: int(x) if x else 0,
+ convert_int(8)),
+ ("tclass", "tclass", "%d", lambda x: int(x) if x else 0,
+ convert_int(8)),
+ ("hlimit", "hlimit", "%d", lambda x: int(x) if x else 0,
+ convert_int(8)),
+ ("frag", "frag", "%d", lambda x: int(x) if x else 0,
+ convert_int(8)),
)
def __init__(
@@ -1344,8 +1349,10 @@ class ovskey(nla):
)
fields_map = (
- ("type", "type", "%d", lambda x: int(x) if x else 0),
- ("code", "code", "%d", lambda x: int(x) if x else 0),
+ ("type", "type", "%d", lambda x: int(x) if x else 0,
+ convert_int(8)),
+ ("code", "code", "%d", lambda x: int(x) if x else 0,
+ convert_int(8)),
)
def __init__(
@@ -1982,6 +1989,11 @@ class ovskey(nla):
"icmp",
ovskey.ovs_key_icmp,
),
+ (
+ "OVS_KEY_ATTR_ICMPV6",
+ "icmpv6",
+ ovskey.ovs_key_icmpv6,
+ ),
(
"OVS_KEY_ATTR_TCP_FLAGS",
"tcp_flags",
--
2.54.0
^ permalink raw reply related
* Re: [PATCH net-next v2 1/2] netdev: expose io_uring rx_page_order order via netlink
From: Dragos Tatulea @ 2026-06-13 14:09 UTC (permalink / raw)
To: Pavel Begunkov, Donald Hunter, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, Andrew Lunn, Jens Axboe
Cc: Yael Chemla, Tariq Toukan, netdev, linux-kernel, io-uring
In-Reply-To: <d0401fab-61c5-43e7-93ae-d4757433eb7a@gmail.com>
On 13.06.26 11:53, Pavel Begunkov wrote:
> On 6/12/26 22:17, Dragos Tatulea wrote:
>> This adds observability for the io_uring zcrx rx-buf-len configuration.
>
> It might be nicer to look it up in the queue, e.g. rxq->mp_params,
> and make it a queue attribute instead of zcrx specific one. In either
> case, no objections.
>
In io_pp_nl_fill() or in page_pool_nl_fill() as it was done in v1 for order?
Thanks,
Dragos
^ permalink raw reply