* Re: [PATCH 15/22] arch: vdso: consolidate gettime prototypes
From: Vincenzo Frascino @ 2023-11-10 10:16 UTC (permalink / raw)
To: Arnd Bergmann, Andrew Morton, linux-kernel, Masahiro Yamada,
linux-kbuild
Cc: Arnd Bergmann, Matt Turner, Vineet Gupta, Russell King,
Catalin Marinas, Will Deacon, Steven Rostedt, Masami Hiramatsu,
Mark Rutland, Guo Ren, Peter Zijlstra, Ard Biesheuvel,
Huacai Chen, Greg Ungerer, Michal Simek, Thomas Bogendoerfer,
Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Geoff Levand, Palmer Dabbelt, Heiko Carstens,
John Paul Adrian Glaubitz, David S. Miller, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, x86, Helge Deller, Sudip Mukherjee,
Greg Kroah-Hartman, Timur Tabi, Kent Overstreet, David Woodhouse,
Naveen N. Rao, Anil S Keshavamurthy, Kees Cook, Juri Lelli,
Vincent Guittot, Nathan Chancellor, Nick Desaulniers,
Nicolas Schier, Al Viro, Uwe Kleine-König, linux-alpha,
linux-snps-arc, linux-arm-kernel, linux-trace-kernel, linux-csky,
loongarch, linux-m68k, linux-mips, linuxppc-dev, linux-riscv,
linux-s390, linux-sh, sparclinux, netdev, linux-parisc, linux-usb,
linux-fbdev, dri-devel, linux-bcachefs, linux-mtd
In-Reply-To: <20231108125843.3806765-16-arnd@kernel.org>
Hi Arnd,
On 11/8/23 12:58, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
>
> The VDSO functions are defined as globals in the kernel sources but intended
> to be called from userspace, so there is no need to declare them in a kernel
> side header.
>
> Without a prototype, this now causes warnings such as
>
> arch/mips/vdso/vgettimeofday.c:14:5: error: no previous prototype for '__vdso_clock_gettime' [-Werror=missing-prototypes]
> arch/mips/vdso/vgettimeofday.c:28:5: error: no previous prototype for '__vdso_gettimeofday' [-Werror=missing-prototypes]
> arch/mips/vdso/vgettimeofday.c:36:5: error: no previous prototype for '__vdso_clock_getres' [-Werror=missing-prototypes]
> arch/mips/vdso/vgettimeofday.c:42:5: error: no previous prototype for '__vdso_clock_gettime64' [-Werror=missing-prototypes]
> arch/sparc/vdso/vclock_gettime.c:254:1: error: no previous prototype for '__vdso_clock_gettime' [-Werror=missing-prototypes]
> arch/sparc/vdso/vclock_gettime.c:282:1: error: no previous prototype for '__vdso_clock_gettime_stick' [-Werror=missing-prototypes]
> arch/sparc/vdso/vclock_gettime.c:307:1: error: no previous prototype for '__vdso_gettimeofday' [-Werror=missing-prototypes]
> arch/sparc/vdso/vclock_gettime.c:343:1: error: no previous prototype for '__vdso_gettimeofday_stick' [-Werror=missing-prototypes]
>
> Most architectures have already added workarounds for these by adding
> declarations somewhere, but since these are all compatible, we should
> really just have one copy, with an #ifdef check for the 32-bit vs
> 64-bit variant and use that everywhere.
>
I agree, it is a good idea to have a single header for this purpose.
> Unfortunately, the sparc version is currently incompatible since
> that never added support for __vdso_clock_gettime64() in 32-bit
> userland. For the moment, I'm leaving this one out, as I can't
> easily test it and it requires a larger rework.
>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> ---
> arch/arm/include/asm/vdso.h | 5 -----
> arch/arm/vdso/vgettimeofday.c | 1 +
> arch/arm64/kernel/vdso32/vgettimeofday.c | 1 +
> arch/csky/kernel/vdso/vgettimeofday.c | 11 +----------
> arch/loongarch/vdso/vgettimeofday.c | 7 +------
> arch/mips/vdso/vgettimeofday.c | 1 +
> arch/riscv/kernel/vdso/vgettimeofday.c | 7 +------
> arch/x86/entry/vdso/vclock_gettime.c | 10 +---------
> arch/x86/include/asm/vdso/gettimeofday.h | 2 --
> arch/x86/um/vdso/um_vdso.c | 1 +
> include/vdso/gettime.h | 23 +++++++++++++++++++++++
> 11 files changed, 31 insertions(+), 38 deletions(-)
> create mode 100644 include/vdso/gettime.h
>
> diff --git a/arch/arm/include/asm/vdso.h b/arch/arm/include/asm/vdso.h
> index 422c3afa806a..5b85889f82ee 100644
> --- a/arch/arm/include/asm/vdso.h
> +++ b/arch/arm/include/asm/vdso.h
> @@ -24,11 +24,6 @@ static inline void arm_install_vdso(struct mm_struct *mm, unsigned long addr)
>
> #endif /* CONFIG_VDSO */
>
> -int __vdso_clock_gettime(clockid_t clock, struct old_timespec32 *ts);
> -int __vdso_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts);
> -int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
> -int __vdso_clock_getres(clockid_t clock_id, struct old_timespec32 *res);
> -
> #endif /* __ASSEMBLY__ */
>
> #endif /* __KERNEL__ */
> diff --git a/arch/arm/vdso/vgettimeofday.c b/arch/arm/vdso/vgettimeofday.c
> index a003beacac76..3554aa35f1ba 100644
> --- a/arch/arm/vdso/vgettimeofday.c
> +++ b/arch/arm/vdso/vgettimeofday.c
> @@ -8,6 +8,7 @@
> #include <linux/types.h>
> #include <asm/vdso.h>
> #include <asm/unwind.h>
> +#include <vdso/gettime.h>
>
> int __vdso_clock_gettime(clockid_t clock,
> struct old_timespec32 *ts)
> diff --git a/arch/arm64/kernel/vdso32/vgettimeofday.c b/arch/arm64/kernel/vdso32/vgettimeofday.c
> index 5acff29c5991..e23c7f4ef26b 100644
> --- a/arch/arm64/kernel/vdso32/vgettimeofday.c
> +++ b/arch/arm64/kernel/vdso32/vgettimeofday.c
> @@ -5,6 +5,7 @@
> * Copyright (C) 2018 ARM Limited
> *
> */
> +#include <vdso/gettime.h>
>
> int __vdso_clock_gettime(clockid_t clock,
> struct old_timespec32 *ts)
> diff --git a/arch/csky/kernel/vdso/vgettimeofday.c b/arch/csky/kernel/vdso/vgettimeofday.c
> index c4831145eed5..55af30e83752 100644
> --- a/arch/csky/kernel/vdso/vgettimeofday.c
> +++ b/arch/csky/kernel/vdso/vgettimeofday.c
> @@ -2,36 +2,27 @@
>
> #include <linux/time.h>
> #include <linux/types.h>
> +#include <vdso/gettime.h>
>
> extern
> -int __vdso_clock_gettime(clockid_t clock,
> - struct old_timespec32 *ts);
> int __vdso_clock_gettime(clockid_t clock,
> struct old_timespec32 *ts)
> {
> return __cvdso_clock_gettime32(clock, ts);
> }
>
> -int __vdso_clock_gettime64(clockid_t clock,
> - struct __kernel_timespec *ts);
> int __vdso_clock_gettime64(clockid_t clock,
> struct __kernel_timespec *ts)
> {
> return __cvdso_clock_gettime(clock, ts);
> }
>
> -extern
> -int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
> - struct timezone *tz);
> int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
> struct timezone *tz)
> {
> return __cvdso_gettimeofday(tv, tz);
> }
>
> -extern
> -int __vdso_clock_getres(clockid_t clock_id,
> - struct old_timespec32 *res);
> int __vdso_clock_getres(clockid_t clock_id,
> struct old_timespec32 *res)
> {
> diff --git a/arch/loongarch/vdso/vgettimeofday.c b/arch/loongarch/vdso/vgettimeofday.c
> index 8f22863bd7ea..0885c1f3a89d 100644
> --- a/arch/loongarch/vdso/vgettimeofday.c
> +++ b/arch/loongarch/vdso/vgettimeofday.c
> @@ -5,23 +5,18 @@
> * Copyright (C) 2020-2022 Loongson Technology Corporation Limited
> */
> #include <linux/types.h>
> +#include <vdso/gettime.h>
>
> -extern
> -int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
> int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
> {
> return __cvdso_clock_gettime(clock, ts);
> }
>
> -extern
> -int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
> int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
> {
> return __cvdso_gettimeofday(tv, tz);
> }
>
> -extern
> -int __vdso_clock_getres(clockid_t clock_id, struct __kernel_timespec *res);
> int __vdso_clock_getres(clockid_t clock_id, struct __kernel_timespec *res)
> {
> return __cvdso_clock_getres(clock_id, res);
> diff --git a/arch/mips/vdso/vgettimeofday.c b/arch/mips/vdso/vgettimeofday.c
> index 6b83b6376a4b..604afea3f336 100644
> --- a/arch/mips/vdso/vgettimeofday.c
> +++ b/arch/mips/vdso/vgettimeofday.c
> @@ -9,6 +9,7 @@
> */
> #include <linux/time.h>
> #include <linux/types.h>
> +#include <vdso/gettime.h>
>
> #if _MIPS_SIM != _MIPS_SIM_ABI64
> int __vdso_clock_gettime(clockid_t clock,
> diff --git a/arch/riscv/kernel/vdso/vgettimeofday.c b/arch/riscv/kernel/vdso/vgettimeofday.c
> index cc0d80699c31..b35057802584 100644
> --- a/arch/riscv/kernel/vdso/vgettimeofday.c
> +++ b/arch/riscv/kernel/vdso/vgettimeofday.c
> @@ -8,23 +8,18 @@
>
> #include <linux/time.h>
> #include <linux/types.h>
> +#include <vdso/gettime.h>
>
> -extern
> -int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
> int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
> {
> return __cvdso_clock_gettime(clock, ts);
> }
>
> -extern
> -int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
> int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
> {
> return __cvdso_gettimeofday(tv, tz);
> }
>
> -extern
> -int __vdso_clock_getres(clockid_t clock_id, struct __kernel_timespec *res);
> int __vdso_clock_getres(clockid_t clock_id, struct __kernel_timespec *res)
> {
> return __cvdso_clock_getres(clock_id, res);
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
> index 7d70935b6758..0debc194bd78 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -11,12 +11,10 @@
> #include <linux/time.h>
> #include <linux/kernel.h>
> #include <linux/types.h>
> +#include <vdso/gettime.h>
>
> #include "../../../../lib/vdso/gettimeofday.c"
>
> -extern int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
> -extern __kernel_old_time_t __vdso_time(__kernel_old_time_t *t);
> -
> int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
> {
> return __cvdso_gettimeofday(tv, tz);
> @@ -35,9 +33,6 @@ __kernel_old_time_t time(__kernel_old_time_t *t) __attribute__((weak, alias("__v
>
> #if defined(CONFIG_X86_64) && !defined(BUILD_VDSO32_64)
> /* both 64-bit and x32 use these */
> -extern int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
> -extern int __vdso_clock_getres(clockid_t clock, struct __kernel_timespec *res);
> -
> int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
> {
> return __cvdso_clock_gettime(clock, ts);
> @@ -56,9 +51,6 @@ int clock_getres(clockid_t, struct __kernel_timespec *)
>
> #else
> /* i386 only */
> -extern int __vdso_clock_gettime(clockid_t clock, struct old_timespec32 *ts);
> -extern int __vdso_clock_getres(clockid_t clock, struct old_timespec32 *res);
> -
> int __vdso_clock_gettime(clockid_t clock, struct old_timespec32 *ts)
> {
> return __cvdso_clock_gettime32(clock, ts);
> diff --git a/arch/x86/include/asm/vdso/gettimeofday.h b/arch/x86/include/asm/vdso/gettimeofday.h
> index c81858d903dc..a46edb0e0cf7 100644
> --- a/arch/x86/include/asm/vdso/gettimeofday.h
> +++ b/arch/x86/include/asm/vdso/gettimeofday.h
> @@ -337,8 +337,6 @@ u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
> }
> #define vdso_calc_delta vdso_calc_delta
>
> -int __vdso_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts);
> -
> #endif /* !__ASSEMBLY__ */
>
> #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
> diff --git a/arch/x86/um/vdso/um_vdso.c b/arch/x86/um/vdso/um_vdso.c
> index ff0f3b4b6c45..63768dd347ce 100644
> --- a/arch/x86/um/vdso/um_vdso.c
> +++ b/arch/x86/um/vdso/um_vdso.c
> @@ -12,6 +12,7 @@
> #include <linux/time.h>
> #include <linux/getcpu.h>
> #include <asm/unistd.h>
> +#include <vdso/gettime.h>
>
> int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
> {
> diff --git a/include/vdso/gettime.h b/include/vdso/gettime.h
> new file mode 100644
> index 000000000000..c50d152e7b3e
> --- /dev/null
> +++ b/include/vdso/gettime.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _VDSO_GETTIME_H
> +#define _VDSO_GETTIME_H
> +
> +#include <linux/types.h>
> +
> +struct __kernel_timespec;
> +struct timezone;
> +
> +#if !defined(CONFIG_64BIT) || defined(BUILD_VDSO32_64)
> +struct old_timespec32;
> +int __vdso_clock_getres(clockid_t clock, struct old_timespec32 *res);
> +int __vdso_clock_gettime(clockid_t clock, struct old_timespec32 *ts);
> +#else
> +int __vdso_clock_getres(clockid_t clock, struct __kernel_timespec *res);
> +int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
> +#endif
> +
> +__kernel_old_time_t __vdso_time(__kernel_old_time_t *t);
> +int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
> +int __vdso_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts);
> +
> +#endif
--
Regards,
Vincenzo
^ permalink raw reply
* [RFC PATCHv3 net-next 10/10] docs: bridge: add small features
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
In-Reply-To: <20231110101548.1900519-1-liuhangbin@gmail.com>
Add some small features in the "Others" part of bridge document.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
Documentation/networking/bridge.rst | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
index 7f63d21c9f46..8adf99774d59 100644
--- a/Documentation/networking/bridge.rst
+++ b/Documentation/networking/bridge.rst
@@ -273,6 +273,20 @@ So, br_netfilter is only needed if users, for some reason, need to use
ip(6)tables to filter packets forwarded by the bridge, or NAT bridged
traffic. For pure link layer filtering, this module isn't needed.
+Other Features
+==============
+
+The Linux bridge also supports `IEEE 802.11 Proxy ARP
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=958501163ddd6ea22a98f94fa0e7ce6d4734e5c4>`_,
+`Media Redundancy Protocol (MRP)
+<https://lore.kernel.org/netdev/20200426132208.3232-1-horatiu.vultur@microchip.com/>`_,
+`Media Redundancy Protocol (MRP) LC mode
+<https://lore.kernel.org/r/20201124082525.273820-1-horatiu.vultur@microchip.com>`_,
+`IEEE 802.1X port authentication
+<https://lore.kernel.org/netdev/20220218155148.2329797-1-schultz.hans+netdev@gmail.com/>`_,
+and `MAC Authentication Bypass (MAB)
+<https://lore.kernel.org/netdev/20221101193922.2125323-2-idosch@nvidia.com/>`_.
+
FAQ
===
--
2.41.0
^ permalink raw reply related
* [RFC PATCHv3 net-next 09/10] docs: bridge: add netfilter doc
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
In-Reply-To: <20231110101548.1900519-1-liuhangbin@gmail.com>
Add netfilter part for bridge document.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
Documentation/networking/bridge.rst | 36 +++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
index 4ff1e4ab6dd5..7f63d21c9f46 100644
--- a/Documentation/networking/bridge.rst
+++ b/Documentation/networking/bridge.rst
@@ -237,6 +237,42 @@ kernel.
Please see the :ref:`switchdev` document for more deatils.
+Netfilter
+=========
+
+The bridge netfilter module is a legacy feature that allows to filter bridged
+packets with iptables and ip6tables. Its use is discouraged. Users should
+consider using nftables for packet filtering.
+
+The older ebtables tool is more feature-limited compared to nftables, but
+just like nftables it doesn't need this module either to function.
+
+The br_netfilter module intercepts packets entering the bridge, performs
+minimal sanity tests on ipv4 and ipv6 packets and then pretends that
+these packets are being routed, not bridged. br_netfilter then calls
+the ip and ipv6 netfilter hooks from the bridge layer, i.e. ip(6)tables
+rulesets will also see these packets.
+
+br_netfilter is also the reason for the iptables *physdev* match:
+This match is the only way to reliably tell routed and bridged packets
+apart in an iptables ruleset.
+
+Note that ebtables and nftables will work fine without the br_netfilter module.
+iptables/ip6tables/arptables do not work for bridged traffic because they
+plug in the routing stack. nftables rules in ip/ip6/inet/arp families won't
+see traffic that is forwarded by a bridge either, but thats very much how it
+should be.
+
+Historically the feature set of ebtables was very limited (it still is),
+this module was added to pretend packets are routed and invoke the ipv4/ipv6
+netfilter hooks from the bridge so users had access to the more feature-rich
+iptables matching capabilities (including conntrack). nftables doesn't have
+this limitation, pretty much all features work regardless of the protocol family.
+
+So, br_netfilter is only needed if users, for some reason, need to use
+ip(6)tables to filter packets forwarded by the bridge, or NAT bridged
+traffic. For pure link layer filtering, this module isn't needed.
+
FAQ
===
--
2.41.0
^ permalink raw reply related
* [RFC PATCHv3 net-next 08/10] docs: bridge: add switchdev doc
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
In-Reply-To: <20231110101548.1900519-1-liuhangbin@gmail.com>
Add switchdev part for bridge document.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
Documentation/networking/bridge.rst | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
index 1fe645c9543d..4ff1e4ab6dd5 100644
--- a/Documentation/networking/bridge.rst
+++ b/Documentation/networking/bridge.rst
@@ -219,6 +219,24 @@ which is disabled by default but can be enabled. And `Multicast Router Discovery
<https://lore.kernel.org/netdev/20190121062628.2710-1-linus.luessing@c0d3.blue/>`_,
which help identify the location of multicast routers.
+Switchdev
+=========
+
+Linux Bridge Switchdev is a feature in the Linux kernel that extends the
+capabilities of the traditional Linux bridge to work more efficiently with
+hardware switches that support switchdev. With Linux Bridge Switchdev, certain
+networking functions like forwarding, filtering, and learning of Ethernet
+frames can be offloaded to a hardware switch. This offloading reduces the
+burden on the Linux kernel and CPU, leading to improved network performance
+and lower latency.
+
+To use Linux Bridge Switchdev, you need hardware switches that support the
+switchdev interface. This means that the switch hardware needs to have the
+necessary drivers and functionality to work in conjunction with the Linux
+kernel.
+
+Please see the :ref:`switchdev` document for more deatils.
+
FAQ
===
--
2.41.0
^ permalink raw reply related
* [RFC PATCHv3 net-next 07/10] docs: bridge: add multicast doc
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
In-Reply-To: <20231110101548.1900519-1-liuhangbin@gmail.com>
Add multicast part for bridge document.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
Documentation/networking/bridge.rst | 55 +++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
index 88dfc6eb0919..1fe645c9543d 100644
--- a/Documentation/networking/bridge.rst
+++ b/Documentation/networking/bridge.rst
@@ -164,6 +164,61 @@ on bridge is disabled by default. After enabling VLAN
filter on bridge, the bridge can handle VLAN-tagged frames and forward them
to the appropriate destinations.
+Multicast
+=========
+
+The Linux bridge driver has multicast support allowing it to process Internet
+Group Management Protocol (IGMP) or Multicast Listener Discovery (MLD)
+messages, and to efficiently forward multicast data packets. The bridge
+driver support IGMPv2/IGMPv3 and MLDv1/MLDv2.
+
+Multicast snooping
+------------------
+
+Multicast snooping is a networking technology that allows network switches
+to intelligently manage multicast traffic within a local area network (LAN).
+
+The switch maintains a multicast group table, which records the association
+between multicast group addresses and the ports where hosts have joined these
+groups. The group table is dynamically updated based on the IGMP/MLD messages
+received. With the multicast group information gathered through snooping, the
+switch optimizes the forwarding of multicast traffic. Instead of blindly
+broadcasting the multicast traffic to all ports, it sends the multicast
+traffic based on the destination MAC address only to ports which have joined
+the respective destination multicast group.
+
+When created, the Linux bridge devices have multicast snooping enabled by
+default. It maintains a Multicast forwarding database (MDB) which keeps track
+of port and group relationships.
+
+IGMPv3/MLDv2 ETH support
+------------------------
+
+The Linux bridge supports IGMPv3/MLDv2 ETH (Explicit Tracking of Hosts), which
+was added by `474ddb37fa3a ("net: bridge: multicast: add EHT allow/block handling")
+<https://lore.kernel.org/netdev/20210120145203.1109140-1-razor@blackwall.org/>`_
+
+The explicit tracking of hosts enables the device to keep track of each
+individual host that is joined to a particular group or channel. The main
+benefit of the explicit tracking of hosts in IGMP is to allow minimal leave
+latencies when a host leaves a multicast group or channel.
+
+The length of time between a host wanting to leave and a device stopping
+traffic forwarding is called the IGMP leave latency. A device configured
+with IGMPv3 or MLDv2 and explicit tracking can immediately stop forwarding
+traffic if the last host to request to receive traffic from the device
+indicates that it no longer wants to receive traffic. The leave latency
+is thus bound only by the packet transmission latencies in the multiaccess
+network and the processing time in the device.
+
+Other multicast features
+------------------------
+The Linux bridge also supports `per-VLAN multicast snooping
+<https://lore.kernel.org/netdev/20210719170637.435541-1-razor@blackwall.org/>`_,
+which is disabled by default but can be enabled. And `Multicast Router Discovery
+<https://lore.kernel.org/netdev/20190121062628.2710-1-linus.luessing@c0d3.blue/>`_,
+which help identify the location of multicast routers.
+
FAQ
===
--
2.41.0
^ permalink raw reply related
* [RFC PATCHv3 net-next 06/10] docs: bridge: add VLAN doc
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
In-Reply-To: <20231110101548.1900519-1-liuhangbin@gmail.com>
Add VLAN part for bridge document.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
Documentation/networking/bridge.rst | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
index e168f86ddd82..88dfc6eb0919 100644
--- a/Documentation/networking/bridge.rst
+++ b/Documentation/networking/bridge.rst
@@ -135,6 +135,35 @@ Proper configuration of STP parameters, such as the bridge priority, can
influence which bridge becomes the Root Bridge. Careful configuration can
optimize network performance and path selection.
+VLAN
+====
+
+A LAN (Local Area Network) is a network that covers a small geographic area,
+typically within a single building or a campus. LANs are used to connect
+computers, servers, printers, and other networked devices within a localized
+area. LANs can be wired (using Ethernet cables) or wireless (using Wi-Fi).
+
+A VLAN (Virtual Local Area Network) is a logical segmentation of a physical
+network into multiple isolated broadcast domains. VLANs are used to divide
+a single physical LAN into multiple virtual LANs, allowing different groups of
+devices to communicate as if they were on separate physical networks.
+
+Typically there are two VLAN implementations, IEEE 802.1Q and IEEE 802.1ad
+(also known as QinQ). IEEE 802.1Q is a standard for VLAN tagging in Ethernet
+networks. It allows network administrators to create logical VLANs on a
+physical network and tag Ethernet frames with VLAN information, which is
+called *VLAN-tagged frames*. IEEE 802.1ad, commonly known as QinQ or Double
+VLAN, is an extension of the IEEE 802.1Q standard. QinQ allows for the
+stacking of multiple VLAN tags within a single Ethernet frame. The Linux
+bridge supports both the IEEE 802.1Q and `802.1AD
+<https://lore.kernel.org/netdev/1402401565-15423-1-git-send-email-makita.toshiaki@lab.ntt.co.jp/>`_
+protocol for VLAN tagging.
+
+The `VLAN filtering <https://lore.kernel.org/netdev/1360792820-14116-1-git-send-email-vyasevic@redhat.com/>`_
+on bridge is disabled by default. After enabling VLAN
+filter on bridge, the bridge can handle VLAN-tagged frames and forward them
+to the appropriate destinations.
+
FAQ
===
--
2.41.0
^ permalink raw reply related
* [RFC PATCHv3 net-next 05/10] docs: bridge: add STP doc
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
In-Reply-To: <20231110101548.1900519-1-liuhangbin@gmail.com>
Add STP part for bridge document.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
Documentation/networking/bridge.rst | 85 +++++++++++++++++++++++++++++
1 file changed, 85 insertions(+)
diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
index d06c51960f45..e168f86ddd82 100644
--- a/Documentation/networking/bridge.rst
+++ b/Documentation/networking/bridge.rst
@@ -50,6 +50,91 @@ options are added.
.. kernel-doc:: net/bridge/br_sysfs_br.c
:doc: The sysfs bridge attrs
+STP
+===
+
+The STP (Spanning Tree Protocol) implementation in the Linux bridge driver
+is a critical feature that helps prevent loops and broadcast storms in
+Ethernet networks by identifying and disabling redundant links. In a Linux
+bridge context, STP is crucial for network stability and availability.
+
+STP is a Layer 2 protocol that operates at the Data Link Layer of the OSI
+model. It was originally developed as IEEE 802.1D and has since evolved into
+multiple versions, including Rapid Spanning Tree Protocol (RSTP) and
+`Multiple Spanning Tree Protocol (MSTP)
+<https://lore.kernel.org/netdev/20220316150857.2442916-1-tobias@waldekranz.com/>`_.
+
+Bridge Ports and STP States
+---------------------------
+
+In the context of STP, bridge ports can be in one of the following states:
+ * Blocking: The port is disabled for data traffic and only listens for
+ BPDUs (Bridge Protocol Data Units) from other devices to determine the
+ network topology.
+ * Listening: The port begins to participate in the STP process and listens
+ for BPDUs.
+ * Learning: The port continues to listen for BPDUs and begins to learn MAC
+ addresses from incoming frames but does not forward data frames.
+ * Forwarding: The port is fully operational and forwards both BPDUs and
+ data frames.
+ * Disabled: The port is administratively disabled and does not participate
+ in the STP process. The data frames forwarding are also disabled.
+
+Root Bridge and Convergence
+---------------------------
+
+In the context of networking and Ethernet bridging in Linux, the root bridge
+is a designated switch in a bridged network that serves as a reference point
+for the spanning tree algorithm to create a loop-free topology.
+
+Here's how the STP works and root bridge is chosen:
+ 1. Bridge Priority: Each bridge running a spanning tree protocol, has a
+ configurable Bridge Priority value. The lower the value, the higher the
+ priority. By default, the Bridge Priority is set to a standard value
+ (e.g., 32768).
+ 2. Bridge ID: The Bridge ID is composed of two components: Bridge Priority
+ and the MAC address of the bridge. It uniquely identifies each bridge
+ in the network. The Bridge ID is used to compare the priorities of
+ different bridges.
+ 3. Bridge Election: When the network starts, all bridges initially assume
+ that they are the root bridge. They start advertising Bridge Protocol
+ Data Units (BPDU) to their neighbors, containing their Bridge ID and
+ other information.
+ 4. BPDU Comparison: Bridges exchange BPDUs to determine the root bridge.
+ Each bridge examines the received BPDUs, including the Bridge Priority
+ and Bridge ID, to determine if it should adjust its own priorities.
+ The bridge with the lowest Bridge ID will become the root bridge.
+ 5. Root Bridge Announcement: Once the root bridge is determined, it sends
+ BPDUs with information about the root bridge to all other bridges in the
+ network. This information is used by other bridges to calculate the
+ shortest path to the root bridge and, in doing so, create a loop-free
+ topology.
+ 6. Forwarding Ports: After the root bridge is selected and the spanning tree
+ topology is established, each bridge determines which of its ports should
+ be in the forwarding state (used for data traffic) and which should be in
+ the blocking state (used to prevent loops). The root bridge's ports are
+ all in the forwarding state. while other bridges have some ports in the
+ blocking state to avoid loops.
+ 7. Root Ports: After the root bridge is selected and the spanning tree
+ topology is established, each non-root bridge processes incoming
+ BPDUs and determines which of its ports provides the shortest path to the
+ root bridge based on the information in the received BPDUs. This port is
+ designated as the root port. And it is in the Forwarding state, allowing
+ it to actively forward network traffic.
+ 8. Designated ports: A designated port is the port through which the non-root
+ bridge will forward traffic towards the designated segment. Designated ports
+ are placed in the Forwarding state. All other ports on the non-root
+ bridge that are not designated for specific segments are placed in the
+ Blocking state to prevent network loops.
+
+STP ensures network convergence by calculating the shortest path and disabling
+redundant links. When network topology changes occur (e.g., a link failure),
+STP recalculates the network topology to restore connectivity while avoiding loops.
+
+Proper configuration of STP parameters, such as the bridge priority, can
+influence which bridge becomes the Root Bridge. Careful configuration can
+optimize network performance and path selection.
+
FAQ
===
--
2.41.0
^ permalink raw reply related
* [RFC PATCHv3 net-next 04/10] docs: bridge: Add kAPI/uAPI fields
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
In-Reply-To: <20231110101548.1900519-1-liuhangbin@gmail.com>
The current bridge kernel doc is too old. It only pointed to the
linuxfoundation wiki page which lacks of the new features.
Here let's start the new bridge document and put all the bridge info
so new developers and users could catch up the last bridge status soon.
First add kAPI/uAPI and FAQ fields. These 2 fileds are only examples and
more APIs need to be added in future.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
Documentation/networking/bridge.rst | 83 +++++++++++++++++++++++++----
1 file changed, 73 insertions(+), 10 deletions(-)
diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
index c859f3c1636e..d06c51960f45 100644
--- a/Documentation/networking/bridge.rst
+++ b/Documentation/networking/bridge.rst
@@ -4,18 +4,81 @@
Ethernet Bridging
=================
-In order to use the Ethernet bridging functionality, you'll need the
-userspace tools.
+Introduction
+============
-Documentation for Linux bridging is on:
- https://wiki.linuxfoundation.org/networking/bridge
+A bridge is a way to connect multiple Ethernet segments together in a protocol
+independent way. Packets are forwarded based on Layer 2 destination Ethernet
+address, rather than IP address (like a router). Since forwarding is done
+at Layer 2, all Layer 3 protocols can pass through a bridge transparently.
-The bridge-utilities are maintained at:
- git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
+Bridge kAPI
+===========
-Additionally, the iproute2 utilities can be used to configure
-bridge devices.
+Here are some core structures of bridge code.
-If you still have questions, don't hesitate to post to the mailing list
-(more info https://lists.linux-foundation.org/mailman/listinfo/bridge).
+.. kernel-doc:: net/bridge/br_private.h
+ :identifiers: net_bridge_vlan
+Bridge uAPI
+===========
+
+Modern Linux bridge uAPI is accessed via Netlink interface. You can find
+below files where the bridge and bridge port netlink attributes are defined.
+
+Bridge netlink attributes
+-------------------------
+
+.. kernel-doc:: include/uapi/linux/if_link.h
+ :doc: The bridge enum defination
+
+Bridge port netlink attributes
+------------------------------
+
+.. kernel-doc:: include/uapi/linux/if_link.h
+ :doc: The bridge port enum defination
+
+Bridge sysfs
+------------
+
+All the sysfs parameters are also exported via the bridge netlink API.
+Here you can find the explanation based on the correspond netlink attributes.
+
+NOTE: the sysfs interface is deprecated and should not be extended if new
+options are added.
+
+.. kernel-doc:: net/bridge/br_sysfs_br.c
+ :doc: The sysfs bridge attrs
+
+FAQ
+===
+
+What does a bridge do?
+----------------------
+
+A bridge transparently forwards traffic between multiple network interfaces.
+In plain English this means that a bridge connects two or more physical
+Ethernet networks, to form one larger (logical) Ethernet network.
+
+Is it L3 protocol independent?
+------------------------------
+
+Yes. The bridge sees all frames, but it *uses* only L2 headers/information.
+As such, the bridging functionality is protocol independent, and there should
+be no trouble forwarding IPX, NetBEUI, IP, IPv6, etc.
+
+Contact Info
+============
+
+The code is currently maintained by Roopa Prabhu <roopa@nvidia.com> and
+Nikolay Aleksandrov <razor@blackwall.org>. Bridge bugs and enhancements
+are discussed on the linux-netdev mailing list netdev@vger.kernel.org and
+bridge@lists.linux-foundation.org.
+
+The list is open to anyone interested: http://vger.kernel.org/vger-lists.html#netdev
+
+External Links
+==============
+
+The old Documentation for Linux bridging is on:
+https://wiki.linuxfoundation.org/networking/bridge
--
2.41.0
^ permalink raw reply related
* [RFC PATCHv3 net-next 03/10] net: bridge: add document for bridge sysfs attribute
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
In-Reply-To: <20231110101548.1900519-1-liuhangbin@gmail.com>
Although the sysfs interface is deprecated and should not be extended
if new options are added. There are still users and admins use this
interface to config bridge options. It would help users to know what
the meaning of each field. Add correspond netlink enums (as we have
document for them) for bridge sysfs attributes, so we can use it in
Documentation/networking/bridge.rst.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
net/bridge/br_sysfs_br.c | 93 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 93 insertions(+)
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index ea733542244c..148ecf5aafc6 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -933,6 +933,99 @@ static ssize_t vlan_stats_per_port_store(struct device *d,
static DEVICE_ATTR_RW(vlan_stats_per_port);
#endif
+/**
+ * DOC: The sysfs bridge attrs
+ *
+ * @forward_delay: IFLA_BR_FORWARD_DELAY
+ *
+ * @hello_time: IFLA_BR_HELLO_TIME
+ *
+ * @max_age: IFLA_BR_MAX_AGE
+ *
+ * @ageing_time: IFLA_BR_AGEING_TIME
+ *
+ * @stp_state: IFLA_BR_STP_STATE
+ *
+ * @group_fwd_mask: IFLA_BR_GROUP_FWD_MASK
+ *
+ * @priority: IFLA_BR_PRIORITY
+ *
+ * @bridge_id: IFLA_BR_BRIDGE_ID
+ *
+ * @root_id: IFLA_BR_ROOT_ID
+ *
+ * @root_path_cost: IFLA_BR_ROOT_PATH_COST
+ *
+ * @root_port: IFLA_BR_ROOT_PORT
+ *
+ * @topology_change: IFLA_BR_TOPOLOGY_CHANGE
+ *
+ * @topology_change_detected: IFLA_BR_TOPOLOGY_CHANGE_DETECTED
+ *
+ * @hello_timer: IFLA_BR_HELLO_TIMER
+ *
+ * @tcn_timer: IFLA_BR_TCN_TIMER
+ *
+ * @topology_change_timer: IFLA_BR_TOPOLOGY_CHANGE_TIMER
+ *
+ * @gc_timer: IFLA_BR_GC_TIMER
+ *
+ * @group_addr: IFLA_BR_GROUP_ADDR
+ *
+ * @flush: IFLA_BR_FDB_FLUSH
+ *
+ * @no_linklocal_learn: BR_BOOLOPT_NO_LL_LEARN
+ *
+ * @multicast_router: IFLA_BR_MCAST_ROUTER
+ *
+ * @multicast_snooping: IFLA_BR_MCAST_SNOOPING
+ *
+ * @multicast_querier: IFLA_BR_MCAST_QUERIER
+ *
+ * @multicast_query_use_ifaddr: IFLA_BR_MCAST_QUERY_USE_IFADDR
+ *
+ * @hash_elasticity: IFLA_BR_MCAST_HASH_ELASTICITY
+ *
+ * @hash_max: IFLA_BR_MCAST_HASH_MAX
+ *
+ * @multicast_last_member_count: IFLA_BR_MCAST_LAST_MEMBER_CNT
+ *
+ * @multicast_startup_query_count: IFLA_BR_MCAST_STARTUP_QUERY_CNT
+ *
+ * @multicast_last_member_interval: IFLA_BR_MCAST_LAST_MEMBER_INTVL
+ *
+ * @multicast_membership_interval: IFLA_BR_MCAST_MEMBERSHIP_INTVL
+ *
+ * @multicast_querier_interval: IFLA_BR_MCAST_QUERIER_INTVL
+ *
+ * @multicast_query_interval: IFLA_BR_MCAST_QUERY_INTVL
+ *
+ * @multicast_query_response_interval: IFLA_BR_MCAST_QUERY_RESPONSE_INTVL
+ *
+ * @multicast_startup_query_interval: IFLA_BR_MCAST_STARTUP_QUERY_INTVL
+ *
+ * @multicast_stats_enabled: IFLA_BR_MCAST_STATS_ENABLED
+ *
+ * @multicast_igmp_version: IFLA_BR_MCAST_IGMP_VERSION
+ *
+ * @multicast_mld_version: IFLA_BR_MCAST_MLD_VERSION
+ *
+ * @nf_call_iptables: IFLA_BR_NF_CALL_IPTABLES
+ *
+ * @nf_call_ip6tables: IFLA_BR_NF_CALL_IP6TABLES
+ *
+ * @nf_call_arptables: IFLA_BR_NF_CALL_ARPTABLES
+ *
+ * @vlan_filtering: IFLA_BR_VLAN_FILTERING
+ *
+ * @vlan_protocol: IFLA_BR_VLAN_PROTOCOL
+ *
+ * @default_pvid: IFLA_BR_VLAN_DEFAULT_PVID
+ *
+ * @vlan_stats_enabled: IFLA_BR_VLAN_STATS_ENABLED
+ *
+ * @vlan_stats_per_port: IFLA_BR_VLAN_STATS_PER_PORT
+ */
static struct attribute *bridge_attrs[] = {
&dev_attr_forward_delay.attr,
&dev_attr_hello_time.attr,
--
2.41.0
^ permalink raw reply related
* [RFC PATCHv3 net-next 02/10] net: bridge: add document for IFLA_BRPORT enum
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
In-Reply-To: <20231110101548.1900519-1-liuhangbin@gmail.com>
Add document for IFLA_BRPORT enum so we can use it in
Documentation/networking/bridge.rst.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
include/uapi/linux/if_link.h | 227 +++++++++++++++++++++++++++++++++++
1 file changed, 227 insertions(+)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 32d6980b78d1..a196a6e4dafb 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -792,11 +792,238 @@ struct ifla_bridge_id {
__u8 addr[6]; /* ETH_ALEN */
};
+/**
+ * DOC: The bridge mode enum defination
+ *
+ * @BRIDGE_MODE_HAIRPIN
+ * Controls whether traffic may be send back out of the port on which it
+ * was received. This option is also called reflective relay mode, and is
+ * used to support basic VEPA (Virtual Ethernet Port Aggregator)
+ * capabilities. By default, this flag is turned off and the bridge will
+ * not forward traffic back out of the receiving port.
+ */
+
enum {
BRIDGE_MODE_UNSPEC,
BRIDGE_MODE_HAIRPIN,
};
+/**
+ * DOC: The bridge port enum defination
+ *
+ * @IFLA_BRPORT_STATE
+ * The operation state of the port. Except state 0 (disable STP or BPDU
+ * filter feature), this is primarily used by user space STP/RSTP
+ * implementation.
+ *
+ * * 0 - port is in STP *DISABLED* state. Make this port completely
+ * inactive for STP. This is also called BPDU filter and could be used
+ * to disable STP on an untrusted port, like a leaf virtual devices.
+ * The traffic forwarding is also stopped on this port.
+ * * 1 - port is in STP *LISTENING* state. Only valid if STP is enabled
+ * on the bridge. In this state the port listens for STP BPDUs and
+ * drops all other traffic frames.
+ * * 2 - port is in STP *LEARNING* state. Only valid if STP is enabled on
+ * the bridge. In this state the port will accept traffic only for the
+ * purpose of updating MAC address tables.
+ * * 3 - port is in STP *FORWARDING* state. Port is fully active.
+ * * 4 - port is in STP *BLOCKING* state. Only valid if STP is enabled on
+ * the bridge. This state is used during the STP election process.
+ * In this state, port will only process STP BPDUs.
+ *
+ * @IFLA_BRPORT_PRIORITY
+ * The STP port priority. The valid values are between 0 and 255.
+ *
+ * @IFLA_BRPORT_COST
+ * The STP path cost of the port. The valid values are between 1 and 65535.
+ *
+ * @IFLA_BRPORT_MODE
+ * Set the bridge port mode. See *BRIDGE_MODE_HAIRPIN* for more details.
+ *
+ * @IFLA_BRPORT_GUARD
+ * Controls whether STP BPDUs will be processed by the bridge port. By
+ * default, the flag is turned off allowed BPDU processing. Turning this
+ * flag on will disables the bridge port if a STP BPDU packet is received.
+ *
+ * If running Spanning Tree on bridge, hostile devices on the network may
+ * send BPDU on a port and cause network failure. Setting *guard on* will
+ * detect and stop this by disabling the port. The port will be restarted
+ * if link is brought down, or removed and reattached.
+ *
+ * @IFLA_BRPORT_PROTECT
+ * Controls whether a given port is allowed to become root port or not.
+ * Only used when STP is enabled on the bridge. By default the flag is off.
+ *
+ * This feature is also called root port guard. If BPDU is received from a
+ * leaf (edge) port, it should not be elected as root port. This could
+ * be used if using STP on a bridge and the downstream bridges are not fully
+ * trusted; this prevents a hostile guest from rerouting traffic.
+ *
+ * @IFLA_BRPORT_FAST_LEAVE
+ * This flag allows the bridge to immediately stop multicast traffic on a
+ * port that receives IGMP Leave message. It is only used with IGMP snooping
+ * is enabled on the bridge. By default the flag is off.
+ *
+ * @IFLA_BRPORT_LEARNING
+ * Controls whether a given port will learn MAC addresses from received
+ * traffic or not. If learning if off, the bridge will end up flooding any
+ * traffic for which it has no FDB entry. By default this flag is on.
+ *
+ * @IFLA_BRPORT_UNICAST_FLOOD
+ * Controls whether unicast traffic for which there is no FDB entry will
+ * be flooded towards this given port. By default this flag is on.
+ *
+ * @IFLA_BRPORT_PROXYARP
+ * Enable proxy ARP on this port.
+ *
+ * @IFLA_BRPORT_LEARNING_SYNC
+ * Controls whether a given port will sync MAC addresses learned on device
+ * port to bridge FDB.
+ *
+ * @IFLA_BRPORT_PROXYARP_WIFI
+ * Enable proxy ARP on this port which meets extended requirements by
+ * IEEE 802.11 and Hotspot 2.0 specifications.
+ *
+ * @IFLA_BRPORT_ROOT_ID
+ *
+ * @IFLA_BRPORT_BRIDGE_ID
+ *
+ * @IFLA_BRPORT_DESIGNATED_PORT
+ *
+ * @IFLA_BRPORT_DESIGNATED_COST
+ *
+ * @IFLA_BRPORT_ID
+ *
+ * @IFLA_BRPORT_NO
+ *
+ * @IFLA_BRPORT_TOPOLOGY_CHANGE_ACK
+ *
+ * @IFLA_BRPORT_CONFIG_PENDING
+ *
+ * @IFLA_BRPORT_MESSAGE_AGE_TIMER
+ *
+ * @IFLA_BRPORT_FORWARD_DELAY_TIMER
+ *
+ * @IFLA_BRPORT_HOLD_TIMER
+ *
+ * @IFLA_BRPORT_FLUSH
+ * Flush bridge ports' fdb dynamic entries.
+ *
+ * @IFLA_BRPORT_MULTICAST_ROUTER
+ * Configure the port's multicast router presence. A port with
+ * a multicast router will receive all multicast traffic.
+ * The valid values are:
+ *
+ * * 0 disable multicast routers on this port
+ * * 1 let the system detect the presence of routers (default)
+ * * 2 permanently enable multicast traffic forwarding on this port
+ * * 3 enable multicast routers temporarily on this port, not depending
+ * on incoming queries.
+ *
+ * @IFLA_BRPORT_PAD
+ *
+ * @IFLA_BRPORT_MCAST_FLOOD
+ * Controls whether a given port will flood multicast traffic for which
+ * there is no MDB entry. By default this flag is on.
+ *
+ * @IFLA_BRPORT_MCAST_TO_UCAST
+ * Controls whether a given port will replicate packets using unicast
+ * instead of multicast. By default this flag is off.
+ *
+ * This is done by copying the packet per host and changing the multicast
+ * destination MAC to a unicast one accordingly.
+ *
+ * *mcast_to_unicast* works on top of the multicast snooping feature of the
+ * bridge. Which means unicast copies are only delivered to hosts which
+ * are interested in unicast and signaled this via IGMP/MLD reports previously.
+ *
+ * This feature is intended for interface types which have a more reliable
+ * and/or efficient way to deliver unicast packets than broadcast ones
+ * (e.g. WiFi).
+ *
+ * However, it should only be enabled on interfaces where no IGMPv2/MLDv1
+ * report suppression takes place. IGMP/MLD report suppression issue is
+ * usually overcome by the network daemon (supplicant) enabling AP isolation
+ * and by that separating all STAs.
+ *
+ * Delivery of STA-to-STA IP multicast is made possible again by enabling
+ * and utilizing the bridge hairpin mode, which considers the incoming port
+ * as a potential outgoing port, too (see *BRIDGE_MODE_HAIRPIN* option).
+ * Hairpin mode is performed after multicast snooping, therefore leading
+ * to only deliver reports to STAs running a multicast router.
+ *
+ * @IFLA_BRPORT_VLAN_TUNNEL
+ * Controls whether vlan to tunnel mapping is enabled on the port.
+ * By default this flag is off.
+ *
+ * @IFLA_BRPORT_BCAST_FLOOD
+ * Controls flooding of broadcast traffic on the given port. By default
+ * this flag is on.
+ *
+ * @IFLA_BRPORT_GROUP_FWD_MASK
+ * Set the group forward mask. This is a bitmask that is applied to
+ * decide whether to forward incoming frames destined to link-local
+ * addresses. The addresses of the form are 01:80:C2:00:00:0X (defaults
+ * to 0, which means the bridge does not forward any link-local frames
+ * coming on this port).
+ *
+ * @IFLA_BRPORT_NEIGH_SUPPRESS
+ * Controls whether neighbor discovery (arp and nd) proxy and suppression
+ * is enabled on the port. By default this flag is off.
+ *
+ * @IFLA_BRPORT_ISOLATED
+ * Controls whether a given port will be isolated, which means it will be
+ * able to communicate with non-isolated ports only. By default this
+ * flag is off.
+ *
+ * @IFLA_BRPORT_BACKUP_PORT
+ * Set a backup port. If the port loses carrier all traffic will be
+ * redirected to the configured backup port. Set the value to 0 to disable
+ * it.
+ *
+ * @IFLA_BRPORT_MRP_RING_OPEN
+ *
+ * @IFLA_BRPORT_MRP_IN_OPEN
+ *
+ * @IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT
+ * The number of per-port EHT hosts limit. The default value is 512.
+ * Setting to 0 is not allowed.
+ *
+ * @IFLA_BRPORT_MCAST_EHT_HOSTS_CNT
+ * The current number of tracked hosts, read only.
+ *
+ * @IFLA_BRPORT_LOCKED
+ * Controls whether a port will be locked, meaning that hosts behind the
+ * port will not be able to communicate through the port unless an FDB
+ * entry with the units MAC address is in the FDB. The common use case is that
+ * hosts are allowed access through authentication with the IEEE 802.1X
+ * protocol or based on whitelists. By default this flag is off.
+ *
+ * @IFLA_BRPORT_MAB
+ *
+ * @IFLA_BRPORT_MCAST_N_GROUPS
+ *
+ * @IFLA_BRPORT_MCAST_MAX_GROUPS
+ * Sets the maximum number of MDB entries that can be registered for a
+ * given port. Attempts to register more MDB entries at the port than this
+ * limit allows will be rejected, whether they are done through netlink
+ * (e.g. the bridge tool), or IGMP or MLD membership reports. Setting a
+ * limit to 0 disables the limit. The default value is 0.
+ *
+ * @IFLA_BRPORT_NEIGH_VLAN_SUPPRESS
+ * Controls whether neighbor discovery (arp and nd) proxy and suppression is
+ * enabled for a given VLAN on a given port. By default this flag is off.
+ *
+ * Note that this option only takes effect when *IFLA_BRPORT_NEIGH_SUPPRESS*
+ * is enabled for a given port.
+ *
+ * @IFLA_BRPORT_BACKUP_NHID
+ * The FDB nexthop object ID to attach to packets being redirected to a
+ * backup port that has VLAN tunnel mapping enabled (via the
+ * *IFLA_BRPORT_VLAN_TUNNEL* option). Setting a value of 0 (default) has
+ * the effect of not attaching any ID.
+ */
+
enum {
IFLA_BRPORT_UNSPEC,
IFLA_BRPORT_STATE, /* Spanning tree state */
--
2.41.0
^ permalink raw reply related
* [RFC PATCHv3 net-next 01/10] net: bridge: add document for IFLA_BR enum
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
In-Reply-To: <20231110101548.1900519-1-liuhangbin@gmail.com>
Add document for IFLA_BR enum so we can use it in
Documentation/networking/bridge.rst.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
include/uapi/linux/if_link.h | 270 +++++++++++++++++++++++++++++++++++
1 file changed, 270 insertions(+)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 29ff80da2775..32d6980b78d1 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -461,6 +461,276 @@ enum in6_addr_gen_mode {
/* Bridge section */
+/**
+ * DOC: The bridge enum defination
+ *
+ * please *note* that all the timer in the following section is in clock_t
+ * type, which is seconds multiplied by user_hz (generally defined as 100).
+ *
+ * @IFLA_BR_FORWARD_DELAY
+ * The bridge forwarding delay is the time spent in LISTENING state
+ * (before moving to LEARNING) and in LEARNING state (before moving
+ * to FORWARDING). Only relevant if STP is enabled.
+ *
+ * The valid values are between (2 * USER_HZ) and (30 * USER_HZ).
+ * The default value is (15 * USER_HZ).
+ *
+ * @IFLA_BR_HELLO_TIME
+ * The time between hello packets sent by the bridge, when it is a root
+ * bridge or a designated bridge. Only relevant if STP is enabled.
+ *
+ * The valid values are between (1 * USER_HZ) and (10 * USER_HZ).
+ * The default value is (2 * USER_HZ).
+ *
+ * @IFLA_BR_MAX_AGE
+ * The hello packet timeout, is the time until another bridge in the
+ * spanning tree is assumed to be dead, after reception of its last hello
+ * message. Only relevant if STP is enabled.
+ *
+ * The valid values are between (6 * USER_HZ) and (40 * USER_HZ).
+ * The default value is (20 * USER_HZ).
+ *
+ * @IFLA_BR_AGEING_TIME
+ * Configure the bridge's FDB entries ageing time. It is the time a MAC
+ * address will be kept in the FDB after a packet has been received from
+ * that address. After this time has passed, entries are cleaned up.
+ * Allow values outside the 802.1 standard specification for special cases:
+ *
+ * * 0 - entry never ages (all permanent)
+ * * 1 - entry disappears (no persistence)
+ *
+ * The default value is (300 * USER_HZ).
+ *
+ * @IFLA_BR_STP_STATE
+ * Turn spanning tree protocol on (*IFLA_BR_STP_STATE* > 0) or off
+ * (*IFLA_BR_STP_STATE* == 0) for this bridge.
+ *
+ * The default value is 0 (disabled).
+ *
+ * @IFLA_BR_PRIORITY
+ * set this bridge's spanning tree priority, used during STP root bridge
+ * election.
+ *
+ * The valid values are between 0 and 65535.
+ *
+ * @IFLA_BR_VLAN_FILTERING
+ * Turn VLAN filtering on (*IFLA_BR_VLAN_FILTERING* > 0) or off
+ * (*IFLA_BR_VLAN_FILTERING* == 0). When disabled, the bridge will not
+ * consider the VLAN tag when handling packets.
+ *
+ * The default value is 0 (disabled).
+ *
+ * @IFLA_BR_VLAN_PROTOCOL
+ * Set the protocol used for VLAN filtering.
+ *
+ * The valid values are 0x8100(802.1Q) or 0x88A8(802.1AD). The default value
+ * is 0x8100(802.1Q).
+ *
+ * @IFLA_BR_GROUP_FWD_MASK
+ * The group forward mask. This is the bitmask that is applied to
+ * decide whether to forward incoming frames destined to link-local
+ * addresses. The addresses of the form is 01:80:C2:00:00:0X, which
+ * means the bridge does not forward any link-local frames coming on
+ * this port).
+ *
+ * The default value is 0.
+ *
+ * @IFLA_BR_ROOT_ID
+ * The bridge root id, read only.
+ *
+ * @IFLA_BR_BRIDGE_ID
+ * The bridge id, read only.
+ *
+ * @IFLA_BR_ROOT_PORT
+ * The bridge root port, read only.
+ *
+ * @IFLA_BR_ROOT_PATH_COST
+ * The bridge root path cost, read only.
+ *
+ * @IFLA_BR_TOPOLOGY_CHANGE
+ * The bridge topology change, read only.
+ *
+ * @IFLA_BR_TOPOLOGY_CHANGE_DETECTED
+ * The bridge topology change detected, read only.
+ *
+ * @IFLA_BR_HELLO_TIMER
+ * The bridge hello timer, read only.
+ *
+ * @IFLA_BR_TCN_TIMER
+ * The bridge tcn timer, read only.
+ *
+ * @IFLA_BR_TOPOLOGY_CHANGE_TIMER
+ * The bridge topology change timer, read only.
+ *
+ * @IFLA_BR_GC_TIMER
+ * The bridge gc timer, read only.
+ *
+ * @IFLA_BR_GROUP_ADDR
+ * set the MAC address of the multicast group this bridge uses for STP.
+ * The address must be a link-local address in standard Ethernet MAC address
+ * format. It is an address of the form 01:80:C2:00:00:0X, with X in [0, 4..f].
+ *
+ * The default value is 0.
+ *
+ * @IFLA_BR_FDB_FLUSH
+ * Flush bridge's fdb dynamic entries.
+ *
+ * @IFLA_BR_MCAST_ROUTER
+ * Set bridge's multicast router if IGMP snooping is enabled.
+ * The valid values are:
+ *
+ * * 0 - disabled.
+ * * 1 - automatic (queried).
+ * * 2 - permanently enabled.
+ *
+ * The default value is 1.
+ *
+ * @IFLA_BR_MCAST_SNOOPING
+ * Turn multicast snooping on (*IFLA_BR_MCAST_SNOOPING* > 0) or off
+ * (*IFLA_BR_MCAST_SNOOPING* == 0).
+ *
+ * The default value is 1.
+ *
+ * @IFLA_BR_MCAST_QUERY_USE_IFADDR
+ * whether to use the bridge's own IP address as source address for IGMP
+ * queries (*IFLA_BR_MCAST_QUERY_USE_IFADDR* > 0) or the default of 0.0.0.0
+ * (*IFLA_BR_MCAST_QUERY_USE_IFADDR* == 0).
+ *
+ * The default value is 0.
+ *
+ * @IFLA_BR_MCAST_QUERIER
+ * Enable (*IFLA_BR_MULTICAST_QUERIER* > 0) or disable
+ * (*IFLA_BR_MULTICAST_QUERIER* == 0) IGMP querier, ie sending of multicast
+ * queries by the bridge.
+ *
+ * The default value is 0 (disabled).
+ *
+ * @IFLA_BR_MCAST_HASH_ELASTICITY
+ * Set multicast database hash elasticity, It is the maximum chain length in
+ * the multicast hash table. This attribute is *deprecated* and the value
+ * is always 16.
+ *
+ * @IFLA_BR_MCAST_HASH_MAX
+ * Set maximum size of the multicast hash table
+ *
+ * The default value is 4096, value must be a power of 2.
+ *
+ * @IFLA_BR_MCAST_LAST_MEMBER_CNT
+ * The Last Member Query Count is the number of Group-Specific Queries
+ * sent before the router assumes there are no local members. The Last
+ * Member Query Count is also the number of Group-and-Source-Specific
+ * Queries sent before the router assumes there are no listeners for a
+ * particular source.
+ *
+ * The default value is 2.
+ *
+ * @IFLA_BR_MCAST_STARTUP_QUERY_CNT
+ * The Startup Query Count is the number of Queries sent out on startup,
+ * separated by the Startup Query Interval.
+ *
+ * The default value is 2.
+ *
+ * @IFLA_BR_MCAST_LAST_MEMBER_INTVL
+ * The Last Member Query Interval is the Max Response Time inserted into
+ * Group-Specific Queries sent in response to Leave Group messages, and
+ * is also the amount of time between Group-Specific Query messages.
+ *
+ * The default value is (1 * USER_HZ).
+ *
+ * @IFLA_BR_MCAST_MEMBERSHIP_INTVL
+ * The interval after which the bridge will leave a group, if no membership
+ * reports for this group are received.
+ *
+ * The default value is (260 * USER_HZ).
+ *
+ * @IFLA_BR_MCAST_QUERIER_INTVL
+ * The interval between queries sent by other routers. if no queries are
+ * seen after this delay has passed, the bridge will start to send its own
+ * queries (as if **IFLA_BR_MCAST_QUERIER_INTVL** was enabled).
+ *
+ * The default value is (255 * USER_HZ).
+ *
+ * @IFLA_BR_MCAST_QUERY_INTVL
+ * The Query Interval is the interval between General Queries sent by
+ * the Querier.
+ *
+ * The default value is (125 * USER_HZ). The minimum value is (1 * USER_HZ).
+ *
+ * @IFLA_BR_MCAST_QUERY_RESPONSE_INTVL
+ * The Max Response Time used to calculate the Max Resp Code inserted
+ * into the periodic General Queries.
+ *
+ * The default value is (10 * USER_HZ).
+ *
+ * @IFLA_BR_MCAST_STARTUP_QUERY_INTVL
+ * The interval between queries in the startup phase.
+ *
+ * The default value is (125 * USER_HZ) / 4. The minimum value is (1 * USER_HZ).
+ *
+ * @IFLA_BR_NF_CALL_IPTABLES
+ * Enable (*NF_CALL_IPTABLES* > 0) or disable (*NF_CALL_IPTABLES* == 0)
+ * iptables hooks on the bridge.
+ *
+ * The default value is 0 (disabled).
+ *
+ * @IFLA_BR_NF_CALL_IP6TABLES
+ * Enable (*NF_CALL_IP6TABLES* > 0) or disable (*NF_CALL_IP6TABLES* == 0)
+ * ip6tables hooks on the bridge.
+ *
+ * The default value is 0 (disabled).
+ *
+ * @IFLA_BR_NF_CALL_ARPTABLES
+ * Enable (*NF_CALL_ARPTABLES* > 0) or disable (*NF_CALL_ARPTABLES* == 0)
+ * arptables hooks on the bridge.
+ *
+ * The default value is 0 (disabled).
+ *
+ * @IFLA_BR_VLAN_DEFAULT_PVID
+ * The default PVID (native/untagged VLAN ID) for this bridge.
+ *
+ * The default value is 1.
+ *
+ * @IFLA_BR_PAD
+ * Bridge attribute padding type for netlink message.
+ *
+ * @IFLA_BR_VLAN_STATS_ENABLED
+ * Enable (*IFLA_BR_VLAN_STATS_ENABLED* == 1) or disable
+ * (*IFLA_BR_VLAN_STATS_ENABLED* == 0) per-VLAN stats accounting.
+ *
+ * The default value is 0 (disabled).
+ *
+ * @IFLA_BR_MCAST_STATS_ENABLED
+ * Enable (*IFLA_BR_MCAST_STATS_ENABLED* > 0) or disable
+ * (*IFLA_BR_MCAST_STATS_ENABLED* == 0) multicast (IGMP/MLD) stats
+ * accounting.
+ *
+ * The default value is 0 (disabled).
+ *
+ * @IFLA_BR_MCAST_IGMP_VERSION
+ * Set the IGMP version.
+ *
+ * The valid values are 2 and 3. The default value is 2.
+ *
+ * @IFLA_BR_MCAST_MLD_VERSION
+ * Set the MLD version.
+ *
+ * The valid values are 1 and 2. The default value is 1.
+ *
+ * @IFLA_BR_VLAN_STATS_PER_PORT
+ * Enable (*IFLA_BR_VLAN_STATS_PER_PORT* == 1) or disable
+ * (*IFLA_BR_VLAN_STATS_PER_PORT* == 0) per-VLAN per-port stats accounting.
+ * Can be changed only when there are no port VLANs configured.
+ *
+ * The default value is 0 (disabled).
+ *
+ * @IFLA_BR_MULTI_BOOLOPT
+ * The multi_boolopt is used to control new boolean options to avoid adding
+ * new netlink attributes.
+ *
+ * @IFLA_BR_MCAST_QUERIER_STATE
+ * Bridge mcast querier states, read only.
+ */
+
enum {
IFLA_BR_UNSPEC,
IFLA_BR_FORWARD_DELAY,
--
2.41.0
^ permalink raw reply related
* [RFC PATCHv3 net-next 00/10] Doc: update bridge doc
From: Hangbin Liu @ 2023-11-10 10:15 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
Stephen Hemminger, Florian Westphal, Andrew Lunn,
Florian Fainelli, Vladimir Oltean, Jiri Pirko, Hangbin Liu
The current bridge kernel doc is too old. It only pointed to the
linuxfoundation wiki page which lacks of the new features.
Here let's start the new bridge document and put all the bridge info
so new developers and users could catch up the last bridge status soon.
Something I'd like to do in the future:
- add iproute2 cmd examples for each feature
v3:
- Update netfilter part (Florian Westphal)
- Break the one large patch in to multiparts for easy reviewing. Please tell
me if I break it too much.. (Nikolay Aleksandro)
- Update the description of each enum and doc (Nikolay Aleksandro)
- Add more descriptions for STP/Multicast/VLAN.
v2:
- Drop the python tool that generate iproute man page from kernel doc
Hangbin Liu (10):
net: bridge: add document for IFLA_BR enum
net: bridge: add document for IFLA_BRPORT enum
net: bridge: add document for bridge sysfs attribute
docs: bridge: Add kAPI/uAPI fields
docs: bridge: add STP doc
docs: bridge: add VLAN doc
docs: bridge: add multicast doc
docs: bridge: add switchdev doc
docs: bridge: add netfilter doc
docs: bridge: add small features
Documentation/networking/bridge.rst | 320 +++++++++++++++++-
include/uapi/linux/if_link.h | 497 ++++++++++++++++++++++++++++
net/bridge/br_sysfs_br.c | 93 ++++++
3 files changed, 900 insertions(+), 10 deletions(-)
--
2.41.0
^ permalink raw reply
* Re: [PATCH net-next] packet: add a generic drop reason for receive
From: Jiri Pirko @ 2023-11-10 10:13 UTC (permalink / raw)
To: Eric Dumazet
Cc: Yan Zhai, netdev, David S. Miller, Jakub Kicinski, Paolo Abeni,
Willem de Bruijn, Weongyo Jeong, Ivan Babrou, David Ahern,
Jesper Brouer, linux-kernel, kernel-team
In-Reply-To: <CANn89iKZYsWGT1weXZ6W7_z28dqJwTZeg+2_Lw+x+6spUHp8Eg@mail.gmail.com>
Fri, Nov 10, 2023 at 10:30:49AM CET, edumazet@google.com wrote:
>On Fri, Nov 10, 2023 at 6:49 AM Yan Zhai <yan@cloudflare.com> wrote:
[..]
>1) Note that net-next is currently closed.
I wonder, can't some bot be easily set up to warn about
this automatically?
^ permalink raw reply
* Re: [PATCH net 0/3] dpll: fix unordered unbind/bind registerer issues
From: Jiri Pirko @ 2023-11-10 10:09 UTC (permalink / raw)
To: Kubalewski, Arkadiusz
Cc: Vadim Fedorenko, netdev@vger.kernel.org, Michalik, Michal,
Olech, Milena, pabeni@redhat.com, kuba@kernel.org
In-Reply-To: <DM6PR11MB4657DE812ADB8C5079705DC99BAEA@DM6PR11MB4657.namprd11.prod.outlook.com>
Fri, Nov 10, 2023 at 10:06:59AM CET, arkadiusz.kubalewski@intel.com wrote:
>>From: Jiri Pirko <jiri@resnulli.us>
>>Sent: Friday, November 10, 2023 7:49 AM
>>
>>Fri, Nov 10, 2023 at 12:35:43AM CET, arkadiusz.kubalewski@intel.com wrote:
>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>Sent: Thursday, November 9, 2023 7:07 PM
>>>>
>>>>Thu, Nov 09, 2023 at 06:20:14PM CET, arkadiusz.kubalewski@intel.com
>>>>wrote:
>>>>>>From: Vadim Fedorenko <vadim.fedorenko@linux.dev>
>>>>>>Sent: Thursday, November 9, 2023 11:51 AM
>>>>>>
>>>>>>On 08/11/2023 10:32, Arkadiusz Kubalewski wrote:
>>>>>>> Fix issues when performing unordered unbind/bind of a kernel modules
>>>>>>> which are using a dpll device with DPLL_PIN_TYPE_MUX pins.
>>>>>>> Currently only serialized bind/unbind of such use case works, fix
>>>>>>> the issues and allow for unserialized kernel module bind order.
>>>>>>>
>>>>>>> The issues are observed on the ice driver, i.e.,
>>>>>>>
>>>>>>> $ echo 0000:af:00.0 > /sys/bus/pci/drivers/ice/unbind
>>>>>>> $ echo 0000:af:00.1 > /sys/bus/pci/drivers/ice/unbind
>>>>>>>
>>>>>>> results in:
>>>>>>>
>>>>>>> ice 0000:af:00.0: Removed PTP clock
>>>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000010
>>>>>>> PF: supervisor read access in kernel mode
>>>>>>> PF: error_code(0x0000) - not-present page
>>>>>>> PGD 0 P4D 0
>>>>>>> Oops: 0000 [#1] PREEMPT SMP PTI
>>>>>>> CPU: 7 PID: 71848 Comm: bash Kdump: loaded Not tainted 6.6.0-
>>>>>>>rc5_next-
>>>>>>>queue_19th-Oct-2023-01625-g039e5d15e451 #1
>>>>>>> Hardware name: Intel Corporation S2600STB/S2600STB, BIOS
>>>>>>>SE5C620.86B.02.01.0008.031920191559 03/19/2019
>>>>>>> RIP: 0010:ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
>>>>>>> Code: 41 57 4d 89 cf 41 56 41 55 4d 89 c5 41 54 55 48 89 f5 53 4c 8b
>>>>>>>66
>>>>>>>08 48 89 cb 4d 8d b4 24 f0 49 00 00 4c 89 f7 e8 71 ec 1f c5 <0f> b6 5b
>>>>>>>10
>>>>>>>41 0f b6 84 24 30 4b 00 00 29 c3 41 0f b6 84 24 28 4b
>>>>>>> RSP: 0018:ffffc902b179fb60 EFLAGS: 00010246
>>>>>>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>> RDX: ffff8882c1398000 RSI: ffff888c7435cc60 RDI: ffff888c7435cb90
>>>>>>> RBP: ffff888c7435cc60 R08: ffffc902b179fbb0 R09: 0000000000000000
>>>>>>> R10: ffff888ef1fc8050 R11: fffffffffff82700 R12: ffff888c743581a0
>>>>>>> R13: ffffc902b179fbb0 R14: ffff888c7435cb90 R15: 0000000000000000
>>>>>>> FS: 00007fdc7dae0740(0000) GS:ffff888c105c0000(0000)
>>>>>>>knlGS:0000000000000000
>>>>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> CR2: 0000000000000010 CR3: 0000000132c24002 CR4: 00000000007706e0
>>>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>> PKRU: 55555554
>>>>>>> Call Trace:
>>>>>>> <TASK>
>>>>>>> ? __die+0x20/0x70
>>>>>>> ? page_fault_oops+0x76/0x170
>>>>>>> ? exc_page_fault+0x65/0x150
>>>>>>> ? asm_exc_page_fault+0x22/0x30
>>>>>>> ? ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
>>>>>>> ? __pfx_ice_dpll_rclk_state_on_pin_get+0x10/0x10 [ice]
>>>>>>> dpll_msg_add_pin_parents+0x142/0x1d0
>>>>>>> dpll_pin_event_send+0x7d/0x150
>>>>>>> dpll_pin_on_pin_unregister+0x3f/0x100
>>>>>>> ice_dpll_deinit_pins+0xa1/0x230 [ice]
>>>>>>> ice_dpll_deinit+0x29/0xe0 [ice]
>>>>>>> ice_remove+0xcd/0x200 [ice]
>>>>>>> pci_device_remove+0x33/0xa0
>>>>>>> device_release_driver_internal+0x193/0x200
>>>>>>> unbind_store+0x9d/0xb0
>>>>>>> kernfs_fop_write_iter+0x128/0x1c0
>>>>>>> vfs_write+0x2bb/0x3e0
>>>>>>> ksys_write+0x5f/0xe0
>>>>>>> do_syscall_64+0x59/0x90
>>>>>>> ? filp_close+0x1b/0x30
>>>>>>> ? do_dup2+0x7d/0xd0
>>>>>>> ? syscall_exit_work+0x103/0x130
>>>>>>> ? syscall_exit_to_user_mode+0x22/0x40
>>>>>>> ? do_syscall_64+0x69/0x90
>>>>>>> ? syscall_exit_work+0x103/0x130
>>>>>>> ? syscall_exit_to_user_mode+0x22/0x40
>>>>>>> ? do_syscall_64+0x69/0x90
>>>>>>> entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>>>>>>> RIP: 0033:0x7fdc7d93eb97
>>>>>>> Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f
>>>>>>>1e
>>>>>>>fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00
>>>>>>>f0
>>>>>>>ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>>>>>> RSP: 002b:00007fff2aa91028 EFLAGS: 00000246 ORIG_RAX:
>>>>>>>0000000000000001
>>>>>>> RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fdc7d93eb97
>>>>>>> RDX: 000000000000000d RSI: 00005644814ec9b0 RDI: 0000000000000001
>>>>>>> RBP: 00005644814ec9b0 R08: 0000000000000000 R09: 00007fdc7d9b14e0
>>>>>>> R10: 00007fdc7d9b13e0 R11: 0000000000000246 R12: 000000000000000d
>>>>>>> R13: 00007fdc7d9fb780 R14: 000000000000000d R15: 00007fdc7d9f69e0
>>>>>>> </TASK>
>>>>>>> Modules linked in: uinput vfio_pci vfio_pci_core vfio_iommu_type1
>>>>>>>vfio
>>>>>>>irqbypass ixgbevf snd_seq_dummy snd_hrtimer snd_seq snd_timer
>>>>>>>snd_seq_device snd soundcore overlay qrtr rfkill vfat fat xfs
>>>>>>>libcrc32c
>>>>>>>rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod
>>>>>>>target_core_mod
>>>>>>>ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm
>>>>>>>intel_rapl_msr
>>>>>>>intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
>>>>>>>isst_if_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal
>>>>>>>intel_powerclamp coretemp irdma rapl intel_cstate ib_uverbs iTCO_wdt
>>>>>>>iTCO_vendor_support acpi_ipmi intel_uncore mei_me ipmi_si pcspkr
>>>>>>>i2c_i801
>>>>>>>ib_core mei ipmi_devintf intel_pch_thermal ioatdma i2c_smbus
>>>>>>>ipmi_msghandler lpc_ich joydev acpi_power_meter acpi_pad ext4 mbcache
>>>>>>>jbd2
>>>>>>>sd_mod t10_pi sg ast i2c_algo_bit drm_shmem_helper drm_kms_helper ice
>>>>>>>crct10dif_pclmul ixgbe crc32_pclmul drm crc32c_intel ahci i40e libahci
>>>>>>>ghash_clmulni_intel libata mdio dca gnss wmi fuse [last unloaded:
>>>>>>>iavf]
>>>>>>> CR2: 0000000000000010
>>>>>>>
>>>>>>> Arkadiusz Kubalewski (3):
>>>>>>> dpll: fix pin dump crash after module unbind
>>>>>>> dpll: fix pin dump crash for rebound module
>>>>>>> dpll: fix register pin with unregistered parent pin
>>>>>>>
>>>>>>> drivers/dpll/dpll_core.c | 8 ++------
>>>>>>> drivers/dpll/dpll_core.h | 4 ++--
>>>>>>> drivers/dpll/dpll_netlink.c | 37 ++++++++++++++++++++++------------
>>>>>>>--
>>>>>>>-
>>>>>>> 3 files changed, 26 insertions(+), 23 deletions(-)
>>>>>>>
>>>>>>
>>>>>>
>>>>>>I still don't get how can we end up with unregistered pin. And
>>>>>>shouldn't
>>>>>>drivers do unregister of dpll/pin during release procedure? I thought
>>>>>>it
>>>>>>was kind of agreement we reached while developing the subsystem.
>>>>>>
>>>>>
>>>>>It's definitely not about ending up with unregistered pins.
>>>>>
>>>>>Usually the driver is loaded for PF0, PF1, PF2, PF3 and unloaded in
>>>>>opposite
>>>>>order: PF3, PF2, PF1, PF0. And this is working without any issues.
>>>>
>>>>Please fix this in the driver.
>>>>
>>>
>>>Thanks for your feedback, but this is already wrong advice.
>>>
>>>Our HW/FW is designed in different way than yours, it doesn't mean it is
>>>wrong.
>>>As you might recall from our sync meetings, the dpll subsystem is to unify
>>>approaches and reduce the code in the drivers, where your advice is
>>>exactly
>>>opposite, suggested fix would require to implement extra synchronization
>>>of the
>>>dpll and pin registration state between driver instances, most probably
>>>with
>>>use of additional modules like aux-bus or something similar, which was
>>>from the
>>>very beginning something we tried to avoid.
>>>Only ice uses the infrastructure of muxed pins, and this is broken as it
>>>doesn't allow unbind the driver which have registered dpll and pins
>>>without
>>>crashing the kernel, so a fix is required in dpll subsystem, not in the
>>>driver.
>>
>>I replied in the other patch thread.
>>
>
>Yes, so did I.
>But what is the reason you have moved the discussion from the other thread
>into this one?
I didn't, not sure why you say so. I just wanted to make sure you
follow.
>
>Thank you!
>Arkadiusz
>
>>
>>>
>>>Thank you!
>>>Arkadiusz
>>>
>>>>
>>>>>
>>>>>Above crash is caused because of unordered driver unload, where dpll
>>>>>subsystem
>>>>>tries to notify muxed pin was deleted, but at that time the parent is
>>>>>already
>>>>>gone, thus data points to memory which is no longer available, thus
>>>>>crash
>>>>>happens when trying to dump pin parents.
>>>>>
>>>>>This series fixes all issues I could find connected to the situation
>>>>>where
>>>>>muxed-pins are trying to access their parents, when parent registerer
>>>>>was
>>>>>removed
>>>>>in the meantime.
>>>>>
>>>>>Thank you!
>>>>>Arkadiusz
^ permalink raw reply
* Re: [PATCH net 3/3] dpll: fix register pin with unregistered parent pin
From: Jiri Pirko @ 2023-11-10 10:07 UTC (permalink / raw)
To: Kubalewski, Arkadiusz
Cc: Vadim Fedorenko, netdev@vger.kernel.org, Michalik, Michal,
Olech, Milena, pabeni@redhat.com, kuba@kernel.org
In-Reply-To: <DM6PR11MB465767D550C3DF2CE24E70EA9BAEA@DM6PR11MB4657.namprd11.prod.outlook.com>
Fri, Nov 10, 2023 at 09:50:34AM CET, arkadiusz.kubalewski@intel.com wrote:
>>From: Jiri Pirko <jiri@resnulli.us>
>>Sent: Friday, November 10, 2023 7:48 AM
>>
>>Fri, Nov 10, 2023 at 12:21:11AM CET, arkadiusz.kubalewski@intel.com wrote:
>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>Sent: Thursday, November 9, 2023 7:04 PM
>>>>
>>>>Thu, Nov 09, 2023 at 05:02:48PM CET, arkadiusz.kubalewski@intel.com
>>>>wrote:
>>>>>>From: Vadim Fedorenko <vadim.fedorenko@linux.dev>
>>>>>>Sent: Thursday, November 9, 2023 11:56 AM
>>>>>>To: Kubalewski, Arkadiusz <arkadiusz.kubalewski@intel.com>; Jiri Pirko
>>>>>>
>>>>>>On 09/11/2023 09:59, Kubalewski, Arkadiusz wrote:
>>>>>>>> From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>> Sent: Wednesday, November 8, 2023 4:08 PM
>>>>>>>>
>>>>>>>> Wed, Nov 08, 2023 at 11:32:26AM CET, arkadiusz.kubalewski@intel.com
>>>>>>>> wrote:
>>>>>>>>> In case of multiple kernel module instances using the same dpll
>>>>>>>>>device:
>>>>>>>>> if only one registers dpll device, then only that one can register
>>>>>>>>
>>>>>>>> They why you don't register in multiple instances? See mlx5 for a
>>>>>>>> reference.
>>>>>>>>
>>>>>>>
>>>>>>> Every registration requires ops, but for our case only PF0 is able to
>>>>>>> control dpll pins and device, thus only this can provide ops.
>>>>>>> Basically without PF0, dpll is not able to be controlled, as well
>>>>>>> as directly connected pins.
>>>>>>>
>>>>>>But why do you need other pins then, if FP0 doesn't exist?
>>>>>>
>>>>>
>>>>>In general we don't need them at that point, but this is a corner case,
>>>>>where users for some reason decided to unbind PF 0, and I treat this
>>>>>state
>>>>>as temporary, where dpll/pins controllability is temporarily broken.
>>>>
>>>>So resolve this broken situation internally in the driver, registering
>>>>things only in case PF0 is present. Some simple notification infra would
>>>>do. Don't drag this into the subsystem internals.
>>>>
>>>
>>>Thanks for your feedback, but this is already wrong advice.
>>>
>>>Our HW/FW is designed in different way than yours, it doesn't mean it is
>>>wrong.
>>>As you might recall from our sync meetings, the dpll subsystem is to unify
>>>approaches and reduce the code in the drivers, where your advice is
>>>exactly
>>
>>No. Your driver knows when what objects are valid or not. Of a pin of
>>PF1 is not valid because the "master" PF0 is gone, it is responsibility
>>of your driver to resolve that. Don't bring this internal dependencies
>>to the dpll core please, does not make any sense to do so. Thanks!
>>
>
>No, a driver doesn't know it, those are separated instances, and you already
>suggested to implement special notification bus in the driver.
>This is not needed and prone for another errors. The dpll subsystem is here to
>make driver life easier.
See the other thread for my reply.
>
>Thank you!
>Arkadiusz
>
>>
>>>opposite, suggested fix would require to implement extra synchronization
>>>of the
>>>dpll and pin registration state between driver instances, most probably
>>>with
>>>use of additional modules like aux-bus or something similar, which was
>>>from the
>>>very beginning something we tried to avoid.
>>>Only ice uses the infrastructure of muxed pins, and this is broken as it
>>>doesn't allow unbind the driver which have registered dpll and pins
>>>without
>>>crashing the kernel, so a fix is required in dpll subsystem, not in the
>>>driver.
>>>
>>>Thank you!
>>>Arkadiusz
>>>
>>>>
>>>>>
>>>>>The dpll at that point is not registered, all the direct pins are also
>>>>>not registered, thus not available to the users.
>>>>>
>>>>>When I do dump at that point there are still 3 pins present, one for
>>>>>each
>>>>>PF, although they are all zombies - no parents as their parent pins are
>>>>>not
>>>>>registered (as the other patch [1/3] prevents dump of pin parent if the
>>>>>parent is not registered). Maybe we can remove the REGISTERED mark for
>>>>>all
>>>>>the muxed pins, if all their parents have been unregistered, so they
>>>>>won't
>>>>>be visible to the user at all. Will try to POC that.
>>>>>
>>>>>>>>
>>>>>>>>> directly connected pins with a dpll device. If unregistered parent
>>>>>>>>> determines if the muxed pin can be register with it or not, it
>>>>>>>>>forces
>>>>>>>>> serialized driver load order - first the driver instance which
>>>>>>>>> registers the direct pins needs to be loaded, then the other
>>>>>>>>> instances
>>>>>>>>> could register muxed type pins.
>>>>>>>>>
>>>>>>>>> Allow registration of a pin with a parent even if the parent was
>>>>>>>>>not
>>>>>>>>> yet registered, thus allow ability for unserialized driver instance
>>>>>>>>
>>>>>>>> Weird.
>>>>>>>>
>>>>>>>
>>>>>>> Yeah, this is issue only for MUX/parent pin part, couldn't find
>>>>>>>better
>>>>>>> way, but it doesn't seem to break things around..
>>>>>>>
>>>>>>
>>>>>>I just wonder how do you see the registration procedure? How can parent
>>>>>>pin exist if it's not registered? I believe you cannot get it through
>>>>>>DPLL API, then the only possible way is to create it within the same
>>>>>>driver code, which can be simply re-arranged. Am I wrong here?
>>>>>>
>>>>>
>>>>>By "parent exist" I mean the parent pin exist in the dpll subsystem
>>>>>(allocated on pins xa), but it doesn't mean it is available to the
>>>>>users,
>>>>>as it might not be registered with a dpll device.
>>>>>
>>>>>We have this 2 step init approach:
>>>>>1. dpll_pin_get(..) -> allocate new pin or increase reference if exist
>>>>>2.1. dpll_pin_register(..) -> register with a dpll device
>>>>>2.2. dpll_pin_on_pin_register -> register with a parent pin
>>>>>
>>>>>Basically:
>>>>>- PF 0 does 1 & 2.1 for all the direct inputs, and steps: 1 & 2.2 for
>>>>>its
>>>>> recovery clock pin,
>>>>>- other PF's only do step 1 for the direct input pins (as they must get
>>>>> reference to those in order to register recovery clock pin with them),
>>>>> and steps: 1 & 2.2 for their recovery clock pin.
>>>>>
>>>>>
>>>>>Thank you!
>>>>>Arkadiusz
>>>>>
>>>>>>> Thank you!
>>>>>>> Arkadiusz
>>>>>>>
>>>>>>>>
>>>>>>>>> load order.
>>>>>>>>> Do not WARN_ON notification for unregistered pin, which can be
>>>>>>>>> invoked
>>>>>>>>> for described case, instead just return error.
>>>>>>>>>
>>>>>>>>> Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base
>>>>>>>>>functions")
>>>>>>>>> Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base
>>>>>>>>> functions")
>>>>>>>>> Signed-off-by: Arkadiusz Kubalewski
>>>>>>>>><arkadiusz.kubalewski@intel.com>
>>>>>>>>> ---
>>>>>>>>> drivers/dpll/dpll_core.c | 4 ----
>>>>>>>>> drivers/dpll/dpll_netlink.c | 2 +-
>>>>>>>>> 2 files changed, 1 insertion(+), 5 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c
>>>>>>>>> index
>>>>>>>>> 4077b562ba3b..ae884b92d68c 100644
>>>>>>>>> --- a/drivers/dpll/dpll_core.c
>>>>>>>>> +++ b/drivers/dpll/dpll_core.c
>>>>>>>>> @@ -28,8 +28,6 @@ static u32 dpll_xa_id;
>>>>>>>>> WARN_ON_ONCE(!xa_get_mark(&dpll_device_xa, (d)->id,
>>>>>>>>> DPLL_REGISTERED))
>>>>>>>>> #define ASSERT_DPLL_NOT_REGISTERED(d) \
>>>>>>>>> WARN_ON_ONCE(xa_get_mark(&dpll_device_xa, (d)->id,
>>>>>>>>> DPLL_REGISTERED))
>>>>>>>>> -#define ASSERT_PIN_REGISTERED(p) \
>>>>>>>>> - WARN_ON_ONCE(!xa_get_mark(&dpll_pin_xa, (p)->id,
>>>>>>>>> DPLL_REGISTERED))
>>>>>>>>>
>>>>>>>>> struct dpll_device_registration {
>>>>>>>>> struct list_head list;
>>>>>>>>> @@ -641,8 +639,6 @@ int dpll_pin_on_pin_register(struct dpll_pin
>>>>>>>>> *parent,
>>>>>>>>> struct dpll_pin *pin,
>>>>>>>>> WARN_ON(!ops->state_on_pin_get) ||
>>>>>>>>> WARN_ON(!ops->direction_get))
>>>>>>>>> return -EINVAL;
>>>>>>>>> - if (ASSERT_PIN_REGISTERED(parent))
>>>>>>>>> - return -EINVAL;
>>>>>>>>>
>>>>>>>>> mutex_lock(&dpll_lock);
>>>>>>>>> ret = dpll_xa_ref_pin_add(&pin->parent_refs, parent, ops,
>>>>>>>>> priv); diff
>>>>>>>>> --git a/drivers/dpll/dpll_netlink.c b/drivers/dpll/dpll_netlink.c
>>>>>>>>> index
>>>>>>>>> 963bbbbe6660..ff430f43304f 100644
>>>>>>>>> --- a/drivers/dpll/dpll_netlink.c
>>>>>>>>> +++ b/drivers/dpll/dpll_netlink.c
>>>>>>>>> @@ -558,7 +558,7 @@ dpll_pin_event_send(enum dpll_cmd event, struct
>>>>>>>>> dpll_pin *pin)
>>>>>>>>> int ret = -ENOMEM;
>>>>>>>>> void *hdr;
>>>>>>>>>
>>>>>>>>> - if (WARN_ON(!xa_get_mark(&dpll_pin_xa, pin->id,
>>>>>>>>> DPLL_REGISTERED)))
>>>>>>>>> + if (!xa_get_mark(&dpll_pin_xa, pin->id, DPLL_REGISTERED))
>>>>>>>>> return -ENODEV;
>>>>>>>>>
>>>>>>>>> msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
>>>>>>>>> --
>>>>>>>>> 2.38.1
>>>>>>>>>
>>>>>>
>>>>>
^ permalink raw reply
* Re: [PATCH net 2/3] dpll: fix pin dump crash for rebound module
From: Jiri Pirko @ 2023-11-10 10:06 UTC (permalink / raw)
To: Kubalewski, Arkadiusz
Cc: netdev@vger.kernel.org, vadim.fedorenko@linux.dev,
Michalik, Michal, Olech, Milena, pabeni@redhat.com,
kuba@kernel.org
In-Reply-To: <DM6PR11MB4657B61E86D5DFFAF83BC1E59BAEA@DM6PR11MB4657.namprd11.prod.outlook.com>
Fri, Nov 10, 2023 at 10:01:50AM CET, arkadiusz.kubalewski@intel.com wrote:
>>From: Jiri Pirko <jiri@resnulli.us>
>>Sent: Friday, November 10, 2023 7:46 AM
>>
>>Fri, Nov 10, 2023 at 12:32:21AM CET, arkadiusz.kubalewski@intel.com wrote:
>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>Sent: Thursday, November 9, 2023 7:06 PM
>>>>
>>>>Thu, Nov 09, 2023 at 05:30:20PM CET, arkadiusz.kubalewski@intel.com
>>>>wrote:
>>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>>Sent: Thursday, November 9, 2023 2:19 PM
>>>>>>
>>>>>>Thu, Nov 09, 2023 at 01:20:48PM CET, arkadiusz.kubalewski@intel.com
>>>>>>wrote:
>>>>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>>Sent: Wednesday, November 8, 2023 3:30 PM
>>>>>>>>
>>>>>>>>Wed, Nov 08, 2023 at 11:32:25AM CET, arkadiusz.kubalewski@intel.com
>>>>>>>>wrote:
>>>>>>>>>When a kernel module is unbound but the pin resources were not
>>>>>>>>>entirely
>>>>>>>>>freed (other kernel module instance have had kept the reference to
>>>>>>>>>that
>>>>>>>>>pin), and kernel module is again bound, the pin properties would not
>>>>>>>>>be
>>>>>>>>>updated (the properties are only assigned when memory for the pin is
>>>>>>>>>allocated), prop pointer still points to the kernel module memory of
>>>>>>>>>the kernel module which was deallocated on the unbind.
>>>>>>>>>
>>>>>>>>>If the pin dump is invoked in this state, the result is a kernel
>>>>>>>>>crash.
>>>>>>>>>Prevent the crash by storing persistent pin properties in dpll
>>>>>>>>>subsystem,
>>>>>>>>>copy the content from the kernel module when pin is allocated,
>>>>>>>>>instead
>>>>>>>>>of
>>>>>>>>>using memory of the kernel module.
>>>>>>>>>
>>>>>>>>>Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base
>>>>>>>>>functions")
>>>>>>>>>Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base
>>>>>>>>>functions")
>>>>>>>>>Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
>>>>>>>>>---
>>>>>>>>> drivers/dpll/dpll_core.c | 4 ++--
>>>>>>>>> drivers/dpll/dpll_core.h | 4 ++--
>>>>>>>>> drivers/dpll/dpll_netlink.c | 28 ++++++++++++++--------------
>>>>>>>>> 3 files changed, 18 insertions(+), 18 deletions(-)
>>>>>>>>>
>>>>>>>>>diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c
>>>>>>>>>index 3568149b9562..4077b562ba3b 100644
>>>>>>>>>--- a/drivers/dpll/dpll_core.c
>>>>>>>>>+++ b/drivers/dpll/dpll_core.c
>>>>>>>>>@@ -442,7 +442,7 @@ dpll_pin_alloc(u64 clock_id, u32 pin_idx, struct
>>>>>>>>>module *module,
>>>>>>>>> ret = -EINVAL;
>>>>>>>>> goto err;
>>>>>>>>> }
>>>>>>>>>- pin->prop = prop;
>>>>>>>>>+ memcpy(&pin->prop, prop, sizeof(pin->prop));
>>>>>>>>
>>>>>>>>Odd, you don't care about the pointer within this structure?
>>>>>>>>
>>>>>>>
>>>>>>>Well, true. Need a fix.
>>>>>>>Wondering if copying idea is better than just assigning prop pointer
>>>>>>>on
>>>>>>>each call to dpll_pin_get(..) function (when pin already exists)?
>>>>>>
>>>>>>Not sure what do you mean. Examples please.
>>>>>>
>>>>>
>>>>>Sure,
>>>>>
>>>>>Basically this change:
>>>>>
>>>>>diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c
>>>>>index ae884b92d68c..06b72d5877c3 100644
>>>>>--- a/drivers/dpll/dpll_core.c
>>>>>+++ b/drivers/dpll/dpll_core.c
>>>>>@@ -483,6 +483,7 @@ dpll_pin_get(u64 clock_id, u32 pin_idx, struct
>>>>>module
>>>>>*module,
>>>>> pos->pin_idx == pin_idx &&
>>>>> pos->module == module) {
>>>>> ret = pos;
>>>>>+ pos->prop = prop;
>>>>> refcount_inc(&ret->refcount);
>>>>> break;
>>>>> }
>>>>>
>>>>>would replace whole of this patch changes, although seems a bit hacky.
>>>>
>>>>Or event better, as I suggested in the other patch reply, resolve this
>>>>internally in the driver registering things only when they are valid.
>>>>Much better then to hack anything in dpll core.
>>>>
>>>
>>>This approach seemed to me hacky, that is why started with coping the
>>>data.
>>>It is not about registering, rather about unregistering on driver
>>>unbind, which brakes things, and currently cannot be recovered in
>>>described case.
>>
>>Sure it can. PF0 unbind-> internal notification-> PF1 unregisters all
>>related object. Very clean and simple.
>>
>
>What you are suggesting is:
>- special purpose bus in the driver,
No, it is a simple notificator. Very common infra over the whole
kernel code.
>- dpll-related,
Is this the only thing that PF0 is special with? Perhaps you can
utilize this for other features as well, since your fw design is like
this.
>- not needed,
>- prone for errors.
>
>The dpll subsystem is here to make driver life easier.
No, the subsystem is never here to handle device specific issues. And
your PF0 dependency is very clearly something device specific. Don't
pollute the dpll subsystem with workaround to handle specific device
needs. create/register the dplls objects from your driver only when it
is valid to do so. Make sure the lifetime of such object stays in
the scope of validity. Handle that in the driver. Very clear and simple.
Thanks!
>
>Thank you!
>Arkadiusz
>
>>
>>>
>>>Thank you!
>>>Arkadiusz
>>>
>>>>
>>>>>
>>>>>Thank you!
>>>>>Arkadiusz
>>>>>
>>>>>>
>>>>>>>
>>>>>>>Thank you!
>>>>>>>Arkadiusz
>>>>>>>
>>>>>>>>
>>>>>>>>> refcount_set(&pin->refcount, 1);
>>>>>>>>> xa_init_flags(&pin->dpll_refs, XA_FLAGS_ALLOC);
>>>>>>>>> xa_init_flags(&pin->parent_refs, XA_FLAGS_ALLOC);
>>>>>>>>>@@ -634,7 +634,7 @@ int dpll_pin_on_pin_register(struct dpll_pin
>>>>>>>>>*parent,
>>>>>>>>>struct dpll_pin *pin,
>>>>>>>>> unsigned long i, stop;
>>>>>>>>> int ret;
>>>>>>>>>
>>>>>>>>>- if (WARN_ON(parent->prop->type != DPLL_PIN_TYPE_MUX))
>>>>>>>>>+ if (WARN_ON(parent->prop.type != DPLL_PIN_TYPE_MUX))
>>>>>>>>> return -EINVAL;
>>>>>>>>>
>>>>>>>>> if (WARN_ON(!ops) ||
>>>>>>>>>diff --git a/drivers/dpll/dpll_core.h b/drivers/dpll/dpll_core.h
>>>>>>>>>index 5585873c5c1b..717f715015c7 100644
>>>>>>>>>--- a/drivers/dpll/dpll_core.h
>>>>>>>>>+++ b/drivers/dpll/dpll_core.h
>>>>>>>>>@@ -44,7 +44,7 @@ struct dpll_device {
>>>>>>>>> * @module: module of creator
>>>>>>>>> * @dpll_refs: hold referencees to dplls pin was registered
>>>>>>>>>with
>>>>>>>>> * @parent_refs: hold references to parent pins pin was
>>registered
>>>>>>>>>with
>>>>>>>>>- * @prop: pointer to pin properties given by registerer
>>>>>>>>>+ * @prop: pin properties copied from the registerer
>>>>>>>>> * @rclk_dev_name: holds name of device when pin can recover
>>>>>>>>>clock
>>>>>>>>>from it
>>>>>>>>> * @refcount: refcount
>>>>>>>>> **/
>>>>>>>>>@@ -55,7 +55,7 @@ struct dpll_pin {
>>>>>>>>> struct module *module;
>>>>>>>>> struct xarray dpll_refs;
>>>>>>>>> struct xarray parent_refs;
>>>>>>>>>- const struct dpll_pin_properties *prop;
>>>>>>>>>+ struct dpll_pin_properties prop;
>>>>>>>>> refcount_t refcount;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>>diff --git a/drivers/dpll/dpll_netlink.c
>>b/drivers/dpll/dpll_netlink.c
>>>>>>>>>index 93fc6c4b8a78..963bbbbe6660 100644
>>>>>>>>>--- a/drivers/dpll/dpll_netlink.c
>>>>>>>>>+++ b/drivers/dpll/dpll_netlink.c
>>>>>>>>>@@ -278,17 +278,17 @@ dpll_msg_add_pin_freq(struct sk_buff *msg,
>>>>>>>>>struct
>>>>>>>>>dpll_pin *pin,
>>>>>>>>> if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY, sizeof(freq),
>>>>>>>>>&freq,
>>>>>>>>> DPLL_A_PIN_PAD))
>>>>>>>>> return -EMSGSIZE;
>>>>>>>>>- for (fs = 0; fs < pin->prop->freq_supported_num; fs++) {
>>>>>>>>>+ for (fs = 0; fs < pin->prop.freq_supported_num; fs++) {
>>>>>>>>> nest = nla_nest_start(msg,
>>>>>>>>>DPLL_A_PIN_FREQUENCY_SUPPORTED);
>>>>>>>>> if (!nest)
>>>>>>>>> return -EMSGSIZE;
>>>>>>>>>- freq = pin->prop->freq_supported[fs].min;
>>>>>>>>>+ freq = pin->prop.freq_supported[fs].min;
>>>>>>>>> if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY_MIN,
>>>>>>>>>sizeof(freq),
>>>>>>>>> &freq, DPLL_A_PIN_PAD)) {
>>>>>>>>> nla_nest_cancel(msg, nest);
>>>>>>>>> return -EMSGSIZE;
>>>>>>>>> }
>>>>>>>>>- freq = pin->prop->freq_supported[fs].max;
>>>>>>>>>+ freq = pin->prop.freq_supported[fs].max;
>>>>>>>>> if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY_MAX,
>>>>>>>>>sizeof(freq),
>>>>>>>>> &freq, DPLL_A_PIN_PAD)) {
>>>>>>>>> nla_nest_cancel(msg, nest);
>>>>>>>>>@@ -304,9 +304,9 @@ static bool dpll_pin_is_freq_supported(struct
>>>>>>>>>dpll_pin
>>>>>>>>>*pin, u32 freq)
>>>>>>>>> {
>>>>>>>>> int fs;
>>>>>>>>>
>>>>>>>>>- for (fs = 0; fs < pin->prop->freq_supported_num; fs++)
>>>>>>>>>- if (freq >= pin->prop->freq_supported[fs].min &&
>>>>>>>>>- freq <= pin->prop->freq_supported[fs].max)
>>>>>>>>>+ for (fs = 0; fs < pin->prop.freq_supported_num; fs++)
>>>>>>>>>+ if (freq >= pin->prop.freq_supported[fs].min &&
>>>>>>>>>+ freq <= pin->prop.freq_supported[fs].max)
>>>>>>>>> return true;
>>>>>>>>> return false;
>>>>>>>>> }
>>>>>>>>>@@ -403,7 +403,7 @@ static int
>>>>>>>>> dpll_cmd_pin_get_one(struct sk_buff *msg, struct dpll_pin *pin,
>>>>>>>>> struct netlink_ext_ack *extack)
>>>>>>>>> {
>>>>>>>>>- const struct dpll_pin_properties *prop = pin->prop;
>>>>>>>>>+ const struct dpll_pin_properties *prop = &pin->prop;
>>>>>>>>> struct dpll_pin_ref *ref;
>>>>>>>>> int ret;
>>>>>>>>>
>>>>>>>>>@@ -696,7 +696,7 @@ dpll_pin_on_pin_state_set(struct dpll_pin *pin,
>>>>>>>>>u32
>>>>>>>>>parent_idx,
>>>>>>>>> int ret;
>>>>>>>>>
>>>>>>>>> if (!(DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE &
>>>>>>>>>- pin->prop->capabilities)) {
>>>>>>>>>+ pin->prop.capabilities)) {
>>>>>>>>> NL_SET_ERR_MSG(extack, "state changing is not allowed");
>>>>>>>>> return -EOPNOTSUPP;
>>>>>>>>> }
>>>>>>>>>@@ -732,7 +732,7 @@ dpll_pin_state_set(struct dpll_device *dpll,
>>>>>>>>>struct
>>>>>>>>>dpll_pin *pin,
>>>>>>>>> int ret;
>>>>>>>>>
>>>>>>>>> if (!(DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE &
>>>>>>>>>- pin->prop->capabilities)) {
>>>>>>>>>+ pin->prop.capabilities)) {
>>>>>>>>> NL_SET_ERR_MSG(extack, "state changing is not allowed");
>>>>>>>>> return -EOPNOTSUPP;
>>>>>>>>> }
>>>>>>>>>@@ -759,7 +759,7 @@ dpll_pin_prio_set(struct dpll_device *dpll,
>>struct
>>>>>>>>>dpll_pin *pin,
>>>>>>>>> int ret;
>>>>>>>>>
>>>>>>>>> if (!(DPLL_PIN_CAPABILITIES_PRIORITY_CAN_CHANGE &
>>>>>>>>>- pin->prop->capabilities)) {
>>>>>>>>>+ pin->prop.capabilities)) {
>>>>>>>>> NL_SET_ERR_MSG(extack, "prio changing is not allowed");
>>>>>>>>> return -EOPNOTSUPP;
>>>>>>>>> }
>>>>>>>>>@@ -787,7 +787,7 @@ dpll_pin_direction_set(struct dpll_pin *pin,
>>>>>>>>>struct
>>>>>>>>>dpll_device *dpll,
>>>>>>>>> int ret;
>>>>>>>>>
>>>>>>>>> if (!(DPLL_PIN_CAPABILITIES_DIRECTION_CAN_CHANGE &
>>>>>>>>>- pin->prop->capabilities)) {
>>>>>>>>>+ pin->prop.capabilities)) {
>>>>>>>>> NL_SET_ERR_MSG(extack, "direction changing is not
>>>>>>>>>allowed");
>>>>>>>>> return -EOPNOTSUPP;
>>>>>>>>> }
>>>>>>>>>@@ -817,8 +817,8 @@ dpll_pin_phase_adj_set(struct dpll_pin *pin,
>>>>>>>>>struct
>>>>>>>>>nlattr *phase_adj_attr,
>>>>>>>>> int ret;
>>>>>>>>>
>>>>>>>>> phase_adj = nla_get_s32(phase_adj_attr);
>>>>>>>>>- if (phase_adj > pin->prop->phase_range.max ||
>>>>>>>>>- phase_adj < pin->prop->phase_range.min) {
>>>>>>>>>+ if (phase_adj > pin->prop.phase_range.max ||
>>>>>>>>>+ phase_adj < pin->prop.phase_range.min) {
>>>>>>>>> NL_SET_ERR_MSG_ATTR(extack, phase_adj_attr,
>>>>>>>>> "phase adjust value not supported");
>>>>>>>>> return -EINVAL;
>>>>>>>>>@@ -999,7 +999,7 @@ dpll_pin_find(u64 clock_id, struct nlattr
>>>>>>>>>*mod_name_attr,
>>>>>>>>> unsigned long i;
>>>>>>>>>
>>>>>>>>> xa_for_each_marked(&dpll_pin_xa, i, pin, DPLL_REGISTERED) {
>>>>>>>>>- prop = pin->prop;
>>>>>>>>>+ prop = &pin->prop;
>>>>>>>>> cid_match = clock_id ? pin->clock_id == clock_id : true;
>>>>>>>>> mod_match = mod_name_attr && module_name(pin->module) ?
>>>>>>>>> !nla_strcmp(mod_name_attr,
>>>>>>>>>--
>>>>>>>>>2.38.1
>>>>>>>>>
>>>>>>>
>>>
^ permalink raw reply
* Re: [PATCH v2 1/3] net: phy: at803x: add QCA8084 ethernet phy support
From: Jie Luo @ 2023-11-10 9:56 UTC (permalink / raw)
To: Maxime Chevallier
Cc: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
linux-kernel
In-Reply-To: <20231110103328.0bc3d28f@fedora>
On 11/10/2023 5:33 PM, Maxime Chevallier wrote:
> On Fri, 10 Nov 2023 17:17:58 +0800
> Jie Luo <quic_luoj@quicinc.com> wrote:
>
>> On 11/10/2023 4:53 PM, Jie Luo wrote:
>>>
>>>
>>> On 11/9/2023 5:16 PM, Maxime Chevallier wrote:
>>>> Hello,
>>>>
>>>> On Thu, 9 Nov 2023 16:32:36 +0800
>>>> Jie Luo <quic_luoj@quicinc.com> wrote:
>>>>
>>>> [...]
>>>>
>>>>>> What I understand from this is that this PHY can be used either as a
>>>>>> switch, in which case port 4 would be connected to the host interface
>>>>>> at up to 2.5G, or as a quad-phy, but since it uses QUSGMII the link
>>>>>> speed would be limited to 1G per-port, is that correct ?
>>>>>
>>>>> When the PHY works on the interface mode QUSGMII for quad-phy, all 4
>>>>> PHYs can support to the max link speed 2.5G, actually the PHY can
>>>>> support to max link speed 2.5G for all supported interface modes
>>>>> including qusgmii and sgmii.
>>>>
>>>> I'm a bit confused then, as the USGMII spec says that Quad USGMII really
>>>> is for quad 10/100/1000 speeds, using 10b/8b encoding.
>>>>
>>>> Aren't you using the USXGMII mode instead, which can convey 4 x 2.5Gbps
>>>> with 66b/64b encoding ?
>>>>
>>>> Thanks,
>>>>
>>>> Maxime
>>>
>>> Hi Maxime,
>>> Yes, for quad PHY mode, it is using 66b/64 encoding.
>>>
>>> it seems that PHY_INTERFACE_MODE_USXGMII is for single port,
>>> so i take the interface name PHY_INTERFACE_MODE_QUSGMII for
>>> quad PHYs here.
>>>
>>> can we apply PHY_INTERFACE_MODE_USXGMII to quad PHYs in this
>>> case(qca8084 quad PHY mode)?
>>>
>>> Thanks,
>>> Jie.
>>
>> one more thing, if we use the PHY_INTERFACE_MODE_USXGMII for
>> the quad PHY here, the MAC serdes can't distinguish the actual
>> mode PHY_INTERFACE_MODE_USXGMII and 10G-QXGMII(qca8084 quad phy mode),
>> the MAC serdes has the different configurations for usxgmii(10g single
>> port) and qxsgmii(quad PHY).
>
> Yes you do need a way to know which mode to use, what I'm wondering is
> that the usxgmii spec actually defines something like 9 different modes
> ( 1/2/4/8 ports, with a total bandwidth ranging from 2.5Gbps to 20 Gbps
> ), should we define a phy mode for all of these variants, or should we
> have another way of getting the mode variant (like, saying I want to
> use usxgmii, in 4 ports mode, with the serdes at 10.3125Gbps).
>
> That being said, QUSGMII already exists to define a specific variant of
> USGMII, so maybe adding 10G-QXGMII is fine...
Yes, Maxime, I agree with this solution, the name 10G-QXGMII is exactly
the working mode of qca8084 quad phy mode.
>
> Also, net-next is still currently closed.
Ok, thanks for this reminder.
^ permalink raw reply
* Re: [RFC PATCH v3 04/12] netdev: support binding dma-buf to netdevice
From: Mina Almasry @ 2023-11-10 9:45 UTC (permalink / raw)
To: Yunsheng Lin
Cc: Paolo Abeni, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig, David S. Miller,
Eric Dumazet, Jakub Kicinski, Jesper Dangaard Brouer,
Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
Kaiyuan Zhang
In-Reply-To: <aec0f586-c3b9-8da8-6a39-f313105267f8@huawei.com>
On Thu, Nov 9, 2023 at 11:38 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/11/10 10:59, Mina Almasry wrote:
> > On Thu, Nov 9, 2023 at 12:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>
> >> I'm trying to wrap my head around the whole infra... the above line is
> >> confusing. Why do you increment dma_addr? it will be re-initialized in
> >> the next iteration.
> >>
> >
> > That is just a mistake, sorry. Will remove this increment.
>
> You seems to be combining comments in different thread and replying in
> one thread, I am not sure that is a good practice and I almost missed the
> reply below as I don't seem to be cc'ed.
>
Sorry about that.
> >
> > On Thu, Nov 9, 2023 at 1:29 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:> >>>
> >>>>> gen_pool_destroy BUG_ON() if it's not empty at the time of destroying.
> >>>>> Technically that should never happen, because
> >>>>> __netdev_devmem_binding_free() should only be called when the refcount
> >>>>> hits 0, so all the chunks have been freed back to the gen_pool. But,
> >>>>> just in case, I don't want to crash the server just because I'm
> >>>>> leaking a chunk... this is a bit of defensive programming that is
> >>>>> typically frowned upon, but the behavior of gen_pool is so severe I
> >>>>> think the WARN() + check is warranted here.
> >>>>
> >>>> It seems it is pretty normal for the above to happen nowadays because of
> >>>> retransmits timeouts, NAPI defer schemes mentioned below:
> >>>>
> >>>> https://lkml.kernel.org/netdev/168269854650.2191653.8465259808498269815.stgit@firesoul/
> >>>>
> >>>> And currently page pool core handles that by using a workqueue.
> >>>
> >>> Forgive me but I'm not understanding the concern here.
> >>>
> >>> __netdev_devmem_binding_free() is called when binding->ref hits 0.
> >>>
> >>> binding->ref is incremented when an iov slice of the dma-buf is
> >>> allocated, and decremented when an iov is freed. So,
> >>> __netdev_devmem_binding_free() can't really be called unless all the
> >>> iovs have been freed, and gen_pool_size() == gen_pool_avail(),
> >>> regardless of what's happening on the page_pool side of things, right?
> >>
> >> I seems to misunderstand it. In that case, it seems to be about
> >> defensive programming like other checking.
> >>
> >> By looking at it more closely, it seems napi_frag_unref() call
> >> page_pool_page_put_many() directly, which means devmem seems to
> >> be bypassing the napi_safe optimization.
> >>
> >> Can napi_frag_unref() reuse napi_pp_put_page() in order to reuse
> >> the napi_safe optimization?
> >>
> >
> > I think it already does. page_pool_page_put_many() is only called if
> > !recycle or !napi_pp_put_page(). In that case
> > page_pool_page_put_many() is just a replacement for put_page(),
> > because this 'page' may be an iov.
>
> Is there a reason why not calling napi_pp_put_page() for devmem too
> instead of calling page_pool_page_put_many()? mem provider has a
> 'release_page' ops, calling page_pool_page_put_many() directly here
> seems to be bypassing the 'release_page' ops, which means devmem is
> bypassing most of the main features of page pool.
>
I think we're still calling napi_pp_put_page() as normal:
/**
@@ -3441,13 +3466,13 @@ bool napi_pp_put_page(struct page *page, bool
napi_safe);
static inline void
napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
{
- struct page *page = skb_frag_page(frag);
-
#ifdef CONFIG_PAGE_POOL
- if (recycle && napi_pp_put_page(page, napi_safe))
+ if (recycle && napi_pp_put_page(frag->bv_page, napi_safe))
return;
+ page_pool_page_put_many(frag->bv_page, 1);
+#else
+ put_page(skb_frag_page(frag));
#endif
- put_page(page);
}
The code change here is to replace put_page() with
page_pool_page_put_many(), only, because bv_page may be a
page_pool_iov, so we need to use page_pool_page_put_many() which
handles page_pool_iov correcly. I did not change whether or not
napi_pp_put_page() is called. It's still called if recycle==true.
> As far as I can tell, the main features of page pool:
> 1. Allow lockless allocation and freeing in pool->alloc cache by
> utilizing NAPI non-concurrent context.
> 2. Allow concurrent allocation and freeing in pool->ring cache by
> utilizing ptr_ring.
> 3. Allow dma map/unmap and cache sync optimization.
> 4. Allow detailed stats logging and tracing.
> 5. Allow some bulk allocation and freeing.
> 6. support both skb packet and xdp frame.
>
> I am wondering what is the main features that devmem is utilizing
> by intergrating into page pool?
>
> It seems the driver can just call netdev_alloc_devmem() and
> napi_frag_unref() can call netdev_free_devmem() directly without
> intergrating into page pool and it should just works too?
>
> Maybe we should consider creating a new thin layer, in order to
> demux to page pool, devmem or other mem type if my suggestion does
> not work out too?
>
I went through this discussion with Jesper on RFC v2 in this thread:
https://lore.kernel.org/netdev/CAHS8izOVJGJH5WF68OsRWFKJid1_huzzUK+hpKbLcL4pSOD1Jw@mail.gmail.com/T/#ma9285d53735d82cc315717db67a1796477c89d86
which culminates with that email where he seems on board with the
change from a performance POV, and seems on board with hiding the
memory type implementation from the drivers. That thread fully goes
over the tradeoffs of integrating over the page pool or creating new
ones. Integrating with the page pool abstracts most of the devmem
implementation (and other memory types) from the driver. It reuses
page pool features like page recycling for example.
--
Thanks,
Mina
^ permalink raw reply
* [PATCH bpf-next] bpf, sockmap: Bundle psock->sk_redir and redir_ingress into a tagged pointer
From: Pengcheng Yang @ 2023-11-10 9:41 UTC (permalink / raw)
To: Jakub Sitnicki, John Fastabend, bpf, netdev; +Cc: Pengcheng Yang
Like skb->_sk_redir, we bundle the sock redirect pointer and
the ingress bit to manage them together.
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/87cz97cnz8.fsf@cloudflare.com
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
---
include/linux/skmsg.h | 30 ++++++++++++++++++++++++++++--
net/core/skmsg.c | 18 ++++++++++--------
net/ipv4/tcp_bpf.c | 13 +++++++------
net/tls/tls_sw.c | 11 ++++++-----
4 files changed, 51 insertions(+), 21 deletions(-)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index c1637515a8a4..ae021f511f46 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -78,11 +78,10 @@ struct sk_psock_work_state {
struct sk_psock {
struct sock *sk;
- struct sock *sk_redir;
+ unsigned long _sk_redir;
u32 apply_bytes;
u32 cork_bytes;
u32 eval;
- bool redir_ingress; /* undefined if sk_redir is null */
struct sk_msg *cork;
struct sk_psock_progs progs;
#if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
@@ -283,6 +282,33 @@ static inline struct sk_psock *sk_psock(const struct sock *sk)
SK_USER_DATA_PSOCK);
}
+static inline bool sk_psock_ingress(const struct sk_psock *psock)
+{
+ unsigned long sk_redir = psock->_sk_redir;
+
+ return sk_redir & BPF_F_INGRESS;
+}
+
+static inline void sk_psock_set_redir(struct sk_psock *psock, struct sock *sk_redir,
+ bool ingress)
+{
+ psock->_sk_redir = (unsigned long)sk_redir;
+ if (ingress)
+ psock->_sk_redir |= BPF_F_INGRESS;
+}
+
+static inline struct sock *sk_psock_get_redir(struct sk_psock *psock)
+{
+ unsigned long sk_redir = psock->_sk_redir;
+
+ return (struct sock *)(sk_redir & ~(BPF_F_INGRESS));
+}
+
+static inline void sk_psock_clear_redir(struct sk_psock *psock)
+{
+ psock->_sk_redir = 0;
+}
+
static inline void sk_psock_set_state(struct sk_psock *psock,
enum sk_psock_state_bits bit)
{
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 6c31eefbd777..d994621f1f95 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -811,6 +811,7 @@ static void sk_psock_destroy(struct work_struct *work)
{
struct sk_psock *psock = container_of(to_rcu_work(work),
struct sk_psock, rwork);
+ struct sock *sk_redir = sk_psock_get_redir(psock);
/* No sk_callback_lock since already detached. */
sk_psock_done_strp(psock);
@@ -824,8 +825,8 @@ static void sk_psock_destroy(struct work_struct *work)
sk_psock_link_destroy(psock);
sk_psock_cork_free(psock);
- if (psock->sk_redir)
- sock_put(psock->sk_redir);
+ if (sk_redir)
+ sock_put(sk_redir);
sock_put(psock->sk);
kfree(psock);
}
@@ -865,6 +866,7 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
struct sk_msg *msg)
{
struct bpf_prog *prog;
+ struct sock *sk_redir;
int ret;
rcu_read_lock();
@@ -880,17 +882,17 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
ret = sk_psock_map_verd(ret, msg->sk_redir);
psock->apply_bytes = msg->apply_bytes;
if (ret == __SK_REDIRECT) {
- if (psock->sk_redir) {
- sock_put(psock->sk_redir);
- psock->sk_redir = NULL;
+ sk_redir = sk_psock_get_redir(psock);
+ if (sk_redir) {
+ sock_put(sk_redir);
+ sk_psock_clear_redir(psock);
}
if (!msg->sk_redir) {
ret = __SK_DROP;
goto out;
}
- psock->redir_ingress = sk_msg_to_ingress(msg);
- psock->sk_redir = msg->sk_redir;
- sock_hold(psock->sk_redir);
+ sk_psock_set_redir(psock, msg->sk_redir, sk_msg_to_ingress(msg));
+ sock_hold(msg->sk_redir);
}
out:
rcu_read_unlock();
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 53b0d62fd2c2..b3c847dc87dc 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -427,14 +427,14 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
sk_msg_apply_bytes(psock, tosend);
break;
case __SK_REDIRECT:
- redir_ingress = psock->redir_ingress;
- sk_redir = psock->sk_redir;
+ redir_ingress = sk_psock_ingress(psock);
+ sk_redir = sk_psock_get_redir(psock);
sk_msg_apply_bytes(psock, tosend);
if (!psock->apply_bytes) {
/* Clean up before releasing the sock lock. */
eval = psock->eval;
psock->eval = __SK_NONE;
- psock->sk_redir = NULL;
+ sk_psock_clear_redir(psock);
}
if (psock->cork) {
cork = true;
@@ -476,9 +476,10 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
if (likely(!ret)) {
if (!psock->apply_bytes) {
psock->eval = __SK_NONE;
- if (psock->sk_redir) {
- sock_put(psock->sk_redir);
- psock->sk_redir = NULL;
+ sk_redir = sk_psock_get_redir(psock);
+ if (sk_redir) {
+ sock_put(sk_redir);
+ sk_psock_clear_redir(psock);
}
}
if (msg &&
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index e9d1e83a859d..c91cd07c1285 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -854,8 +854,8 @@ static int bpf_exec_tx_verdict(struct sk_msg *msg, struct sock *sk,
}
break;
case __SK_REDIRECT:
- redir_ingress = psock->redir_ingress;
- sk_redir = psock->sk_redir;
+ redir_ingress = sk_psock_ingress(psock);
+ sk_redir = sk_psock_get_redir(psock);
memcpy(&msg_redir, msg, sizeof(*msg));
if (msg->apply_bytes < send)
msg->apply_bytes = 0;
@@ -898,9 +898,10 @@ static int bpf_exec_tx_verdict(struct sk_msg *msg, struct sock *sk,
}
if (reset_eval) {
psock->eval = __SK_NONE;
- if (psock->sk_redir) {
- sock_put(psock->sk_redir);
- psock->sk_redir = NULL;
+ sk_redir = sk_psock_get_redir(psock);
+ if (sk_redir) {
+ sock_put(sk_redir);
+ sk_psock_clear_redir(psock);
}
}
if (rec)
--
2.38.1
^ permalink raw reply related
* [PATCH V2 net 7/7] net: hns3: fix VF wrong speed and duplex issue
From: Jijie Shao @ 2023-11-10 9:37 UTC (permalink / raw)
To: yisen.zhuang, salil.mehta, davem, edumazet, kuba, pabeni
Cc: shenjian15, wangjie125, liuyonglong, shaojijie, netdev,
linux-kernel
In-Reply-To: <20231110093713.1895949-1-shaojijie@huawei.com>
If PF is down, firmware will returns 10 Mbit/s rate and half-duplex mode
when PF queries the port information from firmware.
After imp reset command is executed, PF status changes to down,
and PF will query link status and updates port information
from firmware in a periodic scheduled task.
However, there is a low probability that port information is updated
when PF is down, and then PF link status changes to up.
In this case, PF synchronizes incorrect rate and duplex mode to VF.
This patch fixes it by updating port information before
PF synchronizes the rate and duplex to the VF
when PF changes to up.
Fixes: 18b6e31f8bf4 ("net: hns3: PF add support for pushing link status to VFs")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
---
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index c393b4ee4a32..5ea9e59569ef 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -61,6 +61,7 @@ static void hclge_sync_fd_table(struct hclge_dev *hdev);
static void hclge_update_fec_stats(struct hclge_dev *hdev);
static int hclge_mac_link_status_wait(struct hclge_dev *hdev, int link_ret,
int wait_cnt);
+static int hclge_update_port_info(struct hclge_dev *hdev);
static struct hnae3_ae_algo ae_algo;
@@ -3041,6 +3042,9 @@ static void hclge_update_link_status(struct hclge_dev *hdev)
if (state != hdev->hw.mac.link) {
hdev->hw.mac.link = state;
+ if (state == HCLGE_LINK_STATUS_UP)
+ hclge_update_port_info(hdev);
+
client->ops->link_status_change(handle, state);
hclge_config_mac_tnl_int(hdev, state);
if (rclient && rclient->ops->link_status_change)
--
2.30.0
^ permalink raw reply related
* [PATCH V2 net 6/7] net: hns3: fix VF reset fail issue
From: Jijie Shao @ 2023-11-10 9:37 UTC (permalink / raw)
To: yisen.zhuang, salil.mehta, davem, edumazet, kuba, pabeni
Cc: shenjian15, wangjie125, liuyonglong, shaojijie, netdev,
linux-kernel
In-Reply-To: <20231110093713.1895949-1-shaojijie@huawei.com>
Currently the reset process in hns3 and firmware watchdog init process is
asynchronous. We think firmware watchdog initialization is completed
before VF clear the interrupt source. However, firmware initialization
may not complete early. So VF will receive multiple reset interrupts
and fail to reset.
So we add delay before VF interrupt source and 5 ms delay
is enough to avoid second reset interrupt.
Fixes: 427900d27d86 ("net: hns3: fix the timing issue of VF clearing interrupt sources")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
---
ChangeLog:
v1 -> v2:
- Use timer_list to replace the 5 ms delay in irq handle suggested by Paolo
v1: https://lore.kernel.org/all/20231028025917.314305-7-shaojijie@huawei.com/
---
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 14 +++++++++++++-
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h | 1 +
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 1c62e58ff6d8..0aa9beefd1c7 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1981,8 +1981,18 @@ static enum hclgevf_evt_cause hclgevf_check_evt_cause(struct hclgevf_dev *hdev,
return HCLGEVF_VECTOR0_EVENT_OTHER;
}
+static void hclgevf_reset_timer(struct timer_list *t)
+{
+ struct hclgevf_dev *hdev = from_timer(hdev, t, reset_timer);
+
+ hclgevf_clear_event_cause(hdev, HCLGEVF_VECTOR0_EVENT_RST);
+ hclgevf_reset_task_schedule(hdev);
+}
+
static irqreturn_t hclgevf_misc_irq_handle(int irq, void *data)
{
+#define HCLGEVF_RESET_DELAY 5
+
enum hclgevf_evt_cause event_cause;
struct hclgevf_dev *hdev = data;
u32 clearval;
@@ -1994,7 +2004,8 @@ static irqreturn_t hclgevf_misc_irq_handle(int irq, void *data)
switch (event_cause) {
case HCLGEVF_VECTOR0_EVENT_RST:
- hclgevf_reset_task_schedule(hdev);
+ mod_timer(&hdev->reset_timer,
+ jiffies + msecs_to_jiffies(HCLGEVF_RESET_DELAY));
break;
case HCLGEVF_VECTOR0_EVENT_MBX:
hclgevf_mbx_handler(hdev);
@@ -2937,6 +2948,7 @@ static int hclgevf_init_hdev(struct hclgevf_dev *hdev)
HCLGEVF_DRIVER_NAME);
hclgevf_task_schedule(hdev, round_jiffies_relative(HZ));
+ timer_setup(&hdev->reset_timer, hclgevf_reset_timer, 0);
return 0;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
index 81c16b8c8da2..a73f2bf3a56a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
@@ -219,6 +219,7 @@ struct hclgevf_dev {
enum hnae3_reset_type reset_level;
unsigned long reset_pending;
enum hnae3_reset_type reset_type;
+ struct timer_list reset_timer;
#define HCLGEVF_RESET_REQUESTED 0
#define HCLGEVF_RESET_PENDING 1
--
2.30.0
^ permalink raw reply related
* [PATCH V2 net 1/7] net: hns3: fix add VLAN fail issue
From: Jijie Shao @ 2023-11-10 9:37 UTC (permalink / raw)
To: yisen.zhuang, salil.mehta, davem, edumazet, kuba, pabeni
Cc: shenjian15, wangjie125, liuyonglong, shaojijie, netdev,
linux-kernel
In-Reply-To: <20231110093713.1895949-1-shaojijie@huawei.com>
From: Jian Shen <shenjian15@huawei.com>
The hclge_sync_vlan_filter is called in periodic task,
trying to remove VLAN from vlan_del_fail_bmap. It can
be concurrence with VLAN adding operation from user.
So once user failed to delete a VLAN id, and add it
again soon, it may be removed by the periodic task,
which may cause the software configuration being
inconsistent with hardware. So add mutex handling
to avoid this.
user hns3 driver
periodic task
│
add vlan 10 ───── hns3_vlan_rx_add_vid │
│ (suppose success) │
│ │
del vlan 10 ───── hns3_vlan_rx_kill_vid │
│ (suppose fail,add to │
│ vlan_del_fail_bmap) │
│ │
add vlan 10 ───── hns3_vlan_rx_add_vid │
(suppose success) │
foreach vlan_del_fail_bmp
del vlan 10
Fixes: fe4144d47eef ("net: hns3: sync VLAN filter entries when kill VLAN ID failed")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
---
ChangeLog:
v1 -> v2:
- Fix 'hclge_rm_vport_vlan_table()' happen with the vport_lock unlocked suggested by Paolo
v1: https://lore.kernel.org/all/20231028025917.314305-2-shaojijie@huawei.com/
---
.../hisilicon/hns3/hns3pf/hclge_main.c | 28 +++++++++++++------
.../hisilicon/hns3/hns3vf/hclgevf_main.c | 11 ++++++--
2 files changed, 29 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 66e5807903a0..e22279e5d43f 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -10025,8 +10025,6 @@ static void hclge_rm_vport_vlan_table(struct hclge_vport *vport, u16 vlan_id,
struct hclge_vport_vlan_cfg *vlan, *tmp;
struct hclge_dev *hdev = vport->back;
- mutex_lock(&hdev->vport_lock);
-
list_for_each_entry_safe(vlan, tmp, &vport->vlan_list, node) {
if (vlan->vlan_id == vlan_id) {
if (is_write_tbl && vlan->hd_tbl_status)
@@ -10041,8 +10039,6 @@ static void hclge_rm_vport_vlan_table(struct hclge_vport *vport, u16 vlan_id,
break;
}
}
-
- mutex_unlock(&hdev->vport_lock);
}
void hclge_rm_vport_all_vlan_table(struct hclge_vport *vport, bool is_del_list)
@@ -10451,11 +10447,16 @@ int hclge_set_vlan_filter(struct hnae3_handle *handle, __be16 proto,
* handle mailbox. Just record the vlan id, and remove it after
* reset finished.
*/
+ mutex_lock(&hdev->vport_lock);
if ((test_bit(HCLGE_STATE_RST_HANDLING, &hdev->state) ||
test_bit(HCLGE_STATE_RST_FAIL, &hdev->state)) && is_kill) {
set_bit(vlan_id, vport->vlan_del_fail_bmap);
+ mutex_unlock(&hdev->vport_lock);
return -EBUSY;
+ } else if (!is_kill && test_bit(vlan_id, vport->vlan_del_fail_bmap)) {
+ clear_bit(vlan_id, vport->vlan_del_fail_bmap);
}
+ mutex_unlock(&hdev->vport_lock);
/* when port base vlan enabled, we use port base vlan as the vlan
* filter entry. In this case, we don't update vlan filter table
@@ -10470,17 +10471,22 @@ int hclge_set_vlan_filter(struct hnae3_handle *handle, __be16 proto,
}
if (!ret) {
- if (!is_kill)
+ if (!is_kill) {
hclge_add_vport_vlan_table(vport, vlan_id,
writen_to_tbl);
- else if (is_kill && vlan_id != 0)
+ } else if (is_kill && vlan_id != 0) {
+ mutex_lock(&hdev->vport_lock);
hclge_rm_vport_vlan_table(vport, vlan_id, false);
+ mutex_unlock(&hdev->vport_lock);
+ }
} else if (is_kill) {
/* when remove hw vlan filter failed, record the vlan id,
* and try to remove it from hw later, to be consistence
* with stack
*/
+ mutex_lock(&hdev->vport_lock);
set_bit(vlan_id, vport->vlan_del_fail_bmap);
+ mutex_unlock(&hdev->vport_lock);
}
hclge_set_vport_vlan_fltr_change(vport);
@@ -10520,6 +10526,7 @@ static void hclge_sync_vlan_filter(struct hclge_dev *hdev)
int i, ret, sync_cnt = 0;
u16 vlan_id;
+ mutex_lock(&hdev->vport_lock);
/* start from vport 1 for PF is always alive */
for (i = 0; i < hdev->num_alloc_vport; i++) {
struct hclge_vport *vport = &hdev->vport[i];
@@ -10530,21 +10537,26 @@ static void hclge_sync_vlan_filter(struct hclge_dev *hdev)
ret = hclge_set_vlan_filter_hw(hdev, htons(ETH_P_8021Q),
vport->vport_id, vlan_id,
true);
- if (ret && ret != -EINVAL)
+ if (ret && ret != -EINVAL) {
+ mutex_unlock(&hdev->vport_lock);
return;
+ }
clear_bit(vlan_id, vport->vlan_del_fail_bmap);
hclge_rm_vport_vlan_table(vport, vlan_id, false);
hclge_set_vport_vlan_fltr_change(vport);
sync_cnt++;
- if (sync_cnt >= HCLGE_MAX_SYNC_COUNT)
+ if (sync_cnt >= HCLGE_MAX_SYNC_COUNT) {
+ mutex_unlock(&hdev->vport_lock);
return;
+ }
vlan_id = find_first_bit(vport->vlan_del_fail_bmap,
VLAN_N_VID);
}
}
+ mutex_unlock(&hdev->vport_lock);
hclge_sync_vlan_fltr_state(hdev);
}
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index a4d68fb216fb..1c62e58ff6d8 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1206,6 +1206,8 @@ static int hclgevf_set_vlan_filter(struct hnae3_handle *handle,
test_bit(HCLGEVF_STATE_RST_FAIL, &hdev->state)) && is_kill) {
set_bit(vlan_id, hdev->vlan_del_fail_bmap);
return -EBUSY;
+ } else if (!is_kill && test_bit(vlan_id, hdev->vlan_del_fail_bmap)) {
+ clear_bit(vlan_id, hdev->vlan_del_fail_bmap);
}
hclgevf_build_send_msg(&send_msg, HCLGE_MBX_SET_VLAN,
@@ -1233,20 +1235,25 @@ static void hclgevf_sync_vlan_filter(struct hclgevf_dev *hdev)
int ret, sync_cnt = 0;
u16 vlan_id;
+ if (bitmap_empty(hdev->vlan_del_fail_bmap, VLAN_N_VID))
+ return;
+
+ rtnl_lock();
vlan_id = find_first_bit(hdev->vlan_del_fail_bmap, VLAN_N_VID);
while (vlan_id != VLAN_N_VID) {
ret = hclgevf_set_vlan_filter(handle, htons(ETH_P_8021Q),
vlan_id, true);
if (ret)
- return;
+ break;
clear_bit(vlan_id, hdev->vlan_del_fail_bmap);
sync_cnt++;
if (sync_cnt >= HCLGEVF_MAX_SYNC_COUNT)
- return;
+ break;
vlan_id = find_first_bit(hdev->vlan_del_fail_bmap, VLAN_N_VID);
}
+ rtnl_unlock();
}
static int hclgevf_en_hw_strip_rxvtag(struct hnae3_handle *handle, bool enable)
--
2.30.0
^ permalink raw reply related
* [PATCH V2 net 3/7] net: hns3: fix incorrect capability bit display for copper port
From: Jijie Shao @ 2023-11-10 9:37 UTC (permalink / raw)
To: yisen.zhuang, salil.mehta, davem, edumazet, kuba, pabeni
Cc: shenjian15, wangjie125, liuyonglong, shaojijie, netdev,
linux-kernel
In-Reply-To: <20231110093713.1895949-1-shaojijie@huawei.com>
From: Jian Shen <shenjian15@huawei.com>
Currently, the FEC capability bit is default set for device version V2.
It's incorrect for the copper port. Eventhough it doesn't make the nic
work abnormal, but the capability information display in debugfs may
confuse user. So clear it when driver get the port type inforamtion.
Fixes: 433ccce83504 ("net: hns3: use FEC capability queried from firmware")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
---
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index e22279e5d43f..c393b4ee4a32 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -11663,6 +11663,7 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
goto err_msi_irq_uninit;
if (hdev->hw.mac.media_type == HNAE3_MEDIA_TYPE_COPPER) {
+ clear_bit(HNAE3_DEV_SUPPORT_FEC_B, ae_dev->caps);
if (hnae3_dev_phy_imp_supported(hdev))
ret = hclge_update_tp_port_info(hdev);
else
--
2.30.0
^ permalink raw reply related
* [PATCH V2 net 4/7] net: hns3: fix out-of-bounds access may occur when coalesce info is read via debugfs
From: Jijie Shao @ 2023-11-10 9:37 UTC (permalink / raw)
To: yisen.zhuang, salil.mehta, davem, edumazet, kuba, pabeni
Cc: shenjian15, wangjie125, liuyonglong, shaojijie, netdev,
linux-kernel
In-Reply-To: <20231110093713.1895949-1-shaojijie@huawei.com>
From: Yonglong Liu <liuyonglong@huawei.com>
The hns3 driver define an array of string to show the coalesce
info, but if the kernel adds a new mode or a new state,
out-of-bounds access may occur when coalesce info is read via
debugfs, this patch fix the problem.
Fixes: c99fead7cb07 ("net: hns3: add debugfs support for interrupt coalesce")
Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
---
drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
index 0b138635bafa..c083d1d10767 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
@@ -503,11 +503,14 @@ static void hns3_get_coal_info(struct hns3_enet_tqp_vector *tqp_vector,
}
sprintf(result[j++], "%d", i);
- sprintf(result[j++], "%s", dim_state_str[dim->state]);
+ sprintf(result[j++], "%s", dim->state < ARRAY_SIZE(dim_state_str) ?
+ dim_state_str[dim->state] : "unknown");
sprintf(result[j++], "%u", dim->profile_ix);
- sprintf(result[j++], "%s", dim_cqe_mode_str[dim->mode]);
+ sprintf(result[j++], "%s", dim->mode < ARRAY_SIZE(dim_cqe_mode_str) ?
+ dim_cqe_mode_str[dim->mode] : "unknown");
sprintf(result[j++], "%s",
- dim_tune_stat_str[dim->tune_state]);
+ dim->tune_state < ARRAY_SIZE(dim_tune_stat_str) ?
+ dim_tune_stat_str[dim->tune_state] : "unknown");
sprintf(result[j++], "%u", dim->steps_left);
sprintf(result[j++], "%u", dim->steps_right);
sprintf(result[j++], "%u", dim->tired);
--
2.30.0
^ permalink raw reply related
* [PATCH V2 net 5/7] net: hns3: fix variable may not initialized problem in hns3_init_mac_addr()
From: Jijie Shao @ 2023-11-10 9:37 UTC (permalink / raw)
To: yisen.zhuang, salil.mehta, davem, edumazet, kuba, pabeni
Cc: shenjian15, wangjie125, liuyonglong, shaojijie, netdev,
linux-kernel
In-Reply-To: <20231110093713.1895949-1-shaojijie@huawei.com>
From: Yonglong Liu <liuyonglong@huawei.com>
When a VF is calling hns3_init_mac_addr(), get_mac_addr() may
return fail, then the value of mac_addr_temp is not initialized.
Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
---
drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 06117502001f..b618797a7e8d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -5139,7 +5139,7 @@ static int hns3_init_mac_addr(struct net_device *netdev)
struct hns3_nic_priv *priv = netdev_priv(netdev);
char format_mac_addr[HNAE3_FORMAT_MAC_ADDR_LEN];
struct hnae3_handle *h = priv->ae_handle;
- u8 mac_addr_temp[ETH_ALEN];
+ u8 mac_addr_temp[ETH_ALEN] = {0};
int ret = 0;
if (h->ae_algo->ops->get_mac_addr)
--
2.30.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox