[net-next v12 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [net-next v12 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver
@ 2025-05-22  7:54 Lukasz Majewski
  2025-05-22  7:54 ` [net-next v12 1/7] dt-bindings: net: Add MTIP L2 switch description Lukasz Majewski
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Lukasz Majewski @ 2025-05-22  7:54 UTC (permalink / raw)
  To: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Shawn Guo
  Cc: Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Richard Cochran, netdev, devicetree, linux-kernel, imx,
	linux-arm-kernel, Stefan Wahren, Simon Horman, Lukasz Majewski

This patch series adds support for More Than IP's L2 switch driver embedded
in some NXP's SoCs. This one has been tested on imx287, but is also available
in the vf610.

In the past there has been performed some attempts to upstream this driver:

1. The 4.19-cip based one [1]
2. DSA based one for 5.12 [2] - i.e. the switch itself was treat as a DSA switch
   with NO tag appended.
3. The extension for FEC driver for 5.12 [3] - the trick here was to fully reuse
   FEC when the in-HW switching is disabled. When bridge offloading is enabled,
   the driver uses already configured MAC and PHY to also configure PHY.

All three approaches were not accepted as eligible for upstreaming.

The driver from this series has floowing features:

1. It is fully separated from fec_main - i.e. can be used interchangeable
   with it. To be more specific - one can build them as modules and
   if required switch between them when e.g. bridge offloading is required.

   To be more specific:
        - Use FEC_MAIN: When one needs support for two ETH ports with separate
          uDMAs used for both and bridging can be realized in SW.

        - Use MTIPL2SW: When it is enough to support two ports with only uDMA0
          attached to switch and bridging shall be offloaded to HW. 

2. This driver uses MTIP's L2 switch internal VLAN feature to provide port
   separation at boot time. Port separation is disabled when bridging is
   required.

3. Example usage:
        Configuration:
        ip link set lan0 up; sleep 1;
        ip link set lan1 up; sleep 1;
        ip link add name br0 type bridge;
        ip link set br0 up; sleep 1;
        ip link set lan0 master br0;
        ip link set lan1 master br0;
        bridge link;
        ip addr add 192.168.2.17/24 dev br0;
        ping -c 5 192.168.2.222

        Removal:
        ip link set br0 down;
        ip link delete br0 type bridge;
        ip link set dev lan1 down
        ip link set dev lan0 down

4. Limitations:
        - Driver enables and disables switch operation with learning and ageing.
        - Missing is the advanced configuration (e.g. adding entries to FBD). This is
          on purpose, as up till now we didn't had consensus about how the driver
          shall be added to Linux.

5. Clang build:
	make LLVM_SUFFIX=-19 LLVM=1 mrproper
	cp ./arch/arm/configs/mxs_defconfig .config
	make ARCH=arm LLVM_SUFFIX=-19 LLVM=1 W=1 menuconfig
	make ARCH=arm LLVM_SUFFIX=-19 LLVM=1 W=1 -j8 LOADADDR=0x40008000 uImage dtbs

6. Kernel compliance checks:
	make coccicheck MODE=report J=4 M=drivers/net/ethernet/freescale/mtipsw/
	~/work/src/smatch/smatch_scripts/kchecker drivers/net/ethernet/freescale/mtipsw/


Links:
[1] - https://github.com/lmajewski/linux-imx28-l2switch/commits/master
[2] - https://github.com/lmajewski/linux-imx28-l2switch/tree/imx28-v5.12-L2-upstream-RFC_v1
[3] - https://source.denx.de/linux/linux-imx28-l2switch/-/tree/imx28-v5.12-L2-upstream-switchdev-RFC_v1?ref_type=heads

Lukasz Majewski (7):
  dt-bindings: net: Add MTIP L2 switch description
  ARM: dts: nxp: mxs: Adjust the imx28.dtsi L2 switch description
  ARM: dts: nxp: mxs: Adjust XEA board's DTS to support L2 switch
  net: mtip: The L2 switch driver for imx287
  ARM: mxs_defconfig: Enable CONFIG_NFS_FSCACHE
  ARM: mxs_defconfig: Update mxs_defconfig to 6.15-rc1
  ARM: mxs_defconfig: Enable CONFIG_FEC_MTIP_L2SW to support MTIP L2
    switch

 .../bindings/net/nxp,imx28-mtip-switch.yaml   |  150 ++
 MAINTAINERS                                   |    7 +
 arch/arm/boot/dts/nxp/mxs/imx28-xea.dts       |   56 +
 arch/arm/boot/dts/nxp/mxs/imx28.dtsi          |    9 +-
 arch/arm/configs/mxs_defconfig                |   13 +-
 drivers/net/ethernet/freescale/Kconfig        |    1 +
 drivers/net/ethernet/freescale/Makefile       |    1 +
 drivers/net/ethernet/freescale/mtipsw/Kconfig |   13 +
 .../net/ethernet/freescale/mtipsw/Makefile    |    4 +
 .../net/ethernet/freescale/mtipsw/mtipl2sw.c  | 1970 +++++++++++++++++
 .../net/ethernet/freescale/mtipsw/mtipl2sw.h  |  771 +++++++
 .../ethernet/freescale/mtipsw/mtipl2sw_br.c   |  120 +
 .../ethernet/freescale/mtipsw/mtipl2sw_mgnt.c |  436 ++++
 13 files changed, 3540 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/nxp,imx28-mtip-switch.yaml
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/Kconfig
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/Makefile
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/mtipl2sw.c
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/mtipl2sw.h
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/mtipl2sw_br.c
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/mtipl2sw_mgnt.c

-- 
2.39.5



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [net-next v12 1/7] dt-bindings: net: Add MTIP L2 switch description
  2025-05-22  7:54 [net-next v12 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver Lukasz Majewski
@ 2025-05-22  7:54 ` Lukasz Majewski
  2025-06-05 13:51   ` Rob Herring (Arm)
  2025-05-22  7:54 ` [net-next v12 2/7] ARM: dts: nxp: mxs: Adjust the imx28.dtsi " Lukasz Majewski
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Lukasz Majewski @ 2025-05-22  7:54 UTC (permalink / raw)
  To: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Shawn Guo
  Cc: Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Richard Cochran, netdev, devicetree, linux-kernel, imx,
	linux-arm-kernel, Stefan Wahren, Simon Horman, Lukasz Majewski

This patch provides description of the MTIP L2 switch available in some
NXP's SOCs - e.g. imx287.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Stefan Wahren <wahrenst@gmx.net>

---
Changes for v2:
- Rename the file to match exactly the compatible
  (nxp,imx287-mtip-switch)

Changes for v3:
- Remove '-' from const:'nxp,imx287-mtip-switch'
- Use '^port@[12]+$' for port patternProperties
- Drop status = "okay";
- Provide proper indentation for 'example' binding (replace 8
  spaces with 4 spaces)
- Remove smsc,disable-energy-detect; property
- Remove interrupt-parent and interrupts properties as not required
- Remove #address-cells and #size-cells from required properties check
- remove description from reg:
- Add $ref: ethernet-switch.yaml#

Changes for v4:
- Use $ref: ethernet-switch.yaml#/$defs/ethernet-ports and remove already
  referenced properties
- Rename file to nxp,imx28-mtip-switch.yaml

Changes for v5:
- Provide proper description for 'ethernet-port' node

Changes for v6:
- Proper usage of
  $ref: ethernet-switch.yaml#/$defs/ethernet-ports/patternProperties
  when specifying the 'ethernet-ports' property
- Add description and check for interrupt-names property

Changes for v7:
- Change switch interrupt name from 'mtipl2sw' to 'enet_switch'

Changes for v8:
- None

Changes for v9:
- Add GPIO_ACTIVE_LOW to reset-gpios mdio phandle

Changes for v10:
- None

Changes for v11:
- None

Changes for v12:
- Remove 'label' from required properties
- Move the reference to $ref: ethernet-switch.yaml#/$defs/ethernet-ports
  the proper place (under 'allOf:')
---
 .../bindings/net/nxp,imx28-mtip-switch.yaml   | 150 ++++++++++++++++++
 1 file changed, 150 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/nxp,imx28-mtip-switch.yaml

diff --git a/Documentation/devicetree/bindings/net/nxp,imx28-mtip-switch.yaml b/Documentation/devicetree/bindings/net/nxp,imx28-mtip-switch.yaml
new file mode 100644
index 000000000000..6a07dcd119ea
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/nxp,imx28-mtip-switch.yaml
@@ -0,0 +1,150 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/nxp,imx28-mtip-switch.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP SoC Ethernet Switch Controller (L2 MoreThanIP switch)
+
+maintainers:
+  - Lukasz Majewski <lukma@denx.de>
+
+description:
+  The 2-port switch ethernet subsystem provides ethernet packet (L2)
+  communication and can be configured as an ethernet switch. It provides the
+  reduced media independent interface (RMII), the management data input
+  output (MDIO) for physical layer device (PHY) management.
+
+allOf:
+  - $ref: ethernet-switch.yaml#/$defs/ethernet-ports
+
+properties:
+  compatible:
+    const: nxp,imx28-mtip-switch
+
+  reg:
+    maxItems: 1
+
+  phy-supply:
+    description:
+      Regulator that powers Ethernet PHYs.
+
+  clocks:
+    items:
+      - description: Register accessing clock
+      - description: Bus access clock
+      - description: Output clock for external device - e.g. PHY source clock
+      - description: IEEE1588 timer clock
+
+  clock-names:
+    items:
+      - const: ipg
+      - const: ahb
+      - const: enet_out
+      - const: ptp
+
+  interrupts:
+    items:
+      - description: Switch interrupt
+      - description: ENET0 interrupt
+      - description: ENET1 interrupt
+
+  interrupt-names:
+    items:
+      - const: enet_switch
+      - const: enet0
+      - const: enet1
+
+  pinctrl-names: true
+
+  ethernet-ports:
+    type: object
+    additionalProperties: true
+
+    patternProperties:
+      '^ethernet-port@[12]$':
+        type: object
+        additionalProperties: true
+        properties:
+          reg:
+            items:
+              - enum: [1, 2]
+            description: MTIP L2 switch port number
+
+        required:
+          - reg
+          - phy-mode
+          - phy-handle
+
+  mdio:
+    type: object
+    $ref: mdio.yaml#
+    unevaluatedProperties: false
+    description:
+      Specifies the mdio bus in the switch, used as a container for phy nodes.
+
+required:
+  - compatible
+  - reg
+  - clocks
+  - clock-names
+  - interrupts
+  - interrupt-names
+  - mdio
+  - ethernet-ports
+
+unevaluatedProperties: false
+
+examples:
+  - |
+    #include<dt-bindings/interrupt-controller/irq.h>
+    #include<dt-bindings/gpio/gpio.h>
+    switch@800f0000 {
+        compatible = "nxp,imx28-mtip-switch";
+        reg = <0x800f0000 0x20000>;
+        pinctrl-names = "default";
+        pinctrl-0 = <&mac0_pins_a>, <&mac1_pins_a>;
+        phy-supply = <&reg_fec_3v3>;
+        interrupts = <100>, <101>, <102>;
+        interrupt-names = "enet_switch", "enet0", "enet1";
+        clocks = <&clks 57>, <&clks 57>, <&clks 64>, <&clks 35>;
+        clock-names = "ipg", "ahb", "enet_out", "ptp";
+
+        ethernet-ports {
+            #address-cells = <1>;
+            #size-cells = <0>;
+
+            mtip_port1: ethernet-port@1 {
+                reg = <1>;
+                label = "lan0";
+                local-mac-address = [ 00 00 00 00 00 00 ];
+                phy-mode = "rmii";
+                phy-handle = <&ethphy0>;
+            };
+
+            mtip_port2: ethernet-port@2 {
+                reg = <2>;
+                label = "lan1";
+                local-mac-address = [ 00 00 00 00 00 00 ];
+                phy-mode = "rmii";
+                phy-handle = <&ethphy1>;
+            };
+        };
+
+        mdio_sw: mdio {
+            #address-cells = <1>;
+            #size-cells = <0>;
+
+            reset-gpios = <&gpio2 13 GPIO_ACTIVE_LOW>;
+            reset-delay-us = <25000>;
+            reset-post-delay-us = <10000>;
+
+            ethphy0: ethernet-phy@0 {
+                reg = <0>;
+            };
+
+            ethphy1: ethernet-phy@1 {
+                reg = <1>;
+            };
+        };
+    };
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [net-next v12 1/7] dt-bindings: net: Add MTIP L2 switch description
  2025-05-22  7:54 ` [net-next v12 1/7] dt-bindings: net: Add MTIP L2 switch description Lukasz Majewski
@ 2025-06-05 13:51   ` Rob Herring (Arm)
  0 siblings, 0 replies; 12+ messages in thread
From: Rob Herring (Arm) @ 2025-06-05 13:51 UTC (permalink / raw)
  To: Lukasz Majewski
  Cc: Krzysztof Kozlowski, Simon Horman, davem, Paolo Abeni,
	Sascha Hauer, Andrew Lunn, devicetree, Richard Cochran,
	Fabio Estevam, linux-kernel, Stefan Wahren, Jakub Kicinski,
	Eric Dumazet, Pengutronix Kernel Team, Conor Dooley, Shawn Guo,
	netdev, imx, linux-arm-kernel


On Thu, 22 May 2025 09:54:49 +0200, Lukasz Majewski wrote:
> This patch provides description of the MTIP L2 switch available in some
> NXP's SOCs - e.g. imx287.
> 
> Signed-off-by: Lukasz Majewski <lukma@denx.de>
> Reviewed-by: Stefan Wahren <wahrenst@gmx.net>
> 
> ---
> Changes for v2:
> - Rename the file to match exactly the compatible
>   (nxp,imx287-mtip-switch)
> 
> Changes for v3:
> - Remove '-' from const:'nxp,imx287-mtip-switch'
> - Use '^port@[12]+$' for port patternProperties
> - Drop status = "okay";
> - Provide proper indentation for 'example' binding (replace 8
>   spaces with 4 spaces)
> - Remove smsc,disable-energy-detect; property
> - Remove interrupt-parent and interrupts properties as not required
> - Remove #address-cells and #size-cells from required properties check
> - remove description from reg:
> - Add $ref: ethernet-switch.yaml#
> 
> Changes for v4:
> - Use $ref: ethernet-switch.yaml#/$defs/ethernet-ports and remove already
>   referenced properties
> - Rename file to nxp,imx28-mtip-switch.yaml
> 
> Changes for v5:
> - Provide proper description for 'ethernet-port' node
> 
> Changes for v6:
> - Proper usage of
>   $ref: ethernet-switch.yaml#/$defs/ethernet-ports/patternProperties
>   when specifying the 'ethernet-ports' property
> - Add description and check for interrupt-names property
> 
> Changes for v7:
> - Change switch interrupt name from 'mtipl2sw' to 'enet_switch'
> 
> Changes for v8:
> - None
> 
> Changes for v9:
> - Add GPIO_ACTIVE_LOW to reset-gpios mdio phandle
> 
> Changes for v10:
> - None
> 
> Changes for v11:
> - None
> 
> Changes for v12:
> - Remove 'label' from required properties
> - Move the reference to $ref: ethernet-switch.yaml#/$defs/ethernet-ports
>   the proper place (under 'allOf:')
> ---
>  .../bindings/net/nxp,imx28-mtip-switch.yaml   | 150 ++++++++++++++++++
>  1 file changed, 150 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/nxp,imx28-mtip-switch.yaml
> 

Reviewed-by: Rob Herring (Arm) <robh@kernel.org>



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [net-next v12 2/7] ARM: dts: nxp: mxs: Adjust the imx28.dtsi L2 switch description
  2025-05-22  7:54 [net-next v12 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver Lukasz Majewski
  2025-05-22  7:54 ` [net-next v12 1/7] dt-bindings: net: Add MTIP L2 switch description Lukasz Majewski
@ 2025-05-22  7:54 ` Lukasz Majewski
  2025-05-22  7:54 ` [net-next v12 3/7] ARM: dts: nxp: mxs: Adjust XEA board's DTS to support L2 switch Lukasz Majewski
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Lukasz Majewski @ 2025-05-22  7:54 UTC (permalink / raw)
  To: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Shawn Guo
  Cc: Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Richard Cochran, netdev, devicetree, linux-kernel, imx,
	linux-arm-kernel, Stefan Wahren, Simon Horman, Lukasz Majewski,
	Andrew Lunn

The current range of 'reg' property is too small to allow full control
of the L2 switch on imx287.

As this IP block also uses ENET-MAC blocks for its operation, the address
range for it must be included as well.

Moreover, some SoC common properties (like compatible, clocks, interrupts
numbers) have been moved to this node.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Stefan Wahren <wahrenst@gmx.net>

---
Changes for v2:
- adding extra properties (like compatible, clocks, interupts)

Changes for v3:
- None

Changes for v4:
- Rename imx287 with imx28 (as the former is not used in kernel anymore)

Changes for v5:
- None

Changes for v6:
- Add interrupt-names property

Changes for v7:
- Change switch interrupt name from 'mtipl2sw' to 'enet_switch'

Changes for v8:
- None

Changes for v9:
- None

Changes for v10:
- None

Changes for v11:
- None

Changes for v12:
- None
---
 arch/arm/boot/dts/nxp/mxs/imx28.dtsi | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/nxp/mxs/imx28.dtsi b/arch/arm/boot/dts/nxp/mxs/imx28.dtsi
index bbea8b77386f..8aff2e87980e 100644
--- a/arch/arm/boot/dts/nxp/mxs/imx28.dtsi
+++ b/arch/arm/boot/dts/nxp/mxs/imx28.dtsi
@@ -1321,8 +1321,13 @@ mac1: ethernet@800f4000 {
 			status = "disabled";
 		};
 
-		eth_switch: switch@800f8000 {
-			reg = <0x800f8000 0x8000>;
+		eth_switch: switch@800f0000 {
+			compatible = "nxp,imx28-mtip-switch";
+			reg = <0x800f0000 0x20000>;
+			interrupts = <100>, <101>, <102>;
+			interrupt-names = "enet_switch", "enet0", "enet1";
+			clocks = <&clks 57>, <&clks 57>, <&clks 64>, <&clks 35>;
+			clock-names = "ipg", "ahb", "enet_out", "ptp";
 			status = "disabled";
 		};
 	};
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [net-next v12 3/7] ARM: dts: nxp: mxs: Adjust XEA board's DTS to support L2 switch
  2025-05-22  7:54 [net-next v12 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver Lukasz Majewski
  2025-05-22  7:54 ` [net-next v12 1/7] dt-bindings: net: Add MTIP L2 switch description Lukasz Majewski
  2025-05-22  7:54 ` [net-next v12 2/7] ARM: dts: nxp: mxs: Adjust the imx28.dtsi " Lukasz Majewski
@ 2025-05-22  7:54 ` Lukasz Majewski
  2025-05-22  7:54 ` [net-next v12 5/7] ARM: mxs_defconfig: Enable CONFIG_NFS_FSCACHE Lukasz Majewski
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Lukasz Majewski @ 2025-05-22  7:54 UTC (permalink / raw)
  To: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Shawn Guo
  Cc: Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Richard Cochran, netdev, devicetree, linux-kernel, imx,
	linux-arm-kernel, Stefan Wahren, Simon Horman, Lukasz Majewski,
	Andrew Lunn

The description is similar to the one used with the new CPSW driver.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Stefan Wahren <wahrenst@gmx.net>

---
Changes for v2:
- Remove properties which are common for the imx28(7) SoC
- Use mdio properties to perform L2 switch reset (avoid using
  deprecated properties)

Changes for v3:
- Replace IRQ_TYPE_EDGE_FALLING with IRQ_TYPE_LEVEL_LOW
- Update comment regarding PHY interrupts s/AND/OR/g

Changes for v4:
- Use GPIO_ACTIVE_LOW instead of 0 in 'reset-gpios'
- Replace port@[12] with ethernet-port@[12]

Changes for v5:
- Add proper multiline comment for IRQs description

Changes for v6:
- None

Changes for v7:
- None

Changes for v8:
- None

Changes for v9:
- None

Changes for v10:
- None

Changes for v11:
- None

Changes for v12:
- None
---
 arch/arm/boot/dts/nxp/mxs/imx28-xea.dts | 56 +++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/arch/arm/boot/dts/nxp/mxs/imx28-xea.dts b/arch/arm/boot/dts/nxp/mxs/imx28-xea.dts
index 6c5e6856648a..69032b29d767 100644
--- a/arch/arm/boot/dts/nxp/mxs/imx28-xea.dts
+++ b/arch/arm/boot/dts/nxp/mxs/imx28-xea.dts
@@ -5,6 +5,7 @@
  */
 
 /dts-v1/;
+#include<dt-bindings/interrupt-controller/irq.h>
 #include "imx28-lwe.dtsi"
 
 / {
@@ -90,6 +91,61 @@ &reg_usb_5v {
 	gpio = <&gpio0 2 0>;
 };
 
+&eth_switch {
+	pinctrl-names = "default";
+	pinctrl-0 = <&mac0_pins_a>, <&mac1_pins_a>;
+	phy-supply = <&reg_fec_3v3>;
+	status = "okay";
+
+	ethernet-ports {
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		mtip_port1: ethernet-port@1 {
+			reg = <1>;
+			label = "lan0";
+			local-mac-address = [ 00 00 00 00 00 00 ];
+			phy-mode = "rmii";
+			phy-handle = <&ethphy0>;
+		};
+
+		mtip_port2: ethernet-port@2 {
+			reg = <2>;
+			label = "lan1";
+			local-mac-address = [ 00 00 00 00 00 00 ];
+			phy-mode = "rmii";
+			phy-handle = <&ethphy1>;
+		};
+	};
+
+	mdio_sw: mdio {
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		reset-gpios = <&gpio3 21 GPIO_ACTIVE_LOW>;
+		reset-delay-us = <25000>;
+		reset-post-delay-us = <10000>;
+
+		ethphy0: ethernet-phy@0 {
+			reg = <0>;
+			smsc,disable-energy-detect;
+			/*
+			 * Both PHYs (i.e. 0,1) have the same, single GPIO,
+			 * line to handle both, their interrupts (OR'ed)
+			 */
+			interrupt-parent = <&gpio4>;
+			interrupts = <13 IRQ_TYPE_LEVEL_LOW>;
+		};
+
+		ethphy1: ethernet-phy@1 {
+			reg = <1>;
+			smsc,disable-energy-detect;
+			interrupt-parent = <&gpio4>;
+			interrupts = <13 IRQ_TYPE_LEVEL_LOW>;
+		};
+	};
+};
+
 &spi2_pins_a {
 	fsl,pinmux-ids = <
 		MX28_PAD_SSP2_SCK__SSP2_SCK
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [net-next v12 5/7] ARM: mxs_defconfig: Enable CONFIG_NFS_FSCACHE
  2025-05-22  7:54 [net-next v12 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver Lukasz Majewski
                   ` (2 preceding siblings ...)
  2025-05-22  7:54 ` [net-next v12 3/7] ARM: dts: nxp: mxs: Adjust XEA board's DTS to support L2 switch Lukasz Majewski
@ 2025-05-22  7:54 ` Lukasz Majewski
  2025-05-22  7:54 ` [net-next v12 6/7] ARM: mxs_defconfig: Update mxs_defconfig to 6.15-rc1 Lukasz Majewski
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Lukasz Majewski @ 2025-05-22  7:54 UTC (permalink / raw)
  To: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Shawn Guo
  Cc: Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Richard Cochran, netdev, devicetree, linux-kernel, imx,
	linux-arm-kernel, Stefan Wahren, Simon Horman, Lukasz Majewski

It is not possible to enable by user the CONFIG_NETFS_SUPPORT anymore and
hence it depends on CONFIG_NFS_FSCACHE being enabled.

This patch fixes potential performance regression for NFS on the mxs
devices.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Stefan Wahren <wahrenst@gmx.net>
Suggested-by: Stefan Wahren <wahrenst@gmx.net>

---
Changes for v6:
- New patch

Changes for v7:
- None

Changes for v8:
- None

Changes for v9:
- None

Changes for v10:
- None

Changes for v11:
- None

Changes for v12:
- None
---
 arch/arm/configs/mxs_defconfig | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/configs/mxs_defconfig b/arch/arm/configs/mxs_defconfig
index c76d66135abb..22f7639f61fe 100644
--- a/arch/arm/configs/mxs_defconfig
+++ b/arch/arm/configs/mxs_defconfig
@@ -138,8 +138,6 @@ CONFIG_PWM_MXS=y
 CONFIG_NVMEM_MXS_OCOTP=y
 CONFIG_EXT4_FS=y
 # CONFIG_DNOTIFY is not set
-CONFIG_NETFS_SUPPORT=m
-CONFIG_FSCACHE=y
 CONFIG_FSCACHE_STATS=y
 CONFIG_CACHEFILES=m
 CONFIG_VFAT_FS=y
@@ -155,6 +153,7 @@ CONFIG_NFS_FS=y
 CONFIG_NFS_V3_ACL=y
 CONFIG_NFS_V4=y
 CONFIG_ROOT_NFS=y
+CONFIG_NFS_FSCACHE=y
 CONFIG_NLS_CODEPAGE_437=y
 CONFIG_NLS_CODEPAGE_850=y
 CONFIG_NLS_ISO8859_1=y
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [net-next v12 6/7] ARM: mxs_defconfig: Update mxs_defconfig to 6.15-rc1
  2025-05-22  7:54 [net-next v12 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver Lukasz Majewski
                   ` (3 preceding siblings ...)
  2025-05-22  7:54 ` [net-next v12 5/7] ARM: mxs_defconfig: Enable CONFIG_NFS_FSCACHE Lukasz Majewski
@ 2025-05-22  7:54 ` Lukasz Majewski
  2025-05-22  7:54 ` [net-next v12 7/7] ARM: mxs_defconfig: Enable CONFIG_FEC_MTIP_L2SW to support MTIP L2 switch Lukasz Majewski
       [not found] ` <20250522075455.1723560-5-lukma@denx.de>
  6 siblings, 0 replies; 12+ messages in thread
From: Lukasz Majewski @ 2025-05-22  7:54 UTC (permalink / raw)
  To: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Shawn Guo
  Cc: Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Richard Cochran, netdev, devicetree, linux-kernel, imx,
	linux-arm-kernel, Stefan Wahren, Simon Horman, Lukasz Majewski

This file is the updated version of mxs_defconfig for the v6.15-rc1
linux-next.

Detailed description of removed configuration entries:
-CONFIG_MTD_M25P80=y -> it has been replaced MTD_SPI_NOR (which is enabled)
-CONFIG_SMSC_PHY=y -> is enabled implicit by USB_NET_SMSC95XX
-CONFIG_GPIO_SYSFS=y -> it has been deprecated by moving to EXPERT and
                        its replacement GPIO_CDEV is enabled by default

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Stefan Wahren <wahrenst@gmx.net>
Suggested-by: Stefan Wahren <wahrenst@gmx.net>

---
Changes for v5:
- New patch

Changes for v6:
- Add detailed description on the removed configuration options
  after update

Changes for v7:
- None

Changes for v8:
- None

Changes for v9:
- None

Changes for v10:
- None

Changes for v11:
- None

Changes for v12:
- None
---
 arch/arm/configs/mxs_defconfig | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/arch/arm/configs/mxs_defconfig b/arch/arm/configs/mxs_defconfig
index 22f7639f61fe..b1a31cb914c8 100644
--- a/arch/arm/configs/mxs_defconfig
+++ b/arch/arm/configs/mxs_defconfig
@@ -32,9 +32,6 @@ CONFIG_INET=y
 CONFIG_IP_PNP=y
 CONFIG_IP_PNP_DHCP=y
 CONFIG_SYN_COOKIES=y
-# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
-# CONFIG_INET_XFRM_MODE_TUNNEL is not set
-# CONFIG_INET_XFRM_MODE_BEET is not set
 # CONFIG_INET_DIAG is not set
 # CONFIG_IPV6 is not set
 CONFIG_CAN=m
@@ -45,7 +42,6 @@ CONFIG_MTD=y
 CONFIG_MTD_CMDLINE_PARTS=y
 CONFIG_MTD_BLOCK=y
 CONFIG_MTD_DATAFLASH=y
-CONFIG_MTD_M25P80=y
 CONFIG_MTD_SST25L=y
 CONFIG_MTD_RAW_NAND=y
 CONFIG_MTD_NAND_GPMI_NAND=y
@@ -60,7 +56,6 @@ CONFIG_ENC28J60=y
 CONFIG_ICPLUS_PHY=y
 CONFIG_MICREL_PHY=y
 CONFIG_REALTEK_PHY=y
-CONFIG_SMSC_PHY=y
 CONFIG_CAN_FLEXCAN=m
 CONFIG_USB_USBNET=y
 CONFIG_USB_NET_SMSC95XX=y
@@ -77,13 +72,11 @@ CONFIG_SERIAL_AMBA_PL011=y
 CONFIG_SERIAL_AMBA_PL011_CONSOLE=y
 CONFIG_SERIAL_MXS_AUART=y
 # CONFIG_HW_RANDOM is not set
-# CONFIG_I2C_COMPAT is not set
 CONFIG_I2C_CHARDEV=y
 CONFIG_I2C_MXS=y
 CONFIG_SPI=y
 CONFIG_SPI_GPIO=m
 CONFIG_SPI_MXS=y
-CONFIG_GPIO_SYSFS=y
 # CONFIG_HWMON is not set
 CONFIG_WATCHDOG=y
 CONFIG_STMP3XXX_RTC_WATCHDOG=y
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [net-next v12 7/7] ARM: mxs_defconfig: Enable CONFIG_FEC_MTIP_L2SW to support MTIP L2 switch
  2025-05-22  7:54 [net-next v12 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver Lukasz Majewski
                   ` (4 preceding siblings ...)
  2025-05-22  7:54 ` [net-next v12 6/7] ARM: mxs_defconfig: Update mxs_defconfig to 6.15-rc1 Lukasz Majewski
@ 2025-05-22  7:54 ` Lukasz Majewski
       [not found] ` <20250522075455.1723560-5-lukma@denx.de>
  6 siblings, 0 replies; 12+ messages in thread
From: Lukasz Majewski @ 2025-05-22  7:54 UTC (permalink / raw)
  To: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Shawn Guo
  Cc: Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Richard Cochran, netdev, devicetree, linux-kernel, imx,
	linux-arm-kernel, Stefan Wahren, Simon Horman, Lukasz Majewski

This patch enables support for More Than IP L2 switch available on some
imx28[7] devices.

Moreover, it also enables CONFIG_SWITCHDEV and CONFIG_BRIDGE required
by this driver for correct operation.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Stefan Wahren <wahrenst@gmx.net>
---
Changes for v4:
- New patch

Changes for v5:
- Apply this patch on top of patch, which updates mxs_defconfig to
  v6.15-rc1
- Add more verbose commit message with explanation why SWITCHDEV and
  BRIDGE must be enabled as well

Changes for v6:
- None

Changes for v7:
- None

Changes for v8:
- None

Changes for v9:
- None

Changes for v10:
- None

Changes for v11:
- None

Changes for v12:
- None
---
 arch/arm/configs/mxs_defconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm/configs/mxs_defconfig b/arch/arm/configs/mxs_defconfig
index b1a31cb914c8..ef4556222274 100644
--- a/arch/arm/configs/mxs_defconfig
+++ b/arch/arm/configs/mxs_defconfig
@@ -34,6 +34,8 @@ CONFIG_IP_PNP_DHCP=y
 CONFIG_SYN_COOKIES=y
 # CONFIG_INET_DIAG is not set
 # CONFIG_IPV6 is not set
+CONFIG_BRIDGE=y
+CONFIG_NET_SWITCHDEV=y
 CONFIG_CAN=m
 # CONFIG_WIRELESS is not set
 CONFIG_DEVTMPFS=y
@@ -52,6 +54,7 @@ CONFIG_EEPROM_AT24=y
 CONFIG_SCSI=y
 CONFIG_BLK_DEV_SD=y
 CONFIG_NETDEVICES=y
+CONFIG_FEC_MTIP_L2SW=y
 CONFIG_ENC28J60=y
 CONFIG_ICPLUS_PHY=y
 CONFIG_MICREL_PHY=y
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 12+ messages in thread

[parent not found: <20250522075455.1723560-5-lukma@denx.de>]

* Re: [net-next v12 4/7] net: mtip: The L2 switch driver for imx287
       [not found] ` <20250522075455.1723560-5-lukma@denx.de>
@ 2025-05-27 11:39   ` Paolo Abeni
  2025-05-28 10:53     ` Lukasz Majewski
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2025-05-27 11:39 UTC (permalink / raw)
  To: Lukasz Majewski, Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Shawn Guo
  Cc: Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Richard Cochran, netdev, devicetree, linux-kernel, imx,
	linux-arm-kernel, Stefan Wahren, Simon Horman, Andrew Lunn

On 5/22/25 9:54 AM, Lukasz Majewski wrote:
> +/* dynamicms MAC address table learn and migration */
> +static void
> +mtip_atable_dynamicms_learn_migration(struct switch_enet_private *fep,
> +				      int curr_time, unsigned char *mac,
> +				      u8 *rx_port)
> +{
> +	u8 port = MTIP_PORT_FORWARDING_INIT;
> +	struct mtip_port_info *port_info;
> +	u32 rx_mac_lo = 0, rx_mac_hi = 0;
> +	unsigned long flags;
> +	int index;
> +
> +	spin_lock_irqsave(&fep->learn_lock, flags);

AFAICS this is called by napi context and by a plain thread context,
spin_lock_bh() should be sufficient.

> +
> +	if (mac && is_valid_ether_addr(mac)) {
> +		rx_mac_lo = (u32)((mac[3] << 24) | (mac[2] << 16) |
> +				  (mac[1] << 8) | mac[0]);
> +		rx_mac_hi = (u32)((mac[5] << 8) | (mac[4]));
> +	}
> +
> +	port_info = mtip_portinfofifo_read(fep);
> +	while (port_info) {
> +		/* get block index from lookup table */
> +		index = GET_BLOCK_PTR(port_info->hash);
> +		mtip_update_atable_dynamic1(port_info->maclo, port_info->machi,
> +					    index, port_info->port,
> +					    curr_time, fep);
> +
> +		if (mac && is_valid_ether_addr(mac) &&
> +		    port == MTIP_PORT_FORWARDING_INIT) {
> +			if (rx_mac_lo == port_info->maclo &&
> +			    rx_mac_hi == port_info->machi) {
> +				/* The newly learned MAC is the source of
> +				 * our filtered frame.
> +				 */
> +				port = (u8)port_info->port;
> +			}
> +		}
> +		port_info = mtip_portinfofifo_read(fep);
> +	}
> +
> +	if (rx_port)
> +		*rx_port = port;
> +
> +	spin_unlock_irqrestore(&fep->learn_lock, flags);
> +}
> +
> +static void mtip_aging_timer(struct timer_list *t)
> +{
> +	struct switch_enet_private *fep = from_timer(fep, t, timer_aging);
> +
> +	fep->curr_time = mtip_timeincrement(fep->curr_time);
> +
> +	mod_timer(&fep->timer_aging,
> +		  jiffies + msecs_to_jiffies(LEARNING_AGING_INTERVAL));
> +}

It's unclear to me why you need to maintain a timer just to update a
timestamp?!?

(jiffies >> msecs_to_jiffies(LEARNING_AGING_INTERVAL)) & ((1 <<
AT_DENTRY_TIMESTAMP_WIDTH) - 1)

should yield the same value (and possibly define a bitmask as a shortcut)

> +static netdev_tx_t mtip_start_xmit_port(struct sk_buff *skb,
> +					struct net_device *dev, int port)
> +{
> +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> +	struct switch_enet_private *fep = priv->fep;
> +	unsigned short	status;
> +	unsigned long flags;
> +	struct cbd_t *bdp;
> +	void *bufaddr;
> +
> +	spin_lock_irqsave(&fep->hw_lock, flags);

AFAICS this lock is acquired only by napi and thread context the _bh
variant should be sufficient.

> +
> +	if (!fep->link[0] && !fep->link[1]) {
> +		/* Link is down or autonegotiation is in progress. */
> +		netif_stop_queue(dev);
> +		spin_unlock_irqrestore(&fep->hw_lock, flags);
> +		return NETDEV_TX_BUSY;

Intead you should probably stop the queue when such events happen

> +	}
> +
> +	/* Fill in a Tx ring entry */
> +	bdp = fep->cur_tx;
> +
> +	status = bdp->cbd_sc;
> +
> +	if (status & BD_ENET_TX_READY) {
> +		/* All transmit buffers are full. Bail out.
> +		 * This should not happen, since dev->tbusy should be set.
> +		 */
> +		dev_err(&fep->pdev->dev, "%s: tx queue full!.\n", dev->name);
> +		spin_unlock_irqrestore(&fep->hw_lock, flags);
> +		return NETDEV_TX_BUSY;

Instead you should use
netif_txq_maybe_stop()/netif_subqueue_maybe_stop() to stop the queue
eariler.

> +	}
> +
> +	/* Clear all of the status flags */
> +	status &= ~BD_ENET_TX_STATS;
> +
> +	/* Set buffer length and buffer pointer */
> +	bufaddr = skb->data;
> +	bdp->cbd_datlen = skb->len;
> +
> +	/* On some FEC implementations data must be aligned on
> +	 * 4-byte boundaries. Use bounce buffers to copy data
> +	 * and get it aligned.
> +	 */
> +	if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {
> +		unsigned int index;
> +
> +		index = bdp - fep->tx_bd_base;
> +		memcpy(fep->tx_bounce[index],
> +		       (void *)skb->data, skb->len);
> +		bufaddr = fep->tx_bounce[index];
> +	}
> +
> +	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> +		swap_buffer(bufaddr, skb->len);

Ouch, the above will kill performances. Also it looks like it will
access uninitialized memory if skb->len is not 4 bytes aligned.

> +
> +	/* Save skb pointer. */
> +	fep->tx_skbuff[fep->skb_cur] = skb;
> +
> +	dev->stats.tx_bytes += skb->len;

It looks like this start is incremented too early, as tx could still
fail later.

> +	fep->skb_cur = (fep->skb_cur + 1) & TX_RING_MOD_MASK;
> +
> +	/* Push the data cache so the CPM does not get stale memory
> +	 * data.
> +	 */
> +	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> +					  MTIP_SWITCH_TX_FRSIZE,
> +					  DMA_TO_DEVICE);
> +	if (unlikely(dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr))) {
> +		dev_err(&fep->pdev->dev,
> +			"Failed to map descriptor tx buffer\n");
> +		dev->stats.tx_errors++;
> +		dev->stats.tx_dropped++;
> +		dev_kfree_skb_any(skb);
> +		goto err;
> +	}
> +
> +	/* Send it on its way.  Tell FEC it's ready, interrupt when done,
> +	 * it's the last BD of the frame, and to put the CRC on the end.
> +	 */
> +

Likely you need some memory barrier here to ensure the descriptor status
update is seen by the device after the buffer addr update.

> +	status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
> +			| BD_ENET_TX_LAST | BD_ENET_TX_TC);
> +	bdp->cbd_sc = status;
> +
> +	netif_trans_update(dev);
> +	skb_tx_timestamp(skb);
> +
> +	/* For port separation - force sending via specified port */
> +	if (!fep->br_offload && port != 0)
> +		mtip_forced_forward(fep, port, 1);
> +
> +	/* Trigger transmission start */
> +	writel(MCF_ESW_TDAR_X_DES_ACTIVE, fep->hwp + ESW_TDAR);

Possibly you should check skb->xmit_more to avoid ringing the doorbell
when not needed.

> +static void mtip_timeout(struct net_device *dev, unsigned int txqueue)
> +{
> +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> +	struct switch_enet_private *fep = priv->fep;
> +	struct cbd_t *bdp;
> +	int i;
> +
> +	dev->stats.tx_errors++;
> +
> +	if (IS_ENABLED(CONFIG_SWITCH_DEBUG)) {
> +		dev_info(&dev->dev, "%s: transmit timed out.\n", dev->name);
> +		dev_info(&dev->dev,
> +			 "Ring data: cur_tx %lx%s, dirty_tx %lx cur_rx: %lx\n",
> +			 (unsigned long)fep->cur_tx,
> +			 fep->tx_full ? " (full)" : "",
> +			 (unsigned long)fep->dirty_tx,
> +			 (unsigned long)fep->cur_rx);
> +
> +		bdp = fep->tx_bd_base;
> +		dev_info(&dev->dev, " tx: %u buffers\n", TX_RING_SIZE);
> +		for (i = 0; i < TX_RING_SIZE; i++) {
> +			dev_info(&dev->dev, "  %08lx: %04x %04x %08x\n",
> +				 (kernel_ulong_t)bdp, bdp->cbd_sc,
> +				 bdp->cbd_datlen, (int)bdp->cbd_bufaddr);
> +			bdp++;
> +		}
> +
> +		bdp = fep->rx_bd_base;
> +		dev_info(&dev->dev, " rx: %lu buffers\n",
> +			 (unsigned long)RX_RING_SIZE);
> +		for (i = 0 ; i < RX_RING_SIZE; i++) {
> +			dev_info(&dev->dev, "  %08lx: %04x %04x %08x\n",
> +				 (kernel_ulong_t)bdp,
> +				 bdp->cbd_sc, bdp->cbd_datlen,
> +				 (int)bdp->cbd_bufaddr);
> +			bdp++;
> +		}
> +	}
> +
> +	rtnl_lock();

This is called in atomic scope, you can't acquire a mutex here. Instead
you could schedule a work and do the reset in such scope.

> +	if (netif_device_present(dev) || netif_running(dev)) {
> +		napi_disable(&fep->napi);
> +		netif_tx_lock_bh(dev);
> +		mtip_switch_restart(dev, fep->full_duplex[0],
> +				    fep->full_duplex[1]);
> +		netif_tx_wake_all_queues(dev);
> +		netif_tx_unlock_bh(dev);
> +		napi_enable(&fep->napi);
> +	}
> +	rtnl_unlock();
> +}

> +
> +/* During a receive, the cur_rx points to the current incoming buffer.
> + * When we update through the ring, if the next incoming buffer has
> + * not been given to the system, we just set the empty indicator,
> + * effectively tossing the packet.
> + */
> +static int mtip_switch_rx(struct net_device *dev, int budget, int *port)
> +{
> +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> +	u8 *data, rx_port = MTIP_PORT_FORWARDING_INIT;
> +	struct switch_enet_private *fep = priv->fep;
> +	unsigned short status, pkt_len;
> +	struct net_device *pndev;
> +	struct ethhdr *eth_hdr;
> +	int pkt_received = 0;
> +	struct sk_buff *skb;
> +	unsigned long flags;
> +	struct cbd_t *bdp;
> +
> +	spin_lock_irqsave(&fep->hw_lock, flags);
> +
> +	/* First, grab all of the stats for the incoming packet.
> +	 * These get messed up if we get called due to a busy condition.
> +	 */
> +	bdp = fep->cur_rx;
> +
> +	while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
> +		if (pkt_received >= budget)
> +			break;
> +
> +		pkt_received++;
> +		/* Since we have allocated space to hold a complete frame,
> +		 * the last indicator should be set.
> +		 */
> +		if ((status & BD_ENET_RX_LAST) == 0)
> +			dev_warn_ratelimited(&dev->dev,
> +					     "SWITCH ENET: rcv is not +last\n");
> +
> +		if (!fep->usage_count)
> +			goto rx_processing_done;
> +
> +		/* Check for errors. */
> +		if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH | BD_ENET_RX_NO |
> +			      BD_ENET_RX_CR | BD_ENET_RX_OV)) {
> +			dev->stats.rx_errors++;
> +			if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH)) {
> +				/* Frame too long or too short. */
> +				dev->stats.rx_length_errors++;
> +			}
> +			if (status & BD_ENET_RX_NO)	/* Frame alignment */
> +				dev->stats.rx_frame_errors++;
> +			if (status & BD_ENET_RX_CR)	/* CRC Error */
> +				dev->stats.rx_crc_errors++;
> +			if (status & BD_ENET_RX_OV)	/* FIFO overrun */
> +				dev->stats.rx_fifo_errors++;
> +		}
> +
> +		/* Report late collisions as a frame error.
> +		 * On this error, the BD is closed, but we don't know what we
> +		 * have in the buffer.  So, just drop this frame on the floor.
> +		 */
> +		if (status & BD_ENET_RX_CL) {
> +			dev->stats.rx_errors++;
> +			dev->stats.rx_frame_errors++;
> +			goto rx_processing_done;
> +		}
> +
> +		/* Process the incoming frame */
> +		pkt_len = bdp->cbd_datlen;
> +		data = (__u8 *)__va(bdp->cbd_bufaddr);
> +
> +		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
> +				 bdp->cbd_datlen, DMA_FROM_DEVICE);

I have read your explaination WRT unmap/map. Actually you don't need to
do any mapping here, since you are unconditionally copying the whole
buffer (why???) and re-using it.

Still you need a dma_sync_single() to ensure the CPUs see the correct data.

> +
> +		if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> +			swap_buffer(data, pkt_len);
> +
> +		if (data) {
> +			eth_hdr = (struct ethhdr *)data;
> +			mtip_atable_get_entry_port_number(fep,
> +							  eth_hdr->h_source,
> +							  &rx_port);
> +			if (rx_port == MTIP_PORT_FORWARDING_INIT)
> +				mtip_atable_dynamicms_learn_migration(fep,
> +								      fep->curr_time,
> +								      eth_hdr->h_source,
> +								      &rx_port);
> +		}
> +
> +		if (!fep->br_offload && (rx_port == 1 || rx_port == 2))
> +			pndev = fep->ndev[rx_port - 1];
> +		else
> +			pndev = dev;
> +
> +		*port = rx_port;
> +		pndev->stats.rx_packets++;
> +		pndev->stats.rx_bytes += pkt_len;

It looks like the stats are incremented too early, as the packets could
still be dropped a few lines later

> +
> +		/* This does 16 byte alignment, exactly what we need.
> +		 * The packet length includes FCS, but we don't want to
> +		 * include that when passing upstream as it messes up
> +		 * bridging applications.
> +		 */
> +		skb = netdev_alloc_skb(pndev, pkt_len + NET_IP_ALIGN);
> +		if (unlikely(!skb)) {
> +			dev_dbg(&fep->pdev->dev,
> +				"%s: Memory squeeze, dropping packet.\n",
> +				pndev->name);
> +			pndev->stats.rx_dropped++;
> +			goto err_mem;
> +		} else {
> +			skb_reserve(skb, NET_IP_ALIGN);
> +			skb_put(skb, pkt_len);      /* Make room */
> +			skb_copy_to_linear_data(skb, data, pkt_len);
> +			skb->protocol = eth_type_trans(skb, pndev);
> +			napi_gro_receive(&fep->napi, skb);
> +		}
> +
> +		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, data,
> +						  bdp->cbd_datlen,
> +						  DMA_FROM_DEVICE);
> +		if (unlikely(dma_mapping_error(&fep->pdev->dev,
> +					       bdp->cbd_bufaddr))) {
> +			dev_err(&fep->pdev->dev,
> +				"Failed to map descriptor rx buffer\n");
> +			pndev->stats.rx_errors++;
> +			pndev->stats.rx_dropped++;
> +			dev_kfree_skb_any(skb);

The above statement is wrong even if you intend to keep the
dma_unmap/dma_map pair (and please, don't do that! ;). At this point the
skb ownership has been handed to the stack by the previous
napi_gro_receive(), freeing it here will cause UaF and double free.

> +			goto err_mem;
> +		}
> +
> + rx_processing_done:
> +		/* Clear the status flags for this buffer */
> +		status &= ~BD_ENET_RX_STATS;

With the dma map/unmap in place, you likely need a memory barrier to
ensure the device will see the descriptor status update after bufferptr
update.

> +static int mtip_alloc_buffers(struct net_device *dev)
> +{
> +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> +	struct switch_enet_private *fep = priv->fep;
> +	struct sk_buff *skb;
> +	struct cbd_t *bdp;
> +	int i;
> +
> +	bdp = fep->rx_bd_base;
> +	for (i = 0; i < RX_RING_SIZE; i++) {
> +		skb = netdev_alloc_skb(dev, MTIP_SWITCH_RX_FRSIZE);
> +		if (!skb)
> +			goto err;
> +
> +		fep->rx_skbuff[i] = skb;
> +
> +		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, skb->data,
> +						  MTIP_SWITCH_RX_FRSIZE,
> +						  DMA_FROM_DEVICE);
> +		if (unlikely(dma_mapping_error(&fep->pdev->dev,
> +					       bdp->cbd_bufaddr))) {
> +			dev_err(&fep->pdev->dev,
> +				"Failed to map descriptor rx buffer\n");
> +			dev_kfree_skb_any(skb);

At this point fep->rx_skbuff[i] is still not NULL, and later
mtip_free_buffers() will try to free it again. You should remove the
above dev_kfree_skb_any(skb).

> +static const struct ethtool_ops mtip_ethtool_ops = {
> +	.get_link_ksettings     = phy_ethtool_get_link_ksettings,
> +	.set_link_ksettings     = phy_ethtool_set_link_ksettings,
> +	.get_drvinfo            = mtip_get_drvinfo,
> +	.get_link               = ethtool_op_get_link,
> +	.get_ts_info		= ethtool_op_get_ts_info,
> +};
> +
> +static const struct net_device_ops mtip_netdev_ops = {
> +	.ndo_open		= mtip_open,
> +	.ndo_stop		= mtip_close,
> +	.ndo_start_xmit	= mtip_start_xmit,
> +	.ndo_set_rx_mode	= mtip_set_multicast_list,
> +	.ndo_tx_timeout	= mtip_timeout,
> +	.ndo_set_mac_address	= mtip_set_mac_address,
> +};
> +
> +bool mtip_is_switch_netdev_port(const struct net_device *ndev)
> +{
> +	return ndev->netdev_ops == &mtip_netdev_ops;
> +}
> +
> +static int mtip_switch_dma_init(struct switch_enet_private *fep)
> +{
> +	struct cbd_t *bdp, *cbd_base;
> +	int ret, i;
> +
> +	/* Check mask of the streaming and coherent API */
> +	ret = dma_set_mask_and_coherent(&fep->pdev->dev, DMA_BIT_MASK(32));
> +	if (ret < 0) {
> +		dev_err(&fep->pdev->dev, "No suitable DMA available\n");
> +		return ret;
> +	}
> +
> +	/* Allocate memory for buffer descriptors */
> +	cbd_base = dma_alloc_coherent(&fep->pdev->dev, PAGE_SIZE, &fep->bd_dma,
> +				      GFP_KERNEL);
> +	if (!cbd_base)
> +		return -ENOMEM;
> +
> +	/* Set receive and transmit descriptor base */
> +	fep->rx_bd_base = cbd_base;
> +	fep->tx_bd_base = cbd_base + RX_RING_SIZE;
> +
> +	/* Initialize the receive buffer descriptors */
> +	bdp = fep->rx_bd_base;
> +	for (i = 0; i < RX_RING_SIZE; i++) {
> +		bdp->cbd_sc = 0;
> +		bdp++;
> +	}
> +
> +	/* Set the last buffer to wrap */
> +	bdp--;
> +	bdp->cbd_sc |= BD_SC_WRAP;

This is a recurring pattern, you should use an helper for it.

> +/* FEC MII MMFR bits definition */
> +#define FEC_MMFR_ST             BIT(30)
> +#define FEC_MMFR_OP_READ        BIT(29)
> +#define FEC_MMFR_OP_WRITE       BIT(28)
> +#define FEC_MMFR_PA(v)          (((v) & 0x1F) << 23)
> +#define FEC_MMFR_RA(v)          (((v) & 0x1F) << 18)

Here and elsewhere it looks like you could use FIELD_PREP and friends

This patch is really too big, I'm pretty sure I missed some relevant
issues. You should split it in multiple ones: i.e. initialization and
h/w access, rx/tx, others ndos.

/P



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v12 4/7] net: mtip: The L2 switch driver for imx287
  2025-05-27 11:39   ` [net-next v12 4/7] net: mtip: The L2 switch driver for imx287 Paolo Abeni
@ 2025-05-28 10:53     ` Lukasz Majewski
  2025-05-29  8:43       ` Paolo Abeni
  0 siblings, 1 reply; 12+ messages in thread
From: Lukasz Majewski @ 2025-05-28 10:53 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman, Andrew Lunn

[-- Attachment #1: Type: text/plain, Size: 21169 bytes --]

Hi Paolo,

> On 5/22/25 9:54 AM, Lukasz Majewski wrote:
> > +/* dynamicms MAC address table learn and migration */
> > +static void
> > +mtip_atable_dynamicms_learn_migration(struct switch_enet_private
> > *fep,
> > +				      int curr_time, unsigned char
> > *mac,
> > +				      u8 *rx_port)
> > +{
> > +	u8 port = MTIP_PORT_FORWARDING_INIT;
> > +	struct mtip_port_info *port_info;
> > +	u32 rx_mac_lo = 0, rx_mac_hi = 0;
> > +	unsigned long flags;
> > +	int index;
> > +
> > +	spin_lock_irqsave(&fep->learn_lock, flags);  
> 
> AFAICS this is called by napi context and by a plain thread context,
> spin_lock_bh() should be sufficient.

Ok

> 
> > +
> > +	if (mac && is_valid_ether_addr(mac)) {
> > +		rx_mac_lo = (u32)((mac[3] << 24) | (mac[2] << 16) |
> > +				  (mac[1] << 8) | mac[0]);
> > +		rx_mac_hi = (u32)((mac[5] << 8) | (mac[4]));
> > +	}
> > +
> > +	port_info = mtip_portinfofifo_read(fep);
> > +	while (port_info) {
> > +		/* get block index from lookup table */
> > +		index = GET_BLOCK_PTR(port_info->hash);
> > +		mtip_update_atable_dynamic1(port_info->maclo,
> > port_info->machi,
> > +					    index, port_info->port,
> > +					    curr_time, fep);
> > +
> > +		if (mac && is_valid_ether_addr(mac) &&
> > +		    port == MTIP_PORT_FORWARDING_INIT) {
> > +			if (rx_mac_lo == port_info->maclo &&
> > +			    rx_mac_hi == port_info->machi) {
> > +				/* The newly learned MAC is the
> > source of
> > +				 * our filtered frame.
> > +				 */
> > +				port = (u8)port_info->port;
> > +			}
> > +		}
> > +		port_info = mtip_portinfofifo_read(fep);
> > +	}
> > +
> > +	if (rx_port)
> > +		*rx_port = port;
> > +
> > +	spin_unlock_irqrestore(&fep->learn_lock, flags);
> > +}
> > +
> > +static void mtip_aging_timer(struct timer_list *t)
> > +{
> > +	struct switch_enet_private *fep = from_timer(fep, t,
> > timer_aging); +
> > +	fep->curr_time = mtip_timeincrement(fep->curr_time);
> > +
> > +	mod_timer(&fep->timer_aging,
> > +		  jiffies +
> > msecs_to_jiffies(LEARNING_AGING_INTERVAL)); +}  
> 
> It's unclear to me why you need to maintain a timer just to update a
> timestamp?!?
> 

This timestamp is afterwards used in:
mtip_atable_dynamicms_learn_migration(), which in turn manages the
entries in switch "dynamic" table (it is one of the fields in the
record.

> (jiffies >> msecs_to_jiffies(LEARNING_AGING_INTERVAL)) & ((1 <<
> AT_DENTRY_TIMESTAMP_WIDTH) - 1)
> 

If I understood you correctly - I shall remove the timer and then just
use the above line (based on jiffies) when
mtip_atable_dynamicms_learn_migration() is called (and it requires the
timestamp)?

Otherwise the mtip_timeincrement() seems like a nice wrapper on
incrementing the timestamp.

> should yield the same value (and possibly define a bitmask as a
> shortcut)
> 
> > +static netdev_tx_t mtip_start_xmit_port(struct sk_buff *skb,
> > +					struct net_device *dev,
> > int port) +{
> > +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> > +	struct switch_enet_private *fep = priv->fep;
> > +	unsigned short	status;
> > +	unsigned long flags;
> > +	struct cbd_t *bdp;
> > +	void *bufaddr;
> > +
> > +	spin_lock_irqsave(&fep->hw_lock, flags);  
> 
> AFAICS this lock is acquired only by napi and thread context the _bh
> variant should be sufficient.

Ok.

> 
> > +
> > +	if (!fep->link[0] && !fep->link[1]) {
> > +		/* Link is down or autonegotiation is in progress.
> > */
> > +		netif_stop_queue(dev);
> > +		spin_unlock_irqrestore(&fep->hw_lock, flags);
> > +		return NETDEV_TX_BUSY;  
> 
> Intead you should probably stop the queue when such events happen

Please correct me if I'm wrong - the netif_stop_queue(dev); is called
before return. Shall something different be also done?

> 
> > +	}
> > +
> > +	/* Fill in a Tx ring entry */
> > +	bdp = fep->cur_tx;
> > +
> > +	status = bdp->cbd_sc;
> > +
> > +	if (status & BD_ENET_TX_READY) {
> > +		/* All transmit buffers are full. Bail out.
> > +		 * This should not happen, since dev->tbusy should
> > be set.
> > +		 */
> > +		dev_err(&fep->pdev->dev, "%s: tx queue full!.\n",
> > dev->name);
> > +		spin_unlock_irqrestore(&fep->hw_lock, flags);
> > +		return NETDEV_TX_BUSY;  
> 
> Instead you should use
> netif_txq_maybe_stop()/netif_subqueue_maybe_stop() to stop the queue
> eariler.

As I don't manage queues - maybe the netif_txq_maybe_stop() seems to be
an overkill. In the earlier code the netif_stop_queue() is used.

> 
> > +	}
> > +
> > +	/* Clear all of the status flags */
> > +	status &= ~BD_ENET_TX_STATS;
> > +
> > +	/* Set buffer length and buffer pointer */
> > +	bufaddr = skb->data;
> > +	bdp->cbd_datlen = skb->len;
> > +
> > +	/* On some FEC implementations data must be aligned on
> > +	 * 4-byte boundaries. Use bounce buffers to copy data
> > +	 * and get it aligned.
> > +	 */
> > +	if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {
> > +		unsigned int index;
> > +
> > +		index = bdp - fep->tx_bd_base;
> > +		memcpy(fep->tx_bounce[index],
> > +		       (void *)skb->data, skb->len);
> > +		bufaddr = fep->tx_bounce[index];
> > +	}
> > +
> > +	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> > +		swap_buffer(bufaddr, skb->len);  
> 
> Ouch, the above will kill performances.

This unfortunately must be done in such a way (the same approach is
present on fec_main.c) as the IP block is implemented in such a way
(explicit conversion from big endian to little endian).

> Also it looks like it will
> access uninitialized memory if skb->len is not 4 bytes aligned.
> 

There is a few lines above a special code to prevent from such a
situation ((unsigned long)bufaddr & MTIP_ALIGNMENT).

> > +
> > +	/* Save skb pointer. */
> > +	fep->tx_skbuff[fep->skb_cur] = skb;
> > +
> > +	dev->stats.tx_bytes += skb->len;  
> 
> It looks like this start is incremented too early, as tx could still
> fail later.

Ok.

> 
> > +	fep->skb_cur = (fep->skb_cur + 1) & TX_RING_MOD_MASK;
> > +
> > +	/* Push the data cache so the CPM does not get stale memory
> > +	 * data.
> > +	 */
> > +	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> > +					  MTIP_SWITCH_TX_FRSIZE,
> > +					  DMA_TO_DEVICE);
> > +	if (unlikely(dma_mapping_error(&fep->pdev->dev,
> > bdp->cbd_bufaddr))) {
> > +		dev_err(&fep->pdev->dev,
> > +			"Failed to map descriptor tx buffer\n");
> > +		dev->stats.tx_errors++;
> > +		dev->stats.tx_dropped++;
> > +		dev_kfree_skb_any(skb);
> > +		goto err;
> > +	}
> > +
> > +	/* Send it on its way.  Tell FEC it's ready, interrupt
> > when done,
> > +	 * it's the last BD of the frame, and to put the CRC on
> > the end.
> > +	 */
> > +  
> 
> Likely you need some memory barrier here to ensure the descriptor
> status update is seen by the device after the buffer addr update.
> 
> > +	status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
> > +			| BD_ENET_TX_LAST | BD_ENET_TX_TC);

I will add wmb() here.

> > +	bdp->cbd_sc = status;
> > +
> > +	netif_trans_update(dev);
> > +	skb_tx_timestamp(skb);
> > +
> > +	/* For port separation - force sending via specified port
> > */
> > +	if (!fep->br_offload && port != 0)
> > +		mtip_forced_forward(fep, port, 1);
> > +
> > +	/* Trigger transmission start */
> > +	writel(MCF_ESW_TDAR_X_DES_ACTIVE, fep->hwp + ESW_TDAR);  
> 
> Possibly you should check skb->xmit_more to avoid ringing the doorbell
> when not needed.

I couldn't find skb->xmit_more in the current sources. Instead, there
is netdev_xmit_more().

However, the TX code just is supposed to setup one frame transmission
and hence there is no risk that we trigger "empty" transmission.

> 
> > +static void mtip_timeout(struct net_device *dev, unsigned int
> > txqueue) +{
> > +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> > +	struct switch_enet_private *fep = priv->fep;
> > +	struct cbd_t *bdp;
> > +	int i;
> > +
> > +	dev->stats.tx_errors++;
> > +
> > +	if (IS_ENABLED(CONFIG_SWITCH_DEBUG)) {
> > +		dev_info(&dev->dev, "%s: transmit timed out.\n",
> > dev->name);
> > +		dev_info(&dev->dev,
> > +			 "Ring data: cur_tx %lx%s, dirty_tx %lx
> > cur_rx: %lx\n",
> > +			 (unsigned long)fep->cur_tx,
> > +			 fep->tx_full ? " (full)" : "",
> > +			 (unsigned long)fep->dirty_tx,
> > +			 (unsigned long)fep->cur_rx);
> > +
> > +		bdp = fep->tx_bd_base;
> > +		dev_info(&dev->dev, " tx: %u buffers\n",
> > TX_RING_SIZE);
> > +		for (i = 0; i < TX_RING_SIZE; i++) {
> > +			dev_info(&dev->dev, "  %08lx: %04x %04x
> > %08x\n",
> > +				 (kernel_ulong_t)bdp, bdp->cbd_sc,
> > +				 bdp->cbd_datlen,
> > (int)bdp->cbd_bufaddr);
> > +			bdp++;
> > +		}
> > +
> > +		bdp = fep->rx_bd_base;
> > +		dev_info(&dev->dev, " rx: %lu buffers\n",
> > +			 (unsigned long)RX_RING_SIZE);
> > +		for (i = 0 ; i < RX_RING_SIZE; i++) {
> > +			dev_info(&dev->dev, "  %08lx: %04x %04x
> > %08x\n",
> > +				 (kernel_ulong_t)bdp,
> > +				 bdp->cbd_sc, bdp->cbd_datlen,
> > +				 (int)bdp->cbd_bufaddr);
> > +			bdp++;
> > +		}
> > +	}
> > +
> > +	rtnl_lock();  
> 
> This is called in atomic scope, you can't acquire a mutex here.
> Instead you could schedule a work and do the reset in such scope.
> 

Yes, you are right. I will rewrite it.

> > +	if (netif_device_present(dev) || netif_running(dev)) {
> > +		napi_disable(&fep->napi);
> > +		netif_tx_lock_bh(dev);
> > +		mtip_switch_restart(dev, fep->full_duplex[0],
> > +				    fep->full_duplex[1]);
> > +		netif_tx_wake_all_queues(dev);
> > +		netif_tx_unlock_bh(dev);
> > +		napi_enable(&fep->napi);
> > +	}
> > +	rtnl_unlock();
> > +}  
> 
> > +
> > +/* During a receive, the cur_rx points to the current incoming
> > buffer.
> > + * When we update through the ring, if the next incoming buffer has
> > + * not been given to the system, we just set the empty indicator,
> > + * effectively tossing the packet.
> > + */
> > +static int mtip_switch_rx(struct net_device *dev, int budget, int
> > *port) +{
> > +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> > +	u8 *data, rx_port = MTIP_PORT_FORWARDING_INIT;
> > +	struct switch_enet_private *fep = priv->fep;
> > +	unsigned short status, pkt_len;
> > +	struct net_device *pndev;
> > +	struct ethhdr *eth_hdr;
> > +	int pkt_received = 0;
> > +	struct sk_buff *skb;
> > +	unsigned long flags;
> > +	struct cbd_t *bdp;
> > +
> > +	spin_lock_irqsave(&fep->hw_lock, flags);
> > +

It is also called in the NAPI context, so I will change
spin_lock_irqsave() to spin_lock_bh().

> > +	/* First, grab all of the stats for the incoming packet.
> > +	 * These get messed up if we get called due to a busy
> > condition.
> > +	 */
> > +	bdp = fep->cur_rx;
> > +
> > +	while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
> > +		if (pkt_received >= budget)
> > +			break;
> > +
> > +		pkt_received++;
> > +		/* Since we have allocated space to hold a
> > complete frame,
> > +		 * the last indicator should be set.
> > +		 */
> > +		if ((status & BD_ENET_RX_LAST) == 0)
> > +			dev_warn_ratelimited(&dev->dev,
> > +					     "SWITCH ENET: rcv is
> > not +last\n"); +
> > +		if (!fep->usage_count)
> > +			goto rx_processing_done;
> > +
> > +		/* Check for errors. */
> > +		if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH |
> > BD_ENET_RX_NO |
> > +			      BD_ENET_RX_CR | BD_ENET_RX_OV)) {
> > +			dev->stats.rx_errors++;
> > +			if (status & (BD_ENET_RX_LG |
> > BD_ENET_RX_SH)) {
> > +				/* Frame too long or too short. */
> > +				dev->stats.rx_length_errors++;
> > +			}
> > +			if (status & BD_ENET_RX_NO)	/*
> > Frame alignment */
> > +				dev->stats.rx_frame_errors++;
> > +			if (status & BD_ENET_RX_CR)	/* CRC
> > Error */
> > +				dev->stats.rx_crc_errors++;
> > +			if (status & BD_ENET_RX_OV)	/* FIFO
> > overrun */
> > +				dev->stats.rx_fifo_errors++;
> > +		}
> > +
> > +		/* Report late collisions as a frame error.
> > +		 * On this error, the BD is closed, but we don't
> > know what we
> > +		 * have in the buffer.  So, just drop this frame
> > on the floor.
> > +		 */
> > +		if (status & BD_ENET_RX_CL) {
> > +			dev->stats.rx_errors++;
> > +			dev->stats.rx_frame_errors++;
> > +			goto rx_processing_done;
> > +		}
> > +
> > +		/* Process the incoming frame */
> > +		pkt_len = bdp->cbd_datlen;
> > +		data = (__u8 *)__va(bdp->cbd_bufaddr);
> > +
> > +		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
> > +				 bdp->cbd_datlen,
> > DMA_FROM_DEVICE);  
> 
> I have read your explaination WRT unmap/map. Actually you don't need
> to do any mapping here, 

There are 16 cbd_t descriptors allocated (as dma_alloc_coherent). Those
descriptors contain pointer to data (being read in this case).

Hence the need to perform dma_map_single() for each descriptor, so I
would hold the correct pointer. However, initially this is done in
mtip_alloc_buffers().

> since you are unconditionally copying the
> whole buffer (why???)

Only the value of 
pkt_len = bdp->cbd_datlen; is copied to SKB (after byte swap_buffer()).

> and re-using it.
> 
> Still you need a dma_sync_single() to ensure the CPUs see the correct
> data.

The descriptors - i.e. struct cbd_t fields are allocated with
dma_alloc_coherent(), so this is OK.

The pointer, which is provided by dma_map_single(), is then used by
cbd_t descriptor to store data read by MTIP IP block.

> 
> > +
> > +		if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> > +			swap_buffer(data, pkt_len);
> > +
> > +		if (data) {
> > +			eth_hdr = (struct ethhdr *)data;
> > +			mtip_atable_get_entry_port_number(fep,
> > +
> > eth_hdr->h_source,
> > +
> > &rx_port);
> > +			if (rx_port == MTIP_PORT_FORWARDING_INIT)
> > +
> > mtip_atable_dynamicms_learn_migration(fep,
> > +
> >    fep->curr_time,
> > +
> >    eth_hdr->h_source,
> > +
> >    &rx_port);
> > +		}
> > +
> > +		if (!fep->br_offload && (rx_port == 1 || rx_port
> > == 2))
> > +			pndev = fep->ndev[rx_port - 1];
> > +		else
> > +			pndev = dev;
> > +
> > +		*port = rx_port;
> > +		pndev->stats.rx_packets++;
> > +		pndev->stats.rx_bytes += pkt_len;  
> 
> It looks like the stats are incremented too early, as the packets
> could still be dropped a few lines later

+1

> 
> > +
> > +		/* This does 16 byte alignment, exactly what we
> > need.
> > +		 * The packet length includes FCS, but we don't
> > want to
> > +		 * include that when passing upstream as it messes
> > up
> > +		 * bridging applications.
> > +		 */
> > +		skb = netdev_alloc_skb(pndev, pkt_len +
> > NET_IP_ALIGN);
> > +		if (unlikely(!skb)) {
> > +			dev_dbg(&fep->pdev->dev,
> > +				"%s: Memory squeeze, dropping
> > packet.\n",
> > +				pndev->name);
> > +			pndev->stats.rx_dropped++;
> > +			goto err_mem;
> > +		} else {
> > +			skb_reserve(skb, NET_IP_ALIGN);
> > +			skb_put(skb, pkt_len);      /* Make room */
> > +			skb_copy_to_linear_data(skb, data,
> > pkt_len);
> > +			skb->protocol = eth_type_trans(skb, pndev);
> > +			napi_gro_receive(&fep->napi, skb);
> > +		}
> > +
> > +		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev,
> > data,
> > +						  bdp->cbd_datlen,
> > +						  DMA_FROM_DEVICE);
> > +		if (unlikely(dma_mapping_error(&fep->pdev->dev,
> > +					       bdp->cbd_bufaddr)))
> > {
> > +			dev_err(&fep->pdev->dev,
> > +				"Failed to map descriptor rx
> > buffer\n");
> > +			pndev->stats.rx_errors++;
> > +			pndev->stats.rx_dropped++;
> > +			dev_kfree_skb_any(skb);  
> 
> The above statement is wrong even if you intend to keep the
> dma_unmap/dma_map pair (and please, don't do that! ;).

It looks like the in the mtip_alloc_buffers() the area to provide
pointer for bdp->cbd_bufaddr is allocated.

Then in the mtip_switch_rx() (if data is received) - the data (from
bdp->cbd_bufaddr) is read and the dma_unmap_single() is called.

When the data is "passed" via SKB to upper "layers" of network stack,
then the dma_map_single() is call (with the same bdp->cbd_datlen
parameter) to allocate pointer for bdp->cbd_bufaddr.

Indeed, it looks like not optimal solution (maybe there are some side
effects to cover from this IP block?). 

I will check if dma_sync_single_for_cpu() can be used instead (so we can
re-use the descriptors' pointers from the initial allocation).

> At this point
> the skb ownership has been handed to the stack by the previous
> napi_gro_receive(), freeing it here will cause UaF and double free.
> 

I will remove the call to dev_kfree_skb_any(skb);

> > +			goto err_mem;
> > +		}
> > +
> > + rx_processing_done:
> > +		/* Clear the status flags for this buffer */
> > +		status &= ~BD_ENET_RX_STATS;  
> 
> With the dma map/unmap in place, you likely need a memory barrier to
> ensure the device will see the descriptor status update after
> bufferptr update.

I will add wmb() here.

> 
> > +static int mtip_alloc_buffers(struct net_device *dev)
> > +{
> > +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> > +	struct switch_enet_private *fep = priv->fep;
> > +	struct sk_buff *skb;
> > +	struct cbd_t *bdp;
> > +	int i;
> > +
> > +	bdp = fep->rx_bd_base;
> > +	for (i = 0; i < RX_RING_SIZE; i++) {
> > +		skb = netdev_alloc_skb(dev, MTIP_SWITCH_RX_FRSIZE);
> > +		if (!skb)
> > +			goto err;
> > +
> > +		fep->rx_skbuff[i] = skb;
> > +
> > +		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev,
> > skb->data,
> > +
> > MTIP_SWITCH_RX_FRSIZE,
> > +						  DMA_FROM_DEVICE);
> > +		if (unlikely(dma_mapping_error(&fep->pdev->dev,
> > +					       bdp->cbd_bufaddr)))
> > {
> > +			dev_err(&fep->pdev->dev,
> > +				"Failed to map descriptor rx
> > buffer\n");
> > +			dev_kfree_skb_any(skb);  
> 
> At this point fep->rx_skbuff[i] is still not NULL, and later
> mtip_free_buffers() will try to free it again. You should remove the
> above dev_kfree_skb_any(skb).

+1

> 
> > +static const struct ethtool_ops mtip_ethtool_ops = {
> > +	.get_link_ksettings     = phy_ethtool_get_link_ksettings,
> > +	.set_link_ksettings     = phy_ethtool_set_link_ksettings,
> > +	.get_drvinfo            = mtip_get_drvinfo,
> > +	.get_link               = ethtool_op_get_link,
> > +	.get_ts_info		= ethtool_op_get_ts_info,
> > +};
> > +
> > +static const struct net_device_ops mtip_netdev_ops = {
> > +	.ndo_open		= mtip_open,
> > +	.ndo_stop		= mtip_close,
> > +	.ndo_start_xmit	= mtip_start_xmit,
> > +	.ndo_set_rx_mode	= mtip_set_multicast_list,
> > +	.ndo_tx_timeout	= mtip_timeout,
> > +	.ndo_set_mac_address	= mtip_set_mac_address,
> > +};
> > +
> > +bool mtip_is_switch_netdev_port(const struct net_device *ndev)
> > +{
> > +	return ndev->netdev_ops == &mtip_netdev_ops;
> > +}
> > +
> > +static int mtip_switch_dma_init(struct switch_enet_private *fep)
> > +{
> > +	struct cbd_t *bdp, *cbd_base;
> > +	int ret, i;
> > +
> > +	/* Check mask of the streaming and coherent API */
> > +	ret = dma_set_mask_and_coherent(&fep->pdev->dev,
> > DMA_BIT_MASK(32));
> > +	if (ret < 0) {
> > +		dev_err(&fep->pdev->dev, "No suitable DMA
> > available\n");
> > +		return ret;
> > +	}
> > +
> > +	/* Allocate memory for buffer descriptors */
> > +	cbd_base = dma_alloc_coherent(&fep->pdev->dev, PAGE_SIZE,
> > &fep->bd_dma,
> > +				      GFP_KERNEL);
> > +	if (!cbd_base)
> > +		return -ENOMEM;
> > +
> > +	/* Set receive and transmit descriptor base */
> > +	fep->rx_bd_base = cbd_base;
> > +	fep->tx_bd_base = cbd_base + RX_RING_SIZE;
> > +
> > +	/* Initialize the receive buffer descriptors */
> > +	bdp = fep->rx_bd_base;
> > +	for (i = 0; i < RX_RING_SIZE; i++) {
> > +		bdp->cbd_sc = 0;
> > +		bdp++;
> > +	}
> > +
> > +	/* Set the last buffer to wrap */
> > +	bdp--;
> > +	bdp->cbd_sc |= BD_SC_WRAP;  
> 
> This is a recurring pattern, you should use an helper for it.
> 

Ok.

> > +/* FEC MII MMFR bits definition */
> > +#define FEC_MMFR_ST             BIT(30)
> > +#define FEC_MMFR_OP_READ        BIT(29)
> > +#define FEC_MMFR_OP_WRITE       BIT(28)
> > +#define FEC_MMFR_PA(v)          (((v) & 0x1F) << 23)
> > +#define FEC_MMFR_RA(v)          (((v) & 0x1F) << 18)  
> 
> Here and elsewhere it looks like you could use FIELD_PREP and friends

Ok, I will adjust the code.

> 
> This patch is really too big, I'm pretty sure I missed some relevant
> issues. You should split it in multiple ones: i.e. initialization and
> h/w access, rx/tx, others ndos.

It is quite hard to "scatter" this patch as:

1. I've already split it to several files (which correspond to
different "logical" entities - like mtipl2sw_br.c).
2. The mtipl2sw.c file is the smallest part of the "core" of the
driver.
3. If I split it, then at some point I would break bisectability for
imx28.

> 
> /P
> 

Big thanks for your comments.


Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v12 4/7] net: mtip: The L2 switch driver for imx287
  2025-05-28 10:53     ` Lukasz Majewski
@ 2025-05-29  8:43       ` Paolo Abeni
  2025-06-09 23:02         ` Lukasz Majewski
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2025-05-29  8:43 UTC (permalink / raw)
  To: Lukasz Majewski
  Cc: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman, Andrew Lunn

On 5/28/25 12:53 PM, Lukasz Majewski wrote:
>> On 5/22/25 9:54 AM, Lukasz Majewski wrote:
>>> +/* dynamicms MAC address table learn and migration */
>>> +static void mtip_aging_timer(struct timer_list *t)
>>> +{
>>> +	struct switch_enet_private *fep = from_timer(fep, t,
>>> timer_aging); +
>>> +	fep->curr_time = mtip_timeincrement(fep->curr_time);
>>> +
>>> +	mod_timer(&fep->timer_aging,
>>> +		  jiffies +
>>> msecs_to_jiffies(LEARNING_AGING_INTERVAL)); +}  
>>
>> It's unclear to me why you need to maintain a timer just to update a
>> timestamp?!?
>>
> 
> This timestamp is afterwards used in:
> mtip_atable_dynamicms_learn_migration(), which in turn manages the
> entries in switch "dynamic" table (it is one of the fields in the
> record.
> 
>> (jiffies >> msecs_to_jiffies(LEARNING_AGING_INTERVAL)) & ((1 <<
>> AT_DENTRY_TIMESTAMP_WIDTH) - 1)
>>
> 
> If I understood you correctly - I shall remove the timer and then just
> use the above line (based on jiffies) when
> mtip_atable_dynamicms_learn_migration() is called (and it requires the
> timestamp)?
> 
> Otherwise the mtip_timeincrement() seems like a nice wrapper on
> incrementing the timestamp.

Scheduling a timer to obtain a value you can have for free is not a good
resource usage strategy. Note that is a pending question/check above:
verify that the suggested expression yield the expected value in all the
possible use-case.

>>> +	if (!fep->link[0] && !fep->link[1]) {
>>> +		/* Link is down or autonegotiation is in progress.
>>> */
>>> +		netif_stop_queue(dev);
>>> +		spin_unlock_irqrestore(&fep->hw_lock, flags);
>>> +		return NETDEV_TX_BUSY;  
>>
>> Intead you should probably stop the queue when such events happen
> 
> Please correct me if I'm wrong - the netif_stop_queue(dev); is called
> before return. Shall something different be also done?

The xmit routine should assume the link is up and the tx ring has enough
free slot to enqueue a packet. After enqueueing it should check for
enough space availble for the next xmit and stop the queue, likely using
the netif_txq_maybe_stop() helper.

Documentation/networking/driver.rst

>>> +	}
>>> +
>>> +	/* Clear all of the status flags */
>>> +	status &= ~BD_ENET_TX_STATS;
>>> +
>>> +	/* Set buffer length and buffer pointer */
>>> +	bufaddr = skb->data;
>>> +	bdp->cbd_datlen = skb->len;
>>> +
>>> +	/* On some FEC implementations data must be aligned on
>>> +	 * 4-byte boundaries. Use bounce buffers to copy data
>>> +	 * and get it aligned.
>>> +	 */
>>> +	if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {
>>> +		unsigned int index;
>>> +
>>> +		index = bdp - fep->tx_bd_base;
>>> +		memcpy(fep->tx_bounce[index],
>>> +		       (void *)skb->data, skb->len);
>>> +		bufaddr = fep->tx_bounce[index];
>>> +	}
>>> +
>>> +	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
>>> +		swap_buffer(bufaddr, skb->len);  
>>
>> Ouch, the above will kill performances.
> 
> This unfortunately must be done in such a way (the same approach is
> present on fec_main.c) as the IP block is implemented in such a way
> (explicit conversion from big endian to little endian).
> 
>> Also it looks like it will
>> access uninitialized memory if skb->len is not 4 bytes aligned.
>>
> 
> There is a few lines above a special code to prevent from such a
> situation ((unsigned long)bufaddr & MTIP_ALIGNMENT).

The problem here is not with memory buffer alignment, but with the
packet length, that can be not a multiple of 4. In such a case the last
swap will do an out-of-bound read touching uninitialized data.

>>> +	bdp->cbd_sc = status;
>>> +
>>> +	netif_trans_update(dev);
>>> +	skb_tx_timestamp(skb);
>>> +
>>> +	/* For port separation - force sending via specified port
>>> */
>>> +	if (!fep->br_offload && port != 0)
>>> +		mtip_forced_forward(fep, port, 1);
>>> +
>>> +	/* Trigger transmission start */
>>> +	writel(MCF_ESW_TDAR_X_DES_ACTIVE, fep->hwp + ESW_TDAR);  
>>
>> Possibly you should check skb->xmit_more to avoid ringing the doorbell
>> when not needed.
> 
> I couldn't find skb->xmit_more in the current sources. Instead, there
> is netdev_xmit_more().

Yeah, I referred to the old code, sorry.

> However, the TX code just is supposed to setup one frame transmission
> and hence there is no risk that we trigger "empty" transmission.

The point is that doorbell ringing is usually very expensive (slow) for
the H/W. And is not needed when netdev_xmit_more() is true, because the
another xmit operation will follow. If you care about performances you
should leverage such info.

> 
>>> +	/* First, grab all of the stats for the incoming packet.
>>> +	 * These get messed up if we get called due to a busy
>>> condition.
>>> +	 */
>>> +	bdp = fep->cur_rx;
>>> +
>>> +	while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
>>> +		if (pkt_received >= budget)
>>> +			break;
>>> +
>>> +		pkt_received++;
>>> +		/* Since we have allocated space to hold a
>>> complete frame,
>>> +		 * the last indicator should be set.
>>> +		 */
>>> +		if ((status & BD_ENET_RX_LAST) == 0)
>>> +			dev_warn_ratelimited(&dev->dev,
>>> +					     "SWITCH ENET: rcv is
>>> not +last\n"); +
>>> +		if (!fep->usage_count)
>>> +			goto rx_processing_done;
>>> +
>>> +		/* Check for errors. */
>>> +		if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH |
>>> BD_ENET_RX_NO |
>>> +			      BD_ENET_RX_CR | BD_ENET_RX_OV)) {
>>> +			dev->stats.rx_errors++;
>>> +			if (status & (BD_ENET_RX_LG |
>>> BD_ENET_RX_SH)) {
>>> +				/* Frame too long or too short. */
>>> +				dev->stats.rx_length_errors++;
>>> +			}
>>> +			if (status & BD_ENET_RX_NO)	/*
>>> Frame alignment */
>>> +				dev->stats.rx_frame_errors++;
>>> +			if (status & BD_ENET_RX_CR)	/* CRC
>>> Error */
>>> +				dev->stats.rx_crc_errors++;
>>> +			if (status & BD_ENET_RX_OV)	/* FIFO
>>> overrun */
>>> +				dev->stats.rx_fifo_errors++;
>>> +		}
>>> +
>>> +		/* Report late collisions as a frame error.
>>> +		 * On this error, the BD is closed, but we don't
>>> know what we
>>> +		 * have in the buffer.  So, just drop this frame
>>> on the floor.
>>> +		 */
>>> +		if (status & BD_ENET_RX_CL) {
>>> +			dev->stats.rx_errors++;
>>> +			dev->stats.rx_frame_errors++;
>>> +			goto rx_processing_done;
>>> +		}
>>> +
>>> +		/* Process the incoming frame */
>>> +		pkt_len = bdp->cbd_datlen;
>>> +		data = (__u8 *)__va(bdp->cbd_bufaddr);
>>> +
>>> +		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
>>> +				 bdp->cbd_datlen,
>>> DMA_FROM_DEVICE);  
>>
>> I have read your explaination WRT unmap/map. Actually you don't need
>> to do any mapping here, 
> 
> There are 16 cbd_t descriptors allocated (as dma_alloc_coherent). Those
> descriptors contain pointer to data (being read in this case).

I'm referring to the actual packet payload, that is the buffer at
bdp-cbd_bufaddr with len bdp->cbd_datlen; I'm not discussing the
descriptors contents.

> Hence the need to perform dma_map_single() for each descriptor, 

You are not unmapping the descriptor, you are unmapping the packet payload.

>> since you are unconditionally copying the
>> whole buffer (why???)
> 
> Only the value of 
> pkt_len = bdp->cbd_datlen; is copied to SKB (after byte swap_buffer()).

The relevant line is:

		skb_copy_to_linear_data(skb, data, pkt_len);

AFAICS that copies whole packet contents, which is usually quite
sub-optimal from performance PoV.

>> and re-using it.
>>
>> Still you need a dma_sync_single() to ensure the CPUs see the correct
>> data.
> 
> The descriptors - i.e. struct cbd_t fields are allocated with
> dma_alloc_coherent(), so this is OK.

I'm talking about packets contents, not packet descriptors. Please
re-read the above and have a look at other drivers code.

An additional point that I missed in the previous review is that the rx
allocation schema is quite uncorrect. At ring initialization time you
allocate full skbs, while what you need and use is just raw buffers for
the packet payload. Instead you could/should use the page pool:

Documentation/networking/page_pool.rst

That will also help doing the right thing WRT DMA handling.

>> This patch is really too big, I'm pretty sure I missed some relevant
>> issues. You should split it in multiple ones: i.e. initialization and
>> h/w access, rx/tx, others ndos.
> 
> It is quite hard to "scatter" this patch as:
> 
> 1. I've already split it to several files (which correspond to
> different "logical" entities - like mtipl2sw_br.c).
> 2. The mtipl2sw.c file is the smallest part of the "core" of the
> driver.
> 3. If I split it, then at some point I would break bisectability for
> imx28.

Note that each patch don't need to provide complete functionality. i.e.
patch 1 could implement ndo_open()/close and related helper, leaving
ndo_start_xmit() and napi_poll empty and avoid allocating the rx
buffers. patch 2 could implement the rx patch, patch 3 the tx path.

The only constraint is that each patch will build successufully, which
is usually easy to achieve.

A 2K lines patches will probably lead to many more iterations and
unhappy (or no) reviewers.

/P



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v12 4/7] net: mtip: The L2 switch driver for imx287
  2025-05-29  8:43       ` Paolo Abeni
@ 2025-06-09 23:02         ` Lukasz Majewski
  0 siblings, 0 replies; 12+ messages in thread
From: Lukasz Majewski @ 2025-06-09 23:02 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman, Andrew Lunn

[-- Attachment #1: Type: text/plain, Size: 12281 bytes --]

Hi Paolo,

> On 5/28/25 12:53 PM, Lukasz Majewski wrote:
> >> On 5/22/25 9:54 AM, Lukasz Majewski wrote:  
> >>> +/* dynamicms MAC address table learn and migration */
> >>> +static void mtip_aging_timer(struct timer_list *t)
> >>> +{
> >>> +	struct switch_enet_private *fep = from_timer(fep, t,
> >>> timer_aging); +
> >>> +	fep->curr_time = mtip_timeincrement(fep->curr_time);
> >>> +
> >>> +	mod_timer(&fep->timer_aging,
> >>> +		  jiffies +
> >>> msecs_to_jiffies(LEARNING_AGING_INTERVAL)); +}    
> >>
> >> It's unclear to me why you need to maintain a timer just to update
> >> a timestamp?!?
> >>  
> > 
> > This timestamp is afterwards used in:
> > mtip_atable_dynamicms_learn_migration(), which in turn manages the
> > entries in switch "dynamic" table (it is one of the fields in the
> > record.
> >   
> >> (jiffies >> msecs_to_jiffies(LEARNING_AGING_INTERVAL)) & ((1 <<
> >> AT_DENTRY_TIMESTAMP_WIDTH) - 1)
> >>  
> > 
> > If I understood you correctly - I shall remove the timer and then
> > just use the above line (based on jiffies) when
> > mtip_atable_dynamicms_learn_migration() is called (and it requires
> > the timestamp)?
> > 
> > Otherwise the mtip_timeincrement() seems like a nice wrapper on
> > incrementing the timestamp.  
> 
> Scheduling a timer to obtain a value you can have for free is not a
> good resource usage strategy. Note that is a pending question/check
> above: verify that the suggested expression yield the expected value
> in all the possible use-case.

This is a bit more tricky than just getting value from jiffies.

The current code provides a monotonic, starting from 0 time "base" for
learning and managing entries in internal routing tables for MTIP.

To be more specific - the fep->curr_time is a value incremented after
each ~10ms.

Simple masking of jiffies would not provide such features.

However, I've rewritten relevant portions where GENMASK() could be used
to simplify and make the code more readable.

> 
> >>> +	if (!fep->link[0] && !fep->link[1]) {
> >>> +		/* Link is down or autonegotiation is in
> >>> progress. */
> >>> +		netif_stop_queue(dev);
> >>> +		spin_unlock_irqrestore(&fep->hw_lock, flags);
> >>> +		return NETDEV_TX_BUSY;    
> >>
> >> Intead you should probably stop the queue when such events happen  
> > 
> > Please correct me if I'm wrong - the netif_stop_queue(dev); is
> > called before return. Shall something different be also done?  
> 
> The xmit routine should assume the link is up and the tx ring has
> enough free slot to enqueue a packet.

In the case of MTIP driver, there is a circular buffer of 16 "sets" of
descriptors (allocated as coherent) and corresponding buffer
(dma_map_single at start).

The size of each "buffer" is set to 2048B to accommodate at least single
packet.

> After enqueueing it should
> check for enough space availble for the next xmit and stop the queue,
> likely using the netif_txq_maybe_stop() helper.

The problem with not using the netif_txq_maybe_stop() is that I'm not
using the "txq" (netdev_queue).

With the current code it looks like netif_stop_queue() is the most
suitable one from the network API.

> 
> Documentation/networking/driver.rst
> 
> >>> +	}
> >>> +
> >>> +	/* Clear all of the status flags */
> >>> +	status &= ~BD_ENET_TX_STATS;
> >>> +
> >>> +	/* Set buffer length and buffer pointer */
> >>> +	bufaddr = skb->data;
> >>> +	bdp->cbd_datlen = skb->len;
> >>> +
> >>> +	/* On some FEC implementations data must be aligned on
> >>> +	 * 4-byte boundaries. Use bounce buffers to copy data
> >>> +	 * and get it aligned.
> >>> +	 */
> >>> +	if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {
> >>> +		unsigned int index;
> >>> +
> >>> +		index = bdp - fep->tx_bd_base;
> >>> +		memcpy(fep->tx_bounce[index],
> >>> +		       (void *)skb->data, skb->len);
> >>> +		bufaddr = fep->tx_bounce[index];
> >>> +	}
> >>> +
> >>> +	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> >>> +		swap_buffer(bufaddr, skb->len);    
> >>
> >> Ouch, the above will kill performances.  
> > 
> > This unfortunately must be done in such a way (the same approach is
> > present on fec_main.c) as the IP block is implemented in such a way
> > (explicit conversion from big endian to little endian).
> >   
> >> Also it looks like it will
> >> access uninitialized memory if skb->len is not 4 bytes aligned.
> >>  
> > 
> > There is a few lines above a special code to prevent from such a
> > situation ((unsigned long)bufaddr & MTIP_ALIGNMENT).  
> 
> The problem here is not with memory buffer alignment, but with the
> packet length, that can be not a multiple of 4. In such a case the
> last swap will do an out-of-bound read touching uninitialized data.

On the init function the size of allocation for each buffer is set to
be 2048 bytes, so there is no such a thread.

> 
> >>> +	bdp->cbd_sc = status;
> >>> +
> >>> +	netif_trans_update(dev);
> >>> +	skb_tx_timestamp(skb);
> >>> +
> >>> +	/* For port separation - force sending via specified port
> >>> */
> >>> +	if (!fep->br_offload && port != 0)
> >>> +		mtip_forced_forward(fep, port, 1);
> >>> +
> >>> +	/* Trigger transmission start */
> >>> +	writel(MCF_ESW_TDAR_X_DES_ACTIVE, fep->hwp + ESW_TDAR);
> >>>   
> >>
> >> Possibly you should check skb->xmit_more to avoid ringing the
> >> doorbell when not needed.  
> > 
> > I couldn't find skb->xmit_more in the current sources. Instead,
> > there is netdev_xmit_more().  
> 
> Yeah, I referred to the old code, sorry.
> 
> > However, the TX code just is supposed to setup one frame
> > transmission and hence there is no risk that we trigger "empty"
> > transmission.  
> 
> The point is that doorbell ringing is usually very expensive (slow)
> for the H/W. And is not needed when netdev_xmit_more() is true,
> because the another xmit operation will follow. If you care about
> performances you should leverage such info.

I do have an impression, that this is very important for network
devices having many queues with separate priorities.

In my case - I do have a single uDMA0 port with a single RX and TX
circular buffer (16 packets can be "queued").

> 
> >   
> >>> +	/* First, grab all of the stats for the incoming packet.
> >>> +	 * These get messed up if we get called due to a busy
> >>> condition.
> >>> +	 */
> >>> +	bdp = fep->cur_rx;
> >>> +
> >>> +	while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
> >>> +		if (pkt_received >= budget)
> >>> +			break;
> >>> +
> >>> +		pkt_received++;
> >>> +		/* Since we have allocated space to hold a
> >>> complete frame,
> >>> +		 * the last indicator should be set.
> >>> +		 */
> >>> +		if ((status & BD_ENET_RX_LAST) == 0)
> >>> +			dev_warn_ratelimited(&dev->dev,
> >>> +					     "SWITCH ENET: rcv is
> >>> not +last\n"); +
> >>> +		if (!fep->usage_count)
> >>> +			goto rx_processing_done;
> >>> +
> >>> +		/* Check for errors. */
> >>> +		if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH |
> >>> BD_ENET_RX_NO |
> >>> +			      BD_ENET_RX_CR | BD_ENET_RX_OV)) {
> >>> +			dev->stats.rx_errors++;
> >>> +			if (status & (BD_ENET_RX_LG |
> >>> BD_ENET_RX_SH)) {
> >>> +				/* Frame too long or too short.
> >>> */
> >>> +				dev->stats.rx_length_errors++;
> >>> +			}
> >>> +			if (status & BD_ENET_RX_NO)	/*
> >>> Frame alignment */
> >>> +				dev->stats.rx_frame_errors++;
> >>> +			if (status & BD_ENET_RX_CR)	/* CRC
> >>> Error */
> >>> +				dev->stats.rx_crc_errors++;
> >>> +			if (status & BD_ENET_RX_OV)	/*
> >>> FIFO overrun */
> >>> +				dev->stats.rx_fifo_errors++;
> >>> +		}
> >>> +
> >>> +		/* Report late collisions as a frame error.
> >>> +		 * On this error, the BD is closed, but we don't
> >>> know what we
> >>> +		 * have in the buffer.  So, just drop this frame
> >>> on the floor.
> >>> +		 */
> >>> +		if (status & BD_ENET_RX_CL) {
> >>> +			dev->stats.rx_errors++;
> >>> +			dev->stats.rx_frame_errors++;
> >>> +			goto rx_processing_done;
> >>> +		}
> >>> +
> >>> +		/* Process the incoming frame */
> >>> +		pkt_len = bdp->cbd_datlen;
> >>> +		data = (__u8 *)__va(bdp->cbd_bufaddr);
> >>> +
> >>> +		dma_unmap_single(&fep->pdev->dev,
> >>> bdp->cbd_bufaddr,
> >>> +				 bdp->cbd_datlen,
> >>> DMA_FROM_DEVICE);    
> >>
> >> I have read your explaination WRT unmap/map. Actually you don't
> >> need to do any mapping here,   
> > 
> > There are 16 cbd_t descriptors allocated (as dma_alloc_coherent).
> > Those descriptors contain pointer to data (being read in this
> > case).  
> 
> I'm referring to the actual packet payload, that is the buffer at
> bdp-cbd_bufaddr with len bdp->cbd_datlen; I'm not discussing the
> descriptors contents.

+1

> 
> > Hence the need to perform dma_map_single() for each descriptor,   
> 
> You are not unmapping the descriptor, you are unmapping the packet
> payload.

+1

> 
> >> since you are unconditionally copying the
> >> whole buffer (why???)  
> > 
> > Only the value of 
> > pkt_len = bdp->cbd_datlen; is copied to SKB (after byte
> > swap_buffer()).  
> 
> The relevant line is:
> 
> 		skb_copy_to_linear_data(skb, data, pkt_len);
> 
> AFAICS that copies whole packet contents, which is usually quite
> sub-optimal from performance PoV.
> 

fec_main.c just assigns:
data·=·skb->data;

so I would prefer to keep the:
skb_copy_to_linear_data(skb, data, pkt_len);

> >> and re-using it.
> >>
> >> Still you need a dma_sync_single() to ensure the CPUs see the
> >> correct data.  
> > 
> > The descriptors - i.e. struct cbd_t fields are allocated with
> > dma_alloc_coherent(), so this is OK.  
> 
> I'm talking about packets contents, not packet descriptors. Please
> re-read the above and have a look at other drivers code.

The usage of dma_sync_single_for_cpu() works without issues in the
mtip_switch_rx().

> 
> An additional point that I missed in the previous review is that the
> rx allocation schema is quite uncorrect. At ring initialization time
> you allocate full skbs, while what you need and use is just raw
> buffers for the packet payload. Instead you could/should use the page
> pool:
> 
> Documentation/networking/page_pool.rst
> 

Yes, for RX packets payload the page of 2048 bytes is allocated. By
using dma page pool - I can state the same maximal size, but the usage
of memory can be much more flexible.

> That will also help doing the right thing WRT DMA handling.
> 

The dma_sync_single_for_cpu() shall work correctly with the current
approach as well.

> >> This patch is really too big, I'm pretty sure I missed some
> >> relevant issues. You should split it in multiple ones: i.e.
> >> initialization and h/w access, rx/tx, others ndos.  
> > 
> > It is quite hard to "scatter" this patch as:
> > 
> > 1. I've already split it to several files (which correspond to
> > different "logical" entities - like mtipl2sw_br.c).
> > 2. The mtipl2sw.c file is the smallest part of the "core" of the
> > driver.
> > 3. If I split it, then at some point I would break bisectability for
> > imx28.  
> 
> Note that each patch don't need to provide complete functionality.
> i.e. patch 1 could implement ndo_open()/close and related helper,
> leaving ndo_start_xmit() and napi_poll empty and avoid allocating the
> rx buffers. patch 2 could implement the rx patch, patch 3 the tx path.
> 

Yes, this seems to be a good idea... I will implement such approach.

> The only constraint is that each patch will build successufully, which
> is usually easy to achieve.

+1

> 
> A 2K lines patches will probably lead to many more iterations and
> unhappy (or no) reviewers.

The problem is that all the patches "around" this driver (like *yaml,
bindings, defconfig) would get outdated very fast if not pull to
mainline.

In such a way that already done work would need to be redo...

> 
> /P
> 




Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-06-10  0:08 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-22  7:54 [net-next v12 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver Lukasz Majewski
2025-05-22  7:54 ` [net-next v12 1/7] dt-bindings: net: Add MTIP L2 switch description Lukasz Majewski
2025-06-05 13:51   ` Rob Herring (Arm)
2025-05-22  7:54 ` [net-next v12 2/7] ARM: dts: nxp: mxs: Adjust the imx28.dtsi " Lukasz Majewski
2025-05-22  7:54 ` [net-next v12 3/7] ARM: dts: nxp: mxs: Adjust XEA board's DTS to support L2 switch Lukasz Majewski
2025-05-22  7:54 ` [net-next v12 5/7] ARM: mxs_defconfig: Enable CONFIG_NFS_FSCACHE Lukasz Majewski
2025-05-22  7:54 ` [net-next v12 6/7] ARM: mxs_defconfig: Update mxs_defconfig to 6.15-rc1 Lukasz Majewski
2025-05-22  7:54 ` [net-next v12 7/7] ARM: mxs_defconfig: Enable CONFIG_FEC_MTIP_L2SW to support MTIP L2 switch Lukasz Majewski
     [not found] ` <20250522075455.1723560-5-lukma@denx.de>
2025-05-27 11:39   ` [net-next v12 4/7] net: mtip: The L2 switch driver for imx287 Paolo Abeni
2025-05-28 10:53     ` Lukasz Majewski
2025-05-29  8:43       ` Paolo Abeni
2025-06-09 23:02         ` Lukasz Majewski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).