[net-next v19 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [net-next v19 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver
@ 2025-08-24 22:07 Lukasz Majewski
       [not found] ` <20250824220736.1760482-5-lukasz.majewski@mailbox.org>
       [not found] ` <20250824220736.1760482-6-lukasz.majewski@mailbox.org>
  0 siblings, 2 replies; 12+ messages in thread
From: Lukasz Majewski @ 2025-08-24 22:07 UTC (permalink / raw)
  To: Andrew Lunn, davem, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Shawn Guo
  Cc: Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	Richard Cochran, netdev, devicetree, linux-kernel, imx,
	linux-arm-kernel, Stefan Wahren, Simon Horman, Lukasz Majewski

This patch series adds support for More Than IP's L2 switch driver embedded
in some NXP's SoCs. This one has been tested on imx287, but is also available
in the vf610.

In the past there has been performed some attempts to upstream this driver:

1. The 4.19-cip based one [1]
2. DSA based one for 5.12 [2] - i.e. the switch itself was treat as a DSA switch
   with NO tag appended.
3. The extension for FEC driver for 5.12 [3] - the trick here was to fully reuse
   FEC when the in-HW switching is disabled. When bridge offloading is enabled,
   the driver uses already configured MAC and PHY to also configure PHY.

All three approaches were not accepted as eligible for upstreaming.

The driver from this series has floowing features:

1. It is fully separated from fec_main - i.e. can be used interchangeable
   with it. To be more specific - one can build them as modules and
   if required switch between them when e.g. bridge offloading is required.

   To be more specific:
        - Use FEC_MAIN: When one needs support for two ETH ports with separate
          uDMAs used for both and bridging can be realized in SW.

        - Use MTIPL2SW: When it is enough to support two ports with only uDMA0
          attached to switch and bridging shall be offloaded to HW. 

2. This driver uses MTIP's L2 switch internal VLAN feature to provide port
   separation at boot time. Port separation is disabled when bridging is
   required.

3. Example usage:
        Configuration:
        ip link set lan0 up; sleep 1;
        ip link set lan1 up; sleep 1;
        ip link add name br0 type bridge;
        ip link set br0 up; sleep 1;
        ip link set lan0 master br0;
        ip link set lan1 master br0;
        bridge link;
        ip addr add 192.168.2.17/24 dev br0;
        ping -c 5 192.168.2.222

        Removal:
        ip link set br0 down;
        ip link delete br0 type bridge;
        ip link set dev lan1 down
        ip link set dev lan0 down

4. Limitations:
        - Driver enables and disables switch operation with learning and ageing.
        - Missing is the advanced configuration (e.g. adding entries to FBD). This is
          on purpose, as up till now we didn't had consensus about how the driver
          shall be added to Linux.

5. Clang build:
	make LLVM_SUFFIX=-19 LLVM=1 mrproper
	cp ./arch/arm/configs/mxs_defconfig .config
	make ARCH=arm LLVM_SUFFIX=-19 LLVM=1 W=1 menuconfig
	make ARCH=arm LLVM_SUFFIX=-19 LLVM=1 W=1 -j8 LOADADDR=0x40008000 uImage dtbs

        make LLVM_SUFFIX=-19 LLVM=1 mrproper
        make LLVM_SUFFIX=-19 LLVM=1 allmodconfig
        make LLVM_SUFFIX=-19 LLVM=1 W=1 drivers/net/ethernet/freescale/mtipsw/ | tee llvm_build.log
        make LLVM_SUFFIX=-19 LLVM=1 W=1 -j8 | tee llvm_build.log

6. Kernel compliance checks:
	make coccicheck MODE=report J=4 M=drivers/net/ethernet/freescale/mtipsw/
	~/work/src/smatch/smatch_scripts/kchecker drivers/net/ethernet/freescale/mtipsw/

7. GCC
        make mrproper
        make allmodconfig
        make W=1 drivers/net/ethernet/freescale/mtipsw/

Links:
[1] - https://github.com/lmajewski/linux-imx28-l2switch/commits/master
[2] - https://github.com/lmajewski/linux-imx28-l2switch/tree/imx28-v5.12-L2-upstream-RFC_v1
[3] - https://source.denx.de/linux/linux-imx28-l2switch/-/tree/imx28-v5.12-L2-upstream-switchdev-RFC_v1?ref_type=heads

Lukasz Majewski (7):
  dt-bindings: net: Add MTIP L2 switch description
  net: mtip: The L2 switch driver for imx287
  net: mtip: Add buffers management functions to the L2 switch driver
  net: mtip: Add net_device_ops functions to the L2 switch driver
  net: mtip: Add mtip_switch_{rx|tx} functions to the L2 switch driver
  net: mtip: Extend the L2 switch driver with management operations
  net: mtip: Extend the L2 switch driver for imx287 with bridge
    operations

 .../bindings/net/nxp,imx28-mtip-switch.yaml   |  150 ++
 MAINTAINERS                                   |    7 +
 drivers/net/ethernet/freescale/Kconfig        |    1 +
 drivers/net/ethernet/freescale/Makefile       |    1 +
 drivers/net/ethernet/freescale/mtipsw/Kconfig |   13 +
 .../net/ethernet/freescale/mtipsw/Makefile    |    4 +
 .../net/ethernet/freescale/mtipsw/mtipl2sw.c  | 1984 +++++++++++++++++
 .../net/ethernet/freescale/mtipsw/mtipl2sw.h  |  651 ++++++
 .../ethernet/freescale/mtipsw/mtipl2sw_br.c   |  132 ++
 .../ethernet/freescale/mtipsw/mtipl2sw_mgnt.c |  443 ++++
 10 files changed, 3386 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/nxp,imx28-mtip-switch.yaml
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/Kconfig
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/Makefile
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/mtipl2sw.c
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/mtipl2sw.h
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/mtipl2sw_br.c
 create mode 100644 drivers/net/ethernet/freescale/mtipsw/mtipl2sw_mgnt.c

-- 
2.39.5



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 4/7] net: mtip: Add net_device_ops functions to the L2 switch driver
       [not found] ` <20250824220736.1760482-5-lukasz.majewski@mailbox.org>
@ 2025-08-27 15:25   ` Jakub Kicinski
  2025-09-07 16:38     ` Łukasz Majewski
  2025-08-27 18:16   ` Jakub Kicinski
  1 sibling, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2025-08-27 15:25 UTC (permalink / raw)
  To: Lukasz Majewski
  Cc: Andrew Lunn, davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman

On Mon, 25 Aug 2025 00:07:33 +0200 Lukasz Majewski wrote:
> +	/* Set buffer length and buffer pointer */
> +	bufaddr = skb->data;

You can't write (swap) skb->data if the skb is a clone..

> +	bdp->cbd_datlen = skb->len;
> +
> +	/* On some FEC implementations data must be aligned on
> +	 * 4-byte boundaries. Use bounce buffers to copy data
> +	 * and get it aligned.spin
> +	 */
> +	if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {

add 
	.. ||
	(fep->quirks & FEC_QUIRK_SWAP_FRAME && skb_cloned(skb))

here to switch to the local buffer for clones ?

> +		unsigned int index;
> +
> +		index = bdp - fep->tx_bd_base;
> +		memcpy(fep->tx_bounce[index], skb->data, skb->len);
> +		bufaddr = fep->tx_bounce[index];
> +	}
> +
> +	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> +		swap_buffer(bufaddr, skb->len);

> +	if (unlikely(dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr))) {
> +		dev_err(&fep->pdev->dev,
> +			"Failed to map descriptor tx buffer\n");

All per-packet prints must be rate limited

> +		/* Since we have freed up a buffer, the ring is no longer
> +		 * full.
> +		 */
> +		if (fep->tx_full) {
> +			fep->tx_full = 0;
> +			if (netif_queue_stopped(dev))
> +				netif_wake_queue(dev);
> +		}

I must say I'm still quite confused by the netdev management in this
driver. You seem to have 2 netdevs, one per port. There's one
set of queues and one NAPI. Whichever netdev gets up first gets the
NAPI. What makes my head spin is that you seem to record which
netdev/port was doing Rx _last_ and then pass that netdev to
mtip_switch_tx(). Why? Isn't the dev that we're completing Tx for is
best read from skb->dev packet by packet? Also this wake up logic looks
like it will wake up _one_ netdev's queue and then set tx_full = 0, so
presumably it will not wake the other port if both ports queues were
stopped. Why keep tx_full state in the first place? Just check if the
queues is stopped..?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 5/7] net: mtip: Add mtip_switch_{rx|tx} functions to the L2 switch driver
       [not found] ` <20250824220736.1760482-6-lukasz.majewski@mailbox.org>
@ 2025-08-27 15:25   ` Jakub Kicinski
  2025-09-07 16:48     ` Łukasz Majewski
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2025-08-27 15:25 UTC (permalink / raw)
  To: Lukasz Majewski
  Cc: Andrew Lunn, davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman

On Mon, 25 Aug 2025 00:07:34 +0200 Lukasz Majewski wrote:
>  static void mtip_switch_tx(struct net_device *dev)
>  {
> +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> +	struct switch_enet_private *fep = priv->fep;
> +	unsigned short status;
> +	struct sk_buff *skb;
> +	struct cbd_t *bdp;

> +		} else {
> +			dev->stats.tx_packets++;
> +		}
> +
> +		if (status & BD_ENET_TX_READY)
> +			dev_err(&fep->pdev->dev,
> +				"Enet xmit interrupt and TX_READY.\n");

per-pkt print, needs rl

> +		/* Free the sk buffer associated with this last transmit */
> +		dev_consume_skb_irq(skb);

why _irq()? this now runs from NAPI, so it's in BH. Just stick 
to dev_comsume_skb_any(), it's the safest choice..


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 4/7] net: mtip: Add net_device_ops functions to the L2 switch driver
       [not found] ` <20250824220736.1760482-5-lukasz.majewski@mailbox.org>
  2025-08-27 15:25   ` [net-next v19 4/7] net: mtip: Add net_device_ops functions to the " Jakub Kicinski
@ 2025-08-27 18:16   ` Jakub Kicinski
  2025-09-07 16:52     ` Łukasz Majewski
  1 sibling, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2025-08-27 18:16 UTC (permalink / raw)
  To: Lukasz Majewski
  Cc: Andrew Lunn, davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman

On Mon, 25 Aug 2025 00:07:33 +0200 Lukasz Majewski wrote:
> +
> +	/* On some FEC implementations data must be aligned on
> +	 * 4-byte boundaries. Use bounce buffers to copy data
> +	 * and get it aligned.spin
> +	 */

Almost forgot, the word "spin" appears here rather out of place.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 4/7] net: mtip: Add net_device_ops functions to the L2 switch driver
  2025-08-27 15:25   ` [net-next v19 4/7] net: mtip: Add net_device_ops functions to the " Jakub Kicinski
@ 2025-09-07 16:38     ` Łukasz Majewski
  2025-09-09  1:05       ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Łukasz Majewski @ 2025-09-07 16:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman

Hi Jakub,

Sorry for late reply.

> On Mon, 25 Aug 2025 00:07:33 +0200 Lukasz Majewski wrote:
> > +	/* Set buffer length and buffer pointer */
> > +	bufaddr = skb->data;  
> 
> You can't write (swap) skb->data if the skb is a clone..

I do use skb = buld_skb() which, "builds" the SKB around the memory
page (from pool).

Then, I "pass" this data (and swap it) to upper layer of the network
stack.

The same approach is used in the fec_main.c driver:
https://elixir.bootlin.com/linux/v6.17-rc3/source/drivers/net/ethernet/freescale/fec_main.c#L1853

> 
> > +	bdp->cbd_datlen = skb->len;
> > +
> > +	/* On some FEC implementations data must be aligned on
> > +	 * 4-byte boundaries. Use bounce buffers to copy data
> > +	 * and get it aligned.spin
> > +	 */
> > +	if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {  
> 
> add 
> 	.. ||
> 	(fep->quirks & FEC_QUIRK_SWAP_FRAME && skb_cloned(skb))
> 
> here to switch to the local buffer for clones ?

Please see the above comment.

> 
> > +		unsigned int index;
> > +
> > +		index = bdp - fep->tx_bd_base;
> > +		memcpy(fep->tx_bounce[index], skb->data, skb->len);
> > +		bufaddr = fep->tx_bounce[index];
> > +	}
> > +
> > +	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> > +		swap_buffer(bufaddr, skb->len);  
> 
> > +	if (unlikely(dma_mapping_error(&fep->pdev->dev,
> > bdp->cbd_bufaddr))) {
> > +		dev_err(&fep->pdev->dev,
> > +			"Failed to map descriptor tx buffer\n");  
> 
> All per-packet prints must be rate limited

Ok. I will update it globally.

> 
> > +		/* Since we have freed up a buffer, the ring is no
> > longer
> > +		 * full.
> > +		 */
> > +		if (fep->tx_full) {
> > +			fep->tx_full = 0;
> > +			if (netif_queue_stopped(dev))
> > +				netif_wake_queue(dev);
> > +		}  
> 
> I must say I'm still quite confused by the netdev management in this
> driver. You seem to have 2 netdevs, one per port.

Yes.

> There's one
> set of queues and one NAPI.

Yes.

> Whichever netdev gets up first gets the
> NAPI.

Yes.

What I'm trying to do - is to model the HW which I do have...

When switch is enabled I do have ONE uDMA0 which works for both eth
ports (lan0 and lan1).

That is why I do have only one NAPI queue.

> What makes my head spin is that you seem to record which
> netdev/port was doing Rx _last_ and then pass that netdev to
> mtip_switch_tx(). Why?

You may have port == 1 || port == 2 when you receive packet from ingres
ports.
You may also have port == 0xFF when you first time encounter the SA on
the port and port == 0 when you send/receive data from the "host"
interface.

When port 1/2 is "detected" then the net dev for this particular port
is used. In other cases the one for NAPI is used (which is one of those
two - please see comment above).

This was the approach from original NXP (Freescale) driver. It in some
way prevents from "starvation" from net devices when L2 switch is
disabled and I need to provide port separation.

(port separation in fact is achieved by programming L2 switch registers
and is realized in HW).

> Isn't the dev that we're completing Tx for is
> best read from skb->dev packet by packet?

It may be worth to try.... I think that the code, which we do have now,
tries to reuse some kind of "locality".

> Also this wake up logic
> looks like it will wake up _one_ netdev's queue and then set tx_full
> = 0, so presumably it will not wake the other port if both ports
> queues were stopped. Why keep tx_full state in the first place? Just
> check if the queues is stopped..?

As I said - we do have only ONE queue, which corresponds to uDMA0 when
the switch is enabled. This single queue is responsible for handling
transmission for both ports (this is how the HW is designed).

-- 
Best regards,

Łukasz Majewski

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 5/7] net: mtip: Add mtip_switch_{rx|tx} functions to the L2 switch driver
  2025-08-27 15:25   ` [net-next v19 5/7] net: mtip: Add mtip_switch_{rx|tx} " Jakub Kicinski
@ 2025-09-07 16:48     ` Łukasz Majewski
  0 siblings, 0 replies; 12+ messages in thread
From: Łukasz Majewski @ 2025-09-07 16:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman

Hi Jakub,

> On Mon, 25 Aug 2025 00:07:34 +0200 Lukasz Majewski wrote:
> >  static void mtip_switch_tx(struct net_device *dev)
> >  {
> > +	struct mtip_ndev_priv *priv = netdev_priv(dev);
> > +	struct switch_enet_private *fep = priv->fep;
> > +	unsigned short status;
> > +	struct sk_buff *skb;
> > +	struct cbd_t *bdp;  
> 
> > +		} else {
> > +			dev->stats.tx_packets++;
> > +		}
> > +
> > +		if (status & BD_ENET_TX_READY)
> > +			dev_err(&fep->pdev->dev,
> > +				"Enet xmit interrupt and
> > TX_READY.\n");  
> 
> per-pkt print, needs rl

+1

> 
> > +		/* Free the sk buffer associated with this last
> > transmit */
> > +		dev_consume_skb_irq(skb);  
> 
> why _irq()? this now runs from NAPI, so it's in BH. Just stick 
> to dev_comsume_skb_any(), it's the safest choice..

Ok, I will change it.

-- 
Best regards,

Łukasz Majewski


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 4/7] net: mtip: Add net_device_ops functions to the L2 switch driver
  2025-08-27 18:16   ` Jakub Kicinski
@ 2025-09-07 16:52     ` Łukasz Majewski
  0 siblings, 0 replies; 12+ messages in thread
From: Łukasz Majewski @ 2025-09-07 16:52 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman

Hi Jakub,

> On Mon, 25 Aug 2025 00:07:33 +0200 Lukasz Majewski wrote:
> > +
> > +	/* On some FEC implementations data must be aligned on
> > +	 * 4-byte boundaries. Use bounce buffers to copy data
> > +	 * and get it aligned.spin
> > +	 */  
> 
> Almost forgot, the word "spin" appears here rather out of place.

Thanks for pointing this out.

-- 
Best regards,

Łukasz Majewski


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 4/7] net: mtip: Add net_device_ops functions to the L2 switch driver
  2025-09-07 16:38     ` Łukasz Majewski
@ 2025-09-09  1:05       ` Jakub Kicinski
  2025-09-10 21:15         ` Łukasz Majewski
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2025-09-09  1:05 UTC (permalink / raw)
  To: Łukasz Majewski
  Cc: Andrew Lunn, davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman

On Sun, 7 Sep 2025 18:38:54 +0200 Łukasz Majewski wrote:
> > On Mon, 25 Aug 2025 00:07:33 +0200 Lukasz Majewski wrote:  
> > > +	/* Set buffer length and buffer pointer */
> > > +	bufaddr = skb->data;    
> > 
> > You can't write (swap) skb->data if the skb is a clone..  
> 
> I do use skb = buld_skb() which, "builds" the SKB around the memory
> page (from pool).
> 
> Then, I "pass" this data (and swap it) to upper layer of the network
> stack.
> 
> The same approach is used in the fec_main.c driver:
> https://elixir.bootlin.com/linux/v6.17-rc3/source/drivers/net/ethernet/freescale/fec_main.c#L1853

I probably cut out too much context. I think I was quoting from Tx,
indeed on Rx this is not an issue.

> What I'm trying to do - is to model the HW which I do have...
> 
> When switch is enabled I do have ONE uDMA0 which works for both eth
> ports (lan0 and lan1).
> 
> That is why I do have only one NAPI queue.
> 
> > What makes my head spin is that you seem to record which
> > netdev/port was doing Rx _last_ and then pass that netdev to
> > mtip_switch_tx(). Why?  
> 
> You may have port == 1 || port == 2 when you receive packet from ingres
> ports.
> You may also have port == 0xFF when you first time encounter the SA on
> the port and port == 0 when you send/receive data from the "host"
> interface.
> 
> When port 1/2 is "detected" then the net dev for this particular port
> is used. In other cases the one for NAPI is used (which is one of those
> two - please see comment above).
> 
> This was the approach from original NXP (Freescale) driver. It in some
> way prevents from "starvation" from net devices when L2 switch is
> disabled and I need to provide port separation.
> 
> (port separation in fact is achieved by programming L2 switch registers
> and is realized in HW).

But what if we have mixed traffic from port 1 and 2?
Does the current code correctly Rx the packets from port 1 on the
netdev from port 1 and packets from port 2 on the netdev from port 2?

> > Isn't the dev that we're completing Tx for is
> > best read from skb->dev packet by packet?  
> 
> It may be worth to try.... I think that the code, which we do have now,
> tries to reuse some kind of "locality".
> 
> > Also this wake up logic
> > looks like it will wake up _one_ netdev's queue and then set tx_full
> > = 0, so presumably it will not wake the other port if both ports
> > queues were stopped. Why keep tx_full state in the first place? Just
> > check if the queues is stopped..?  
> 
> As I said - we do have only ONE queue, which corresponds to uDMA0 when
> the switch is enabled. This single queue is responsible for handling
> transmission for both ports (this is how the HW is designed).

Right but kernel has two SW queues. Which can be independently stopped.
So my concerns is that for example port 1 is very busy so the queue is
full of packets for port 1, port 1's netdev's queue gets stopped.
Then port 2 tries to Tx, queue is shared, and is full, so netdev 2's
SW queue is also stopped. Then we complete the packets, because packets
were for port 1 we wake the queue for port 1. But port 2 also got
stopped, even tho it never put a packet on the ring..

Long story short I think you need to always stop and start queues from
both netdevs.. There's just 2 so not too bad of a hack.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 4/7] net: mtip: Add net_device_ops functions to the L2 switch driver
  2025-09-09  1:05       ` Jakub Kicinski
@ 2025-09-10 21:15         ` Łukasz Majewski
  2025-09-11  0:22           ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Łukasz Majewski @ 2025-09-10 21:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman

Hi Jakub,

> On Sun, 7 Sep 2025 18:38:54 +0200 Łukasz Majewski wrote:
> > > On Mon, 25 Aug 2025 00:07:33 +0200 Lukasz Majewski wrote:    
> > > > +	/* Set buffer length and buffer pointer */
> > > > +	bufaddr = skb->data;      
> > > 
> > > You can't write (swap) skb->data if the skb is a clone..    
> > 
> > I do use skb = buld_skb() which, "builds" the SKB around the memory
> > page (from pool).
> > 
> > Then, I "pass" this data (and swap it) to upper layer of the network
> > stack.
> > 
> > The same approach is used in the fec_main.c driver:
> > https://elixir.bootlin.com/linux/v6.17-rc3/source/drivers/net/ethernet/freescale/fec_main.c#L1853
> >  
> 
> I probably cut out too much context. I think I was quoting from Tx,
> indeed on Rx this is not an issue.

Ok. No adjustments needed then. Good :)

> 
> > What I'm trying to do - is to model the HW which I do have...
> > 
> > When switch is enabled I do have ONE uDMA0 which works for both eth
> > ports (lan0 and lan1).
> > 
> > That is why I do have only one NAPI queue.
> >   
> > > What makes my head spin is that you seem to record which
> > > netdev/port was doing Rx _last_ and then pass that netdev to
> > > mtip_switch_tx(). Why?    
> > 
> > You may have port == 1 || port == 2 when you receive packet from
> > ingres ports.
> > You may also have port == 0xFF when you first time encounter the SA
> > on the port and port == 0 when you send/receive data from the "host"
> > interface.
> > 
> > When port 1/2 is "detected" then the net dev for this particular
> > port is used. In other cases the one for NAPI is used (which is one
> > of those two - please see comment above).
> > 
> > This was the approach from original NXP (Freescale) driver. It in
> > some way prevents from "starvation" from net devices when L2 switch
> > is disabled and I need to provide port separation.
> > 
> > (port separation in fact is achieved by programming L2 switch
> > registers and is realized in HW).  
> 
> But what if we have mixed traffic from port 1 and 2?
> Does the current code correctly Rx the packets from port 1 on the
> netdev from port 1 and packets from port 2 on the netdev from port 2?

Yes, it does.

In the mtip_rx_napi() you have call to mtip_switch_rx() which accepts
pointer to port variable.

It sets it according to the information provided by the switch IP block
HW and also adjust the skb's ndev.

I'm just wondering if the snippet from mtip_rx_napi():
-------8<--------
if ((port == 1 || port == 2) && fep->ndev[port - 1]
	mtip_switch_tx(fep->ndev[port - 1]);
else
	mtip_switch_tx(napi->dev);
------->8--------

could be replaced just with mtip_switch_tx(napi->dev);
as TX via napi->dev shall be forward to both ports if required.

I will check if this can be done in such a way.

> 
> > > Isn't the dev that we're completing Tx for is
> > > best read from skb->dev packet by packet?    
> > 
> > It may be worth to try.... I think that the code, which we do have
> > now, tries to reuse some kind of "locality".
> >   
> > > Also this wake up logic
> > > looks like it will wake up _one_ netdev's queue and then set
> > > tx_full = 0, so presumably it will not wake the other port if
> > > both ports queues were stopped. Why keep tx_full state in the
> > > first place? Just check if the queues is stopped..?    
> > 
> > As I said - we do have only ONE queue, which corresponds to uDMA0
> > when the switch is enabled. This single queue is responsible for
> > handling transmission for both ports (this is how the HW is
> > designed).  
> 
> Right but kernel has two SW queues.

You mean a separate SW queues for each devices? This is not supported
in the MTIP L2 switch driver. Maybe such high level SW queues
management is available in the upper layers?

> Which can be independently
> stopped.

It supports separate RX and TX HW queues (i.e. ring buffers for
descriptors) for the uDMA0 when switch is enabled.

When you want to send data (no matter from which lan[01] device) the
same mtip_start_xmit() is called, the HW TX descriptor is setup and is
passed via uDMA0 to L2 switch IP block.

For next TX transmission (even from different port) we assign another
descriptor from the ring buffer.

> So my concerns is that for example port 1 is very busy so
> the queue is full of packets for port 1, port 1's netdev's queue gets
> stopped. Then port 2 tries to Tx, queue is shared, and is full, so
> netdev 2's SW queue is also stopped. Then we complete the packets,
> because packets were for port 1 we wake the queue for port 1. But
> port 2 also got stopped, even tho it never put a packet on the ring..
> 

As fair as I can tell - both ports call mtip_start_xmit(), their data
is serialized to the TX queue (via descriptors).

Queued descriptors are sent always at some point (or overridden if
critical error encountered).

> Long story short I think you need to always stop and start queues from
> both netdevs.. There's just 2 so not too bad of a hack.

Maybe there are some peculiarities in for example bridge code (or upper
network stack layers in general), but I think, that I don't need any
extra "queue" management for TX code of MTIP L2 switch.

-- 
Best regards,

Łukasz Majewski


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 4/7] net: mtip: Add net_device_ops functions to the L2 switch driver
  2025-09-10 21:15         ` Łukasz Majewski
@ 2025-09-11  0:22           ` Jakub Kicinski
  2025-09-11 21:55             ` Łukasz Majewski
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2025-09-11  0:22 UTC (permalink / raw)
  To: Łukasz Majewski
  Cc: Andrew Lunn, davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman

On Wed, 10 Sep 2025 23:15:52 +0200 Łukasz Majewski wrote:
> > > I do use skb = buld_skb() which, "builds" the SKB around the memory
> > > page (from pool).
> > > 
> > > Then, I "pass" this data (and swap it) to upper layer of the network
> > > stack.
> > > 
> > > The same approach is used in the fec_main.c driver:
> > > https://elixir.bootlin.com/linux/v6.17-rc3/source/drivers/net/ethernet/freescale/fec_main.c#L1853
> > >    
> > 
> > I probably cut out too much context. I think I was quoting from Tx,
> > indeed on Rx this is not an issue.  
> 
> Ok. No adjustments needed then. Good :)

No, you were talking about build_skb() which is Rx.
This is the patch that adds Tx. Tx is wrong.
You can't modify the skb unless you call skb_cow().
Or just copy the data out to your local buffer.

> > > You may have port == 1 || port == 2 when you receive packet from
> > > ingres ports.
> > > You may also have port == 0xFF when you first time encounter the SA
> > > on the port and port == 0 when you send/receive data from the "host"
> > > interface.
> > > 
> > > When port 1/2 is "detected" then the net dev for this particular
> > > port is used. In other cases the one for NAPI is used (which is one
> > > of those two - please see comment above).
> > > 
> > > This was the approach from original NXP (Freescale) driver. It in
> > > some way prevents from "starvation" from net devices when L2 switch
> > > is disabled and I need to provide port separation.
> > > 
> > > (port separation in fact is achieved by programming L2 switch
> > > registers and is realized in HW).    
> > 
> > But what if we have mixed traffic from port 1 and 2?
> > Does the current code correctly Rx the packets from port 1 on the
> > netdev from port 1 and packets from port 2 on the netdev from port 2?  
> 
> Yes, it does.
> 
> In the mtip_rx_napi() you have call to mtip_switch_rx() which accepts
> pointer to port variable.
> 
> It sets it according to the information provided by the switch IP block
> HW and also adjust the skb's ndev.
> 
> I'm just wondering if the snippet from mtip_rx_napi():
> -------8<--------
> if ((port == 1 || port == 2) && fep->ndev[port - 1]
> 	mtip_switch_tx(fep->ndev[port - 1]);
> else
> 	mtip_switch_tx(napi->dev);
> ------->8--------  
> 
> could be replaced just with mtip_switch_tx(napi->dev);
> as TX via napi->dev shall be forward to both ports if required.
> 
> I will check if this can be done in such a way.

Not napi->dev. You have to attribute sent packets to the right netdev.

> > > As I said - we do have only ONE queue, which corresponds to uDMA0
> > > when the switch is enabled. This single queue is responsible for
> > > handling transmission for both ports (this is how the HW is
> > > designed).    
> > 
> > Right but kernel has two SW queues.  
> 
> You mean a separate SW queues for each devices? This is not supported
> in the MTIP L2 switch driver. Maybe such high level SW queues
> management is available in the upper layers?

Not possible, each netdev has it's own private qdisc tree.

> > Which can be independently
> > stopped.  
> 
> It supports separate RX and TX HW queues (i.e. ring buffers for
> descriptors) for the uDMA0 when switch is enabled.
> 
> When you want to send data (no matter from which lan[01] device) the
> same mtip_start_xmit() is called, the HW TX descriptor is setup and is
> passed via uDMA0 to L2 switch IP block.
> 
> For next TX transmission (even from different port) we assign another
> descriptor from the ring buffer.
> 
> > So my concerns is that for example port 1 is very busy so
> > the queue is full of packets for port 1, port 1's netdev's queue gets
> > stopped. Then port 2 tries to Tx, queue is shared, and is full, so
> > netdev 2's SW queue is also stopped. Then we complete the packets,
> > because packets were for port 1 we wake the queue for port 1. But
> > port 2 also got stopped, even tho it never put a packet on the ring..
> >   
> 
> As fair as I can tell - both ports call mtip_start_xmit(), their data
> is serialized to the TX queue (via descriptors).
> 
> Queued descriptors are sent always at some point (or overridden if
> critical error encountered).
> 
> > Long story short I think you need to always stop and start queues from
> > both netdevs.. There's just 2 so not too bad of a hack.  
> 
> Maybe there are some peculiarities in for example bridge code (or upper
> network stack layers in general), but I think, that I don't need any
> extra "queue" management for TX code of MTIP L2 switch.

I think I explained this enough times. Next version is v20.
If it's not significantly better than this one, I'm going to have 
to ask you to stop posting this driver.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 4/7] net: mtip: Add net_device_ops functions to the L2 switch driver
  2025-09-11  0:22           ` Jakub Kicinski
@ 2025-09-11 21:55             ` Łukasz Majewski
  2025-09-12  0:17               ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Łukasz Majewski @ 2025-09-11 21:55 UTC (permalink / raw)
  To: Jakub Kicinski, Andrew Lunn
  Cc: davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman, Vladimir Oltean

Hi Jakub,

> On Wed, 10 Sep 2025 23:15:52 +0200 Łukasz Majewski wrote:
> > > > I do use skb = buld_skb() which, "builds" the SKB around the
> > > > memory page (from pool).
> > > > 
> > > > Then, I "pass" this data (and swap it) to upper layer of the
> > > > network stack.
> > > > 
> > > > The same approach is used in the fec_main.c driver:
> > > > https://elixir.bootlin.com/linux/v6.17-rc3/source/drivers/net/ethernet/freescale/fec_main.c#L1853
> > > >      
> > > 
> > > I probably cut out too much context. I think I was quoting from
> > > Tx, indeed on Rx this is not an issue.    
> > 
> > Ok. No adjustments needed then. Good :)  
> 
> No, you were talking about build_skb() which is Rx.
> This is the patch that adds Tx. Tx is wrong.

The same approach is taken in fec_main.c (@ fec_enet_txq_submit_skb()
function).

> You can't modify the skb unless you call skb_cow().
> Or just copy the data out to your local buffer.

In the mtip_start_xmit_port() I do assign the address to skb->data to
bufaddr pointer.

If alignment is wrong the we copy it to bounce buffer.
Then we do swap the buffer if needed.

Last step is to dma map this memory and assign the pointer to the
descriptor for L2 switch transmission.

> 
> > > > You may have port == 1 || port == 2 when you receive packet from
> > > > ingres ports.
> > > > You may also have port == 0xFF when you first time encounter
> > > > the SA on the port and port == 0 when you send/receive data
> > > > from the "host" interface.
> > > > 
> > > > When port 1/2 is "detected" then the net dev for this particular
> > > > port is used. In other cases the one for NAPI is used (which is
> > > > one of those two - please see comment above).
> > > > 
> > > > This was the approach from original NXP (Freescale) driver. It
> > > > in some way prevents from "starvation" from net devices when L2
> > > > switch is disabled and I need to provide port separation.
> > > > 
> > > > (port separation in fact is achieved by programming L2 switch
> > > > registers and is realized in HW).      
> > > 
> > > But what if we have mixed traffic from port 1 and 2?
> > > Does the current code correctly Rx the packets from port 1 on the
> > > netdev from port 1 and packets from port 2 on the netdev from
> > > port 2?    
> > 
> > Yes, it does.
> > 
> > In the mtip_rx_napi() you have call to mtip_switch_rx() which
> > accepts pointer to port variable.
> > 
> > It sets it according to the information provided by the switch IP
> > block HW and also adjust the skb's ndev.
> > 
> > I'm just wondering if the snippet from mtip_rx_napi():
> > -------8<--------
> > if ((port == 1 || port == 2) && fep->ndev[port - 1]
> > 	mtip_switch_tx(fep->ndev[port - 1]);
> > else
> > 	mtip_switch_tx(napi->dev);  
> > ------->8--------    
> > 
> > could be replaced just with mtip_switch_tx(napi->dev);
> > as TX via napi->dev shall be forward to both ports if required.
> > 
> > I will check if this can be done in such a way.  
> 
> Not napi->dev. You have to attribute sent packets to the right netdev.

And then we do have some issue to solve. To be more specific -
fec_main.c to avoid starvation just from fec_enet_rx_napi() calls
fec_enet_tx() with only one net device (which it supports).

I wanted to mimic such behaviour with L2 switch driver (at
mtip_rx_napi()), but then the question - which network device (from
available two) shall be assigned?

The net device passed to mtip_switch_tx() is only relevant for
"housekeeping/statistical data" as in fact we just provide another
descriptor to the HW to be sent.

Maybe I shall extract the net device pointer from the skb structure?

> 
> > > > As I said - we do have only ONE queue, which corresponds to
> > > > uDMA0 when the switch is enabled. This single queue is
> > > > responsible for handling transmission for both ports (this is
> > > > how the HW is designed).      
> > > 
> > > Right but kernel has two SW queues.    
> > 
> > You mean a separate SW queues for each devices? This is not
> > supported in the MTIP L2 switch driver. Maybe such high level SW
> > queues management is available in the upper layers?  
> 
> Not possible, each netdev has it's own private qdisc tree.

Please correct me if I'm wrong, but aren't packets from those queues
end up with calling ->ndo_start_xmit() function?

> 
> > > Which can be independently
> > > stopped.    
> > 
> > It supports separate RX and TX HW queues (i.e. ring buffers for
> > descriptors) for the uDMA0 when switch is enabled.
> > 
> > When you want to send data (no matter from which lan[01] device) the
> > same mtip_start_xmit() is called, the HW TX descriptor is setup and
> > is passed via uDMA0 to L2 switch IP block.
> > 
> > For next TX transmission (even from different port) we assign
> > another descriptor from the ring buffer.
> >   
> > > So my concerns is that for example port 1 is very busy so
> > > the queue is full of packets for port 1, port 1's netdev's queue
> > > gets stopped. Then port 2 tries to Tx, queue is shared, and is
> > > full, so netdev 2's SW queue is also stopped. Then we complete
> > > the packets, because packets were for port 1 we wake the queue
> > > for port 1. But port 2 also got stopped, even tho it never put a
> > > packet on the ring.. 
> > 
> > As fair as I can tell - both ports call mtip_start_xmit(), their
> > data is serialized to the TX queue (via descriptors).
> > 
> > Queued descriptors are sent always at some point (or overridden if
> > critical error encountered).
> >   
> > > Long story short I think you need to always stop and start queues
> > > from both netdevs.. There's just 2 so not too bad of a hack.    
> > 
> > Maybe there are some peculiarities in for example bridge code (or
> > upper network stack layers in general), but I think, that I don't
> > need any extra "queue" management for TX code of MTIP L2 switch.  
> 
> I think I explained this enough times. Next version is v20.
> If it's not significantly better than this one, I'm going to have 
> to ask you to stop posting this driver.

I don't know how to reply to this comment, really. 

I've spent many hours of my spare time to upstream this driver.
I'm just disappointed (and maybe I will not say more because of high
level of my frustration).




Could you point me to the driver example which provides such queues
management for switchdev driver? Just to show what you expect from me.

One example.


-- 
Best regards,

Łukasz Majewski


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next v19 4/7] net: mtip: Add net_device_ops functions to the L2 switch driver
  2025-09-11 21:55             ` Łukasz Majewski
@ 2025-09-12  0:17               ` Jakub Kicinski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2025-09-12  0:17 UTC (permalink / raw)
  To: Łukasz Majewski
  Cc: Andrew Lunn, davem, Eric Dumazet, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Richard Cochran, netdev,
	devicetree, linux-kernel, imx, linux-arm-kernel, Stefan Wahren,
	Simon Horman, Vladimir Oltean

On Thu, 11 Sep 2025 23:55:47 +0200 Łukasz Majewski wrote:
> > > Ok. No adjustments needed then. Good :)    
> > 
> > No, you were talking about build_skb() which is Rx.
> > This is the patch that adds Tx. Tx is wrong.  
> 
> The same approach is taken in fec_main.c (@ fec_enet_txq_submit_skb()
> function).

FWIW I'm 99% sure we were once investigating a bug in FEC related to
modifying timestamped packets, leading to crashes. Maybe there is more.

> > > could be replaced just with mtip_switch_tx(napi->dev);
> > > as TX via napi->dev shall be forward to both ports if required.
> > > 
> > > I will check if this can be done in such a way.    
> > 
> > Not napi->dev. You have to attribute sent packets to the right netdev.  
> 
> And then we do have some issue to solve. To be more specific -
> fec_main.c to avoid starvation just from fec_enet_rx_napi() calls
> fec_enet_tx() with only one net device (which it supports).
> 
> I wanted to mimic such behaviour with L2 switch driver (at
> mtip_rx_napi()), but then the question - which network device (from
> available two) shall be assigned?
> 
> The net device passed to mtip_switch_tx() is only relevant for
> "housekeeping/statistical data" as in fact we just provide another
> descriptor to the HW to be sent.
> 
> Maybe I shall extract the net device pointer from the skb structure?

Exactly :)

> > > You mean a separate SW queues for each devices? This is not
> > > supported in the MTIP L2 switch driver. Maybe such high level SW
> > > queues management is available in the upper layers?    
> > 
> > Not possible, each netdev has it's own private qdisc tree.  
> 
> Please correct me if I'm wrong, but aren't packets from those queues
> end up with calling ->ndo_start_xmit() function?

Right. I think I'm lost, why does this matter?

> > I think I explained this enough times. Next version is v20.
> > If it's not significantly better than this one, I'm going to have 
> > to ask you to stop posting this driver.  
> 
> I don't know how to reply to this comment, really. 
> 
> I've spent many hours of my spare time to upstream this driver.
> I'm just disappointed (and maybe I will not say more because of high
> level of my frustration).

I believe mlxsw has fewer DMA queues than ports. But TBH I'm not sure
how they handle the congestion. In your case since you only have two
ports (at most) I think you can trivially just always stop and start
both.


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-09-12  0:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-24 22:07 [net-next v19 0/7] net: mtip: Add support for MTIP imx287 L2 switch driver Lukasz Majewski
     [not found] ` <20250824220736.1760482-5-lukasz.majewski@mailbox.org>
2025-08-27 15:25   ` [net-next v19 4/7] net: mtip: Add net_device_ops functions to the " Jakub Kicinski
2025-09-07 16:38     ` Łukasz Majewski
2025-09-09  1:05       ` Jakub Kicinski
2025-09-10 21:15         ` Łukasz Majewski
2025-09-11  0:22           ` Jakub Kicinski
2025-09-11 21:55             ` Łukasz Majewski
2025-09-12  0:17               ` Jakub Kicinski
2025-08-27 18:16   ` Jakub Kicinski
2025-09-07 16:52     ` Łukasz Majewski
     [not found] ` <20250824220736.1760482-6-lukasz.majewski@mailbox.org>
2025-08-27 15:25   ` [net-next v19 5/7] net: mtip: Add mtip_switch_{rx|tx} " Jakub Kicinski
2025-09-07 16:48     ` Łukasz Majewski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).