[PATCH net 0/3] net: fec: fix some issues of ndo_xdp

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net 0/3] net: fec: fix some issues of ndo_xdp_xmit()
@ 2023-07-04  8:29 wei.fang
  2023-07-04  8:29 ` [PATCH net 1/3] net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP wei.fang
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: wei.fang @ 2023-07-04  8:29 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, ast, daniel, hawk, john.fastabend,
	shenwei.wang, xiaoning.wang, netdev
  Cc: linux-imx, linux-kernel, bpf

From: Wei Fang <wei.fang@nxp.com>

We encountered some issues when testing the ndo_xdp_xmit() interface
of the fec driver on i.MX8MP and i.MX93 platforms. These issues are
easy to reproduce, and the specific reproduction steps are as follows.

step1: The ethernet port of a board (board A) is connected to the EQOS
port of i.MX8MP/i.MX93, and the FEC port of i.MX8MP/i.MX93 is connected
to another ethernet port, such as a switch port.

step2: Board A uses the pktgen_sample03_burst_single_flow.sh to generate
and send packets to i.MX8MP/i.MX93. The command is shown below.
./pktgen_sample03_burst_single_flow.sh -i eth0 -d 192.168.6.8 -m \
56:bf:0d:68:b0:9e -s 1500

step3: i.MX8MP/i.MX93 use the xdp_redirect bfp program to redirect the
XDP frames from EQOS port to FEC port. The command is shown below.
./xdp_redirect eth1 eth0

After a few moments, the warning or error logs will be printed in the
console, for more details, please refer to the commit message of each
patch.

Wei Fang (3):
  net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP
  net: fec: recycle pages for transmitted XDP frames
  net: fec: increase the size of tx ring and update thresholds of tx
    ring

 drivers/net/ethernet/freescale/fec.h      |  17 ++-
 drivers/net/ethernet/freescale/fec_main.c | 168 +++++++++++++++-------
 2 files changed, 128 insertions(+), 57 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH net 1/3] net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP
  2023-07-04  8:29 [PATCH net 0/3] net: fec: fix some issues of ndo_xdp_xmit() wei.fang
@ 2023-07-04  8:29 ` wei.fang
  2023-07-04 23:41   ` Andrew Lunn
  2023-07-04  8:29 ` [PATCH net 2/3] net: fec: recycle pages for transmitted XDP frames wei.fang
  2023-07-04  8:29 ` [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring wei.fang
  2 siblings, 1 reply; 18+ messages in thread
From: wei.fang @ 2023-07-04  8:29 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, ast, daniel, hawk, john.fastabend,
	shenwei.wang, xiaoning.wang, netdev
  Cc: linux-imx, linux-kernel, bpf

From: Wei Fang <wei.fang@nxp.com>

When a XDP program is installed or uninstalled, fec_restart() will
be invoked to reset MAC and buffer descriptor rings. It's reasonable
not to transmit any packet during the process of reset. However, the
NETDEV_XDP_ACT_NDO_XMIT bit of xdp_features is enabled by default,
that is to say, it's possible that the fec_enet_xdp_xmit() will be
invoked even if the process of reset is not finished. In this case,
the redirected XDP frames might be dropped and available transmit BDs
may be incorrectly deemed insufficient. So this patch disable the
NETDEV_XDP_ACT_NDO_XMIT feature by default and dynamically configure
this feature when the bpf program is installed or uninstalled.

Fixes: e4ac7cc6e5a4 ("net: fec: turn on XDP features")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
---
 drivers/net/ethernet/freescale/fec_main.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 8fbe47703d47..9ce0319b33c3 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3732,12 +3732,18 @@ static int fec_enet_bpf(struct net_device *dev, struct netdev_bpf *bpf)
 		if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
 			return -EOPNOTSUPP;
 
+		if (!bpf->prog)
+			xdp_features_clear_redirect_target(dev);
+
 		if (is_run) {
 			napi_disable(&fep->napi);
 			netif_tx_disable(dev);
 		}
 
 		old_prog = xchg(&fep->xdp_prog, bpf->prog);
+		if (old_prog)
+			bpf_prog_put(old_prog);
+
 		fec_restart(dev);
 
 		if (is_run) {
@@ -3745,8 +3751,8 @@ static int fec_enet_bpf(struct net_device *dev, struct netdev_bpf *bpf)
 			netif_tx_start_all_queues(dev);
 		}
 
-		if (old_prog)
-			bpf_prog_put(old_prog);
+		if (bpf->prog)
+			xdp_features_set_redirect_target(dev, false);
 
 		return 0;
 
@@ -4016,8 +4022,7 @@ static int fec_enet_init(struct net_device *ndev)
 
 	if (!(fep->quirks & FEC_QUIRK_SWAP_FRAME))
 		ndev->xdp_features = NETDEV_XDP_ACT_BASIC |
-				     NETDEV_XDP_ACT_REDIRECT |
-				     NETDEV_XDP_ACT_NDO_XMIT;
+				     NETDEV_XDP_ACT_REDIRECT;
 
 	fec_restart(ndev);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH net 1/3] net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP
  2023-07-04  8:29 ` [PATCH net 1/3] net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP wei.fang
@ 2023-07-04 23:41   ` Andrew Lunn
  2023-07-04 23:54     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Lunn @ 2023-07-04 23:41 UTC (permalink / raw)
  To: wei.fang
  Cc: davem, edumazet, kuba, pabeni, ast, daniel, hawk, john.fastabend,
	shenwei.wang, xiaoning.wang, netdev, linux-imx, linux-kernel, bpf

On Tue, Jul 04, 2023 at 04:29:14PM +0800, wei.fang@nxp.com wrote:
> From: Wei Fang <wei.fang@nxp.com>
> 
> When a XDP program is installed or uninstalled, fec_restart() will
> be invoked to reset MAC and buffer descriptor rings. It's reasonable
> not to transmit any packet during the process of reset. However, the
> NETDEV_XDP_ACT_NDO_XMIT bit of xdp_features is enabled by default,
> that is to say, it's possible that the fec_enet_xdp_xmit() will be
> invoked even if the process of reset is not finished. In this case,
> the redirected XDP frames might be dropped and available transmit BDs
> may be incorrectly deemed insufficient. So this patch disable the
> NETDEV_XDP_ACT_NDO_XMIT feature by default and dynamically configure
> this feature when the bpf program is installed or uninstalled.

I don't know much about XDP, so please excuse what might be a stupid
question.

Is this a generic issue? Should this
xdp_features_clear_redirect_target(dev) /
xdp_features_set_redirect_target(dev, false) be done in the core?

	Andrew				      

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net 1/3] net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP
  2023-07-04 23:41   ` Andrew Lunn
@ 2023-07-04 23:54     ` Toke Høiland-Jørgensen
  2023-07-05  6:31       ` Wei Fang
  0 siblings, 1 reply; 18+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-07-04 23:54 UTC (permalink / raw)
  To: Andrew Lunn, wei.fang
  Cc: davem, edumazet, kuba, pabeni, ast, daniel, hawk, john.fastabend,
	shenwei.wang, xiaoning.wang, netdev, linux-imx, linux-kernel, bpf

Andrew Lunn <andrew@lunn.ch> writes:

> On Tue, Jul 04, 2023 at 04:29:14PM +0800, wei.fang@nxp.com wrote:
>> From: Wei Fang <wei.fang@nxp.com>
>> 
>> When a XDP program is installed or uninstalled, fec_restart() will
>> be invoked to reset MAC and buffer descriptor rings. It's reasonable
>> not to transmit any packet during the process of reset. However, the
>> NETDEV_XDP_ACT_NDO_XMIT bit of xdp_features is enabled by default,
>> that is to say, it's possible that the fec_enet_xdp_xmit() will be
>> invoked even if the process of reset is not finished. In this case,
>> the redirected XDP frames might be dropped and available transmit BDs
>> may be incorrectly deemed insufficient. So this patch disable the
>> NETDEV_XDP_ACT_NDO_XMIT feature by default and dynamically configure
>> this feature when the bpf program is installed or uninstalled.
>
> I don't know much about XDP, so please excuse what might be a stupid
> question.
>
> Is this a generic issue? Should this
> xdp_features_clear_redirect_target(dev) /
> xdp_features_set_redirect_target(dev, false) be done in the core?

No, because not all drivers require an XDP program to be attached to
support being a redirect target (which is one of the reasons we
introduced these feature bits in the first place :)).

-Toke


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH net 1/3] net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP
  2023-07-04 23:54     ` Toke Høiland-Jørgensen
@ 2023-07-05  6:31       ` Wei Fang
  0 siblings, 0 replies; 18+ messages in thread
From: Wei Fang @ 2023-07-05  6:31 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Andrew Lunn
  Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, Shenwei Wang,
	Clark Wang, netdev@vger.kernel.org, dl-linux-imx,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org

> -----Original Message-----
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> Sent: 2023年7月5日 7:54
> To: Andrew Lunn <andrew@lunn.ch>; Wei Fang <wei.fang@nxp.com>
> Cc: davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; ast@kernel.org; daniel@iogearbox.net;
> hawk@kernel.org; john.fastabend@gmail.com; Shenwei Wang
> <shenwei.wang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
> linux-kernel@vger.kernel.org; bpf@vger.kernel.org
> Subject: Re: [PATCH net 1/3] net: fec: dynamically set the
> NETDEV_XDP_ACT_NDO_XMIT feature of XDP
> 
> Andrew Lunn <andrew@lunn.ch> writes:
> 
> > On Tue, Jul 04, 2023 at 04:29:14PM +0800, wei.fang@nxp.com wrote:
> >> From: Wei Fang <wei.fang@nxp.com>
> >>
> >> When a XDP program is installed or uninstalled, fec_restart() will be
> >> invoked to reset MAC and buffer descriptor rings. It's reasonable not
> >> to transmit any packet during the process of reset. However, the
> >> NETDEV_XDP_ACT_NDO_XMIT bit of xdp_features is enabled by default,
> >> that is to say, it's possible that the fec_enet_xdp_xmit() will be
> >> invoked even if the process of reset is not finished. In this case,
> >> the redirected XDP frames might be dropped and available transmit BDs
> >> may be incorrectly deemed insufficient. So this patch disable the
> >> NETDEV_XDP_ACT_NDO_XMIT feature by default and dynamically
> configure
> >> this feature when the bpf program is installed or uninstalled.
> >
> > I don't know much about XDP, so please excuse what might be a stupid
> > question.
> >
> > Is this a generic issue? Should this
> > xdp_features_clear_redirect_target(dev) /
> > xdp_features_set_redirect_target(dev, false) be done in the core?
> 
> No, because not all drivers require an XDP program to be attached to support
> being a redirect target (which is one of the reasons we introduced these
> feature bits in the first place :)).
> 

Hi Toke,
Thanks for your explanation so much. :)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH net 2/3] net: fec: recycle pages for transmitted XDP frames
  2023-07-04  8:29 [PATCH net 0/3] net: fec: fix some issues of ndo_xdp_xmit() wei.fang
  2023-07-04  8:29 ` [PATCH net 1/3] net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP wei.fang
@ 2023-07-04  8:29 ` wei.fang
  2023-07-04 23:48   ` Andrew Lunn
  2023-07-04  8:29 ` [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring wei.fang
  2 siblings, 1 reply; 18+ messages in thread
From: wei.fang @ 2023-07-04  8:29 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, ast, daniel, hawk, john.fastabend,
	shenwei.wang, xiaoning.wang, netdev
  Cc: linux-imx, linux-kernel, bpf

From: Wei Fang <wei.fang@nxp.com>

Once the XDP frames have been successfully transmitted through the
ndo_xdp_xmit() interface, it's the driver responsibility to free
the frames so that the page_pool can recycle the pages and reuse
them. However, this action is not implemented in the fec driver.
This leads to a user-visible problem that the console will print
the following warning log.

[  157.568851] page_pool_release_retry() stalled pool shutdown 1389 inflight 60 sec
[  217.983446] page_pool_release_retry() stalled pool shutdown 1389 inflight 120 sec
[  278.399006] page_pool_release_retry() stalled pool shutdown 1389 inflight 181 sec
[  338.812885] page_pool_release_retry() stalled pool shutdown 1389 inflight 241 sec
[  399.226946] page_pool_release_retry() stalled pool shutdown 1389 inflight 302 sec

Therefore, to solve this issue, we free XDP frames via xdp_return_frame()
while cleaning the tx BD ring.

Fixes: 6d6b39f180b8 ("net: fec: add initial XDP support")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
---
 drivers/net/ethernet/freescale/fec.h      |  15 ++-
 drivers/net/ethernet/freescale/fec_main.c | 148 +++++++++++++++-------
 2 files changed, 115 insertions(+), 48 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 9939ccafb556..8c0226d061fe 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -544,10 +544,23 @@ enum {
 	XDP_STATS_TOTAL,
 };
 
+enum fec_txbuf_type {
+	FEC_TXBUF_T_SKB,
+	FEC_TXBUF_T_XDP_NDO,
+};
+
+struct fec_tx_buffer {
+	union {
+		struct sk_buff *skb;
+		struct xdp_frame *xdp;
+	};
+	enum fec_txbuf_type type;
+};
+
 struct fec_enet_priv_tx_q {
 	struct bufdesc_prop bd;
 	unsigned char *tx_bounce[TX_RING_SIZE];
-	struct  sk_buff *tx_skbuff[TX_RING_SIZE];
+	struct fec_tx_buffer tx_buf[TX_RING_SIZE];
 
 	unsigned short tx_stop_threshold;
 	unsigned short tx_wake_threshold;
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 9ce0319b33c3..940d3afe1d24 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -397,7 +397,7 @@ static void fec_dump(struct net_device *ndev)
 			fec16_to_cpu(bdp->cbd_sc),
 			fec32_to_cpu(bdp->cbd_bufaddr),
 			fec16_to_cpu(bdp->cbd_datlen),
-			txq->tx_skbuff[index]);
+			txq->tx_buf[index].skb);
 		bdp = fec_enet_get_nextdesc(bdp, &txq->bd);
 		index++;
 	} while (bdp != txq->bd.base);
@@ -654,7 +654,7 @@ static int fec_enet_txq_submit_skb(struct fec_enet_priv_tx_q *txq,
 
 	index = fec_enet_get_bd_index(last_bdp, &txq->bd);
 	/* Save skb pointer */
-	txq->tx_skbuff[index] = skb;
+	txq->tx_buf[index].skb = skb;
 
 	/* Make sure the updates to rest of the descriptor are performed before
 	 * transferring ownership.
@@ -672,9 +672,7 @@ static int fec_enet_txq_submit_skb(struct fec_enet_priv_tx_q *txq,
 
 	skb_tx_timestamp(skb);
 
-	/* Make sure the update to bdp and tx_skbuff are performed before
-	 * txq->bd.cur.
-	 */
+	/* Make sure the update to bdp is performed before txq->bd.cur. */
 	wmb();
 	txq->bd.cur = bdp;
 
@@ -862,7 +860,7 @@ static int fec_enet_txq_submit_tso(struct fec_enet_priv_tx_q *txq,
 	}
 
 	/* Save skb pointer */
-	txq->tx_skbuff[index] = skb;
+	txq->tx_buf[index].skb = skb;
 
 	skb_tx_timestamp(skb);
 	txq->bd.cur = bdp;
@@ -952,16 +950,33 @@ static void fec_enet_bd_init(struct net_device *dev)
 		for (i = 0; i < txq->bd.ring_size; i++) {
 			/* Initialize the BD for every fragment in the page. */
 			bdp->cbd_sc = cpu_to_fec16(0);
-			if (bdp->cbd_bufaddr &&
-			    !IS_TSO_HEADER(txq, fec32_to_cpu(bdp->cbd_bufaddr)))
-				dma_unmap_single(&fep->pdev->dev,
-						 fec32_to_cpu(bdp->cbd_bufaddr),
-						 fec16_to_cpu(bdp->cbd_datlen),
-						 DMA_TO_DEVICE);
-			if (txq->tx_skbuff[i]) {
-				dev_kfree_skb_any(txq->tx_skbuff[i]);
-				txq->tx_skbuff[i] = NULL;
+			if (txq->tx_buf[i].type == FEC_TXBUF_T_SKB) {
+				if (bdp->cbd_bufaddr &&
+				    !IS_TSO_HEADER(txq, fec32_to_cpu(bdp->cbd_bufaddr)))
+					dma_unmap_single(&fep->pdev->dev,
+							 fec32_to_cpu(bdp->cbd_bufaddr),
+							 fec16_to_cpu(bdp->cbd_datlen),
+							 DMA_TO_DEVICE);
+				if (txq->tx_buf[i].skb) {
+					dev_kfree_skb_any(txq->tx_buf[i].skb);
+					txq->tx_buf[i].skb = NULL;
+				}
+			} else {
+				if (bdp->cbd_bufaddr)
+					dma_unmap_single(&fep->pdev->dev,
+							 fec32_to_cpu(bdp->cbd_bufaddr),
+							 fec16_to_cpu(bdp->cbd_datlen),
+							 DMA_TO_DEVICE);
+
+				if (txq->tx_buf[i].xdp) {
+					xdp_return_frame(txq->tx_buf[i].xdp);
+					txq->tx_buf[i].xdp = NULL;
+				}
+
+				/* restore default tx buffer type: FEC_TXBUF_T_SKB */
+				txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
 			}
+
 			bdp->cbd_bufaddr = cpu_to_fec32(0);
 			bdp = fec_enet_get_nextdesc(bdp, &txq->bd);
 		}
@@ -1360,6 +1375,7 @@ static void
 fec_enet_tx_queue(struct net_device *ndev, u16 queue_id)
 {
 	struct	fec_enet_private *fep;
+	struct xdp_frame *xdpf;
 	struct bufdesc *bdp;
 	unsigned short status;
 	struct	sk_buff	*skb;
@@ -1387,16 +1403,31 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id)
 
 		index = fec_enet_get_bd_index(bdp, &txq->bd);
 
-		skb = txq->tx_skbuff[index];
-		txq->tx_skbuff[index] = NULL;
-		if (!IS_TSO_HEADER(txq, fec32_to_cpu(bdp->cbd_bufaddr)))
-			dma_unmap_single(&fep->pdev->dev,
-					 fec32_to_cpu(bdp->cbd_bufaddr),
-					 fec16_to_cpu(bdp->cbd_datlen),
-					 DMA_TO_DEVICE);
-		bdp->cbd_bufaddr = cpu_to_fec32(0);
-		if (!skb)
-			goto skb_done;
+		if (txq->tx_buf[index].type == FEC_TXBUF_T_SKB) {
+			skb = txq->tx_buf[index].skb;
+			txq->tx_buf[index].skb = NULL;
+			if (bdp->cbd_bufaddr &&
+			    !IS_TSO_HEADER(txq, fec32_to_cpu(bdp->cbd_bufaddr)))
+				dma_unmap_single(&fep->pdev->dev,
+						 fec32_to_cpu(bdp->cbd_bufaddr),
+						 fec16_to_cpu(bdp->cbd_datlen),
+						 DMA_TO_DEVICE);
+			bdp->cbd_bufaddr = cpu_to_fec32(0);
+			if (!skb)
+				goto tx_buf_done;
+		} else {
+			xdpf = txq->tx_buf[index].xdp;
+			if (bdp->cbd_bufaddr)
+				dma_unmap_single(&fep->pdev->dev,
+						 fec32_to_cpu(bdp->cbd_bufaddr),
+						 fec16_to_cpu(bdp->cbd_datlen),
+						 DMA_TO_DEVICE);
+			bdp->cbd_bufaddr = cpu_to_fec32(0);
+			if (!xdpf) {
+				txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
+				goto tx_buf_done;
+			}
+		}
 
 		/* Check for errors. */
 		if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
@@ -1415,21 +1446,11 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id)
 				ndev->stats.tx_carrier_errors++;
 		} else {
 			ndev->stats.tx_packets++;
-			ndev->stats.tx_bytes += skb->len;
-		}
 
-		/* NOTE: SKBTX_IN_PROGRESS being set does not imply it's we who
-		 * are to time stamp the packet, so we still need to check time
-		 * stamping enabled flag.
-		 */
-		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS &&
-			     fep->hwts_tx_en) &&
-		    fep->bufdesc_ex) {
-			struct skb_shared_hwtstamps shhwtstamps;
-			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-
-			fec_enet_hwtstamp(fep, fec32_to_cpu(ebdp->ts), &shhwtstamps);
-			skb_tstamp_tx(skb, &shhwtstamps);
+			if (txq->tx_buf[index].type == FEC_TXBUF_T_SKB)
+				ndev->stats.tx_bytes += skb->len;
+			else
+				ndev->stats.tx_bytes += xdpf->len;
 		}
 
 		/* Deferred means some collisions occurred during transmit,
@@ -1438,10 +1459,32 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id)
 		if (status & BD_ENET_TX_DEF)
 			ndev->stats.collisions++;
 
-		/* Free the sk buffer associated with this last transmit */
-		dev_kfree_skb_any(skb);
-skb_done:
-		/* Make sure the update to bdp and tx_skbuff are performed
+		if (txq->tx_buf[index].type == FEC_TXBUF_T_SKB) {
+			/* NOTE: SKBTX_IN_PROGRESS being set does not imply it's we who
+			 * are to time stamp the packet, so we still need to check time
+			 * stamping enabled flag.
+			 */
+			if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS &&
+				     fep->hwts_tx_en) && fep->bufdesc_ex) {
+				struct skb_shared_hwtstamps shhwtstamps;
+				struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
+
+				fec_enet_hwtstamp(fep, fec32_to_cpu(ebdp->ts), &shhwtstamps);
+				skb_tstamp_tx(skb, &shhwtstamps);
+			}
+
+			/* Free the sk buffer associated with this last transmit */
+			dev_kfree_skb_any(skb);
+		} else {
+			xdp_return_frame(xdpf);
+
+			txq->tx_buf[index].xdp = NULL;
+			/* restore default tx buffer type: FEC_TXBUF_T_SKB */
+			txq->tx_buf[index].type = FEC_TXBUF_T_SKB;
+		}
+
+tx_buf_done:
+		/* Make sure the update to bdp and tx_buf are performed
 		 * before dirty_tx
 		 */
 		wmb();
@@ -3249,9 +3292,19 @@ static void fec_enet_free_buffers(struct net_device *ndev)
 		for (i = 0; i < txq->bd.ring_size; i++) {
 			kfree(txq->tx_bounce[i]);
 			txq->tx_bounce[i] = NULL;
-			skb = txq->tx_skbuff[i];
-			txq->tx_skbuff[i] = NULL;
-			dev_kfree_skb(skb);
+
+			if (txq->tx_buf[i].type == FEC_TXBUF_T_SKB) {
+				skb = txq->tx_buf[i].skb;
+				txq->tx_buf[i].skb = NULL;
+				dev_kfree_skb(skb);
+			} else {
+				if (txq->tx_buf[i].xdp) {
+					xdp_return_frame(txq->tx_buf[i].xdp);
+					txq->tx_buf[i].xdp = NULL;
+				}
+
+				txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
+			}
 		}
 	}
 }
@@ -3817,7 +3870,8 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
 		ebdp->cbd_esc = cpu_to_fec32(estatus);
 	}
 
-	txq->tx_skbuff[index] = NULL;
+	txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
+	txq->tx_buf[index].xdp = frame;
 
 	/* Make sure the updates to rest of the descriptor are performed before
 	 * transferring ownership.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH net 2/3] net: fec: recycle pages for transmitted XDP frames
  2023-07-04  8:29 ` [PATCH net 2/3] net: fec: recycle pages for transmitted XDP frames wei.fang
@ 2023-07-04 23:48   ` Andrew Lunn
  2023-07-05  1:56     ` Wei Fang
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Lunn @ 2023-07-04 23:48 UTC (permalink / raw)
  To: wei.fang
  Cc: davem, edumazet, kuba, pabeni, ast, daniel, hawk, john.fastabend,
	shenwei.wang, xiaoning.wang, netdev, linux-imx, linux-kernel, bpf

>  	/* Save skb pointer */
> -	txq->tx_skbuff[index] = skb;
> +	txq->tx_buf[index].skb = skb;

What about txq->tx_buf[index].type ?

> @@ -862,7 +860,7 @@ static int fec_enet_txq_submit_tso(struct fec_enet_priv_tx_q *txq,
>  	}
>  
>  	/* Save skb pointer */
> -	txq->tx_skbuff[index] = skb;
> +	txq->tx_buf[index].skb = skb;

here as well.

> +				/* restore default tx buffer type: FEC_TXBUF_T_SKB */
> +				txq->tx_buf[i].type = FEC_TXBUF_T_SKB;

Seems error prone. It would be safer to explicitly set it next to
assigning .skb/.xdp.

	  Andrew

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH net 2/3] net: fec: recycle pages for transmitted XDP frames
  2023-07-04 23:48   ` Andrew Lunn
@ 2023-07-05  1:56     ` Wei Fang
  0 siblings, 0 replies; 18+ messages in thread
From: Wei Fang @ 2023-07-05  1:56 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, Shenwei Wang,
	Clark Wang, netdev@vger.kernel.org, dl-linux-imx,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org


> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: 2023年7月5日 7:48
> To: Wei Fang <wei.fang@nxp.com>
> Cc: davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; ast@kernel.org; daniel@iogearbox.net;
> hawk@kernel.org; john.fastabend@gmail.com; Shenwei Wang
> <shenwei.wang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
> linux-kernel@vger.kernel.org; bpf@vger.kernel.org
> Subject: Re: [PATCH net 2/3] net: fec: recycle pages for transmitted XDP frames
> 
> >  	/* Save skb pointer */
> > -	txq->tx_skbuff[index] = skb;
> > +	txq->tx_buf[index].skb = skb;
> 
> What about txq->tx_buf[index].type ?
> 
we restore the buffer type to FEC_TXBUF_T_SKB when recycling the buffer descriptor,
so there is no need to set it again here.

> > @@ -862,7 +860,7 @@ static int fec_enet_txq_submit_tso(struct
> fec_enet_priv_tx_q *txq,
> >  	}
> >
> >  	/* Save skb pointer */
> > -	txq->tx_skbuff[index] = skb;
> > +	txq->tx_buf[index].skb = skb;
> 
> here as well.
> 
> > +				/* restore default tx buffer type: FEC_TXBUF_T_SKB */
> > +				txq->tx_buf[i].type = FEC_TXBUF_T_SKB;
> 
> Seems error prone. It would be safer to explicitly set it next to
> assigning .skb/.xdp.
I also considered this method, but in the case when skb has frags or TSO,
we only need to set skb pointer for the first txq->tx_buf, but we need to
explicitly set the type for all txq->tx_buf corresponding to the skb. This may
confuse others.
As for your concern about being error-prone. I don't think it should be, I
reset the type to FEC_TXBUF_T_SKB in all places where buffers are recycled
or cleared.

> 
> 	  Andrew

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring
  2023-07-04  8:29 [PATCH net 0/3] net: fec: fix some issues of ndo_xdp_xmit() wei.fang
  2023-07-04  8:29 ` [PATCH net 1/3] net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP wei.fang
  2023-07-04  8:29 ` [PATCH net 2/3] net: fec: recycle pages for transmitted XDP frames wei.fang
@ 2023-07-04  8:29 ` wei.fang
  2023-07-05  0:25   ` Andrew Lunn
  2 siblings, 1 reply; 18+ messages in thread
From: wei.fang @ 2023-07-04  8:29 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, ast, daniel, hawk, john.fastabend,
	shenwei.wang, xiaoning.wang, netdev
  Cc: linux-imx, linux-kernel, bpf

From: Wei Fang <wei.fang@nxp.com>

When the XDP feature is enabled and with heavy XDP frames to be
transmitted, there is a considerable probability that available
tx BDs are insufficient. This will lead to some XDP frames to be
discarded and the "NOT enough BD for SG!" error log will appear
in the console (as shown below).

[  160.013112] fec 30be0000.ethernet eth0: NOT enough BD for SG!
[  160.023116] fec 30be0000.ethernet eth0: NOT enough BD for SG!
[  160.028926] fec 30be0000.ethernet eth0: NOT enough BD for SG!
[  160.038946] fec 30be0000.ethernet eth0: NOT enough BD for SG!
[  160.044758] fec 30be0000.ethernet eth0: NOT enough BD for SG!

In the case of heavy XDP traffic, sometimes the speed of recycling
tx BDs may be slower than the speed of sending XDP frames. There
may be several specific reasons, such as the interrupt is not
responsed in time, the efficiency of the NAPI callback function is
too low due to all the queues (tx queues and rx queues) share the
same NAPI, and so on.

After trying various methods, I think that increase the size of tx
BD ring is simple and effective. Maybe the best resolution is that
allocate NAPI for each queue to improve the efficiency of the NAPI
callback, but this change is a bit big and I didn't try this method.
Perheps this method will be implemented in a future patch.

In addtion, this patch also updates the tx_stop_threshold and the
tx_wake_threshold of the tx ring. In previous logic, the value of
tx_stop_threshold is 217, however, the value of tx_wake_threshold
is 147, it does not make sense that tx_wake_threshold is less than
tx_stop_threshold. Besides, both XDP path and 'slow path' share the
tx BD rings. So if tx_stop_threshold is 217, in the case of heavy
XDP traffic, the slow path is easily to be stopped, this will have
a serious impact on the slow path.

Fixes: 6d6b39f180b8 ("net: fec: add initial XDP support")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
---
 drivers/net/ethernet/freescale/fec.h      | 2 +-
 drivers/net/ethernet/freescale/fec_main.c | 7 +++----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 8c0226d061fe..63a053dea819 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -355,7 +355,7 @@ struct bufdesc_ex {
 #define RX_RING_SIZE		(FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES)
 #define FEC_ENET_TX_FRSIZE	2048
 #define FEC_ENET_TX_FRPPG	(PAGE_SIZE / FEC_ENET_TX_FRSIZE)
-#define TX_RING_SIZE		512	/* Must be power of two */
+#define TX_RING_SIZE		1024	/* Must be power of two */
 #define TX_RING_MOD_MASK	511	/*   for this to work */

 #define BD_ENET_RX_INT		0x00800000
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 940d3afe1d24..6c255c0d6f41 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3348,9 +3348,8 @@ static int fec_enet_alloc_queue(struct net_device *ndev)
 		txq->bd.ring_size = TX_RING_SIZE;
 		fep->total_tx_ring_size += fep->tx_queue[i]->bd.ring_size;

-		txq->tx_stop_threshold = FEC_MAX_SKB_DESCS;
-		txq->tx_wake_threshold =
-			(txq->bd.ring_size - txq->tx_stop_threshold) / 2;
+		txq->tx_stop_threshold = MAX_SKB_FRAGS;
+		txq->tx_wake_threshold = MAX_SKB_FRAGS + 1;

 		txq->tso_hdrs = dma_alloc_coherent(&fep->pdev->dev,
 					txq->bd.ring_size * TSO_HEADER_SIZE,
@@ -3837,7 +3836,7 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,

 	entries_free = fec_enet_get_free_txdesc_num(txq);
 	if (entries_free < MAX_SKB_FRAGS + 1) {
-		netdev_err(fep->netdev, "NOT enough BD for SG!\n");
+		netdev_err_once(fep->netdev, "NOT enough BD for SG!\n");
 		return -EBUSY;
 	}

-- 
2.25.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring
  2023-07-04  8:29 ` [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring wei.fang
@ 2023-07-05  0:25   ` Andrew Lunn
  2023-07-05  6:20     ` Wei Fang
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Lunn @ 2023-07-05  0:25 UTC (permalink / raw)
  To: wei.fang
  Cc: davem, edumazet, kuba, pabeni, ast, daniel, hawk, john.fastabend,
	shenwei.wang, xiaoning.wang, netdev, linux-imx, linux-kernel, bpf

> After trying various methods, I think that increase the size of tx
> BD ring is simple and effective. Maybe the best resolution is that
> allocate NAPI for each queue to improve the efficiency of the NAPI
> callback, but this change is a bit big and I didn't try this method.
> Perheps this method will be implemented in a future patch.

How does this affect platforms like Vybrid with its fast Ethernet?
Does the burst latency go up?

> In addtion, this patch also updates the tx_stop_threshold and the
> tx_wake_threshold of the tx ring. In previous logic, the value of
> tx_stop_threshold is 217, however, the value of tx_wake_threshold
> is 147, it does not make sense that tx_wake_threshold is less than
> tx_stop_threshold.

What do these actually mean? I could imagine that as the ring fills
you don't want to stop until it is 217/512 full. There is then some
hysteresis, such that it has to drop below 147/512 before more can be
added? 

> Besides, both XDP path and 'slow path' share the
> tx BD rings. So if tx_stop_threshold is 217, in the case of heavy
> XDP traffic, the slow path is easily to be stopped, this will have
> a serious impact on the slow path.

Please post your iperf results for various platforms, so we can see
the effects of this. We generally don't accept tuning patches without
benchmarks which prove the improvements, and also show there is no
regression. And given the wide variety of SoCs using the FEC, i expect
testing on a number of SoCs, but Fast and 1G.

	Andrew

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring
  2023-07-05  0:25   ` Andrew Lunn
@ 2023-07-05  6:20     ` Wei Fang
  2023-07-05 18:11       ` Jakub Kicinski
  2023-07-08 19:03       ` Andrew Lunn
  0 siblings, 2 replies; 18+ messages in thread
From: Wei Fang @ 2023-07-05  6:20 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, Shenwei Wang,
	Clark Wang, netdev@vger.kernel.org, dl-linux-imx,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org

> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: 2023年7月5日 8:25
> To: Wei Fang <wei.fang@nxp.com>
> Cc: davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; ast@kernel.org; daniel@iogearbox.net;
> hawk@kernel.org; john.fastabend@gmail.com; Shenwei Wang
> <shenwei.wang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
> linux-kernel@vger.kernel.org; bpf@vger.kernel.org
> Subject: Re: [PATCH net 3/3] net: fec: increase the size of tx ring and update
> thresholds of tx ring
> 
> > After trying various methods, I think that increase the size of tx BD
> > ring is simple and effective. Maybe the best resolution is that
> > allocate NAPI for each queue to improve the efficiency of the NAPI
> > callback, but this change is a bit big and I didn't try this method.
> > Perheps this method will be implemented in a future patch.
> 
> How does this affect platforms like Vybrid with its fast Ethernet?
Sorry, I don't have the Vybrid platform, but I think I don't think it has much
impact, at most it just takes up some more memory.

> Does the burst latency go up?
No, for fec, when a packet is attached to the BDs, the software will immediately
trigger the hardware to send the packet. In addition, I think it may improve the
latency, because the size of the tx ring becomes larger, and more packets can be
attached to the BD ring for burst traffic.

> 
> > In addtion, this patch also updates the tx_stop_threshold and the
> > tx_wake_threshold of the tx ring. In previous logic, the value of
> > tx_stop_threshold is 217, however, the value of tx_wake_threshold is
> > 147, it does not make sense that tx_wake_threshold is less than
> > tx_stop_threshold.
> 
> What do these actually mean? I could imagine that as the ring fills you don't
> want to stop until it is 217/512 full. There is then some hysteresis, such that it
> has to drop below 147/512 before more can be added?
> 
You must have misunderstood, let me explain more clearly, the queue will be
stopped when the available BDs are less than tx_stop_threshold (217 BDs). And
the queue will be waked when the available BDs are greater than tx_wake_threshold
(147 BDs). So in most cases, the available BDs are greater than tx_wake_threshold
when the queue is stopped, the only effect is to delay packet sending.
In my opinion, tx_wake_threshold should be greater than tx_stop_threshold, we
should stop queue when the available BDs are not enough for a skb to be attached.
And wake the queue when the available BDs are sufficient for a skb.

> > Besides, both XDP path and 'slow path' share the tx BD rings. So if
> > tx_stop_threshold is 217, in the case of heavy XDP traffic, the slow
> > path is easily to be stopped, this will have a serious impact on the
> > slow path.
> 
> Please post your iperf results for various platforms, so we can see the effects of
> this. We generally don't accept tuning patches without benchmarks which
> prove the improvements, and also show there is no regression. And given the
> wide variety of SoCs using the FEC, i expect testing on a number of SoCs, but
> Fast and 1G.
> 
Below are the results on i.MX6UL/8MM/8MP/8ULP/93 platforms, i.MX6UL and
8ULP only support Fast ethernet. Others support 1G.
1.1 i.MX6UL with tx ring size 512
root@imx6ul7d:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 47148 connected to 192.168.3.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  10.1 MBytes  84.9 Mbits/sec    0    103 KBytes
[  5]   1.00-2.00   sec  9.88 MBytes  82.6 Mbits/sec    0    103 KBytes
[  5]   2.00-3.00   sec  9.82 MBytes  82.7 Mbits/sec    0    103 KBytes
[  5]   3.00-4.00   sec  9.82 MBytes  82.4 Mbits/sec    0    103 KBytes
[  5]   4.00-5.00   sec  9.88 MBytes  82.9 Mbits/sec    0    103 KBytes
[  5]   5.00-6.00   sec  9.94 MBytes  83.4 Mbits/sec    0    103 KBytes
[  5]   6.00-7.00   sec  10.1 MBytes  84.3 Mbits/sec    0    103 KBytes
[  5]   7.00-8.00   sec  9.82 MBytes  82.4 Mbits/sec    0    103 KBytes
[  5]   8.00-9.00   sec  9.82 MBytes  82.4 Mbits/sec    0    103 KBytes
[  5]   9.00-10.00  sec  9.88 MBytes  82.9 Mbits/sec    0    103 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  99.0 MBytes  83.1 Mbits/sec    0             sender
[  5]   0.00-10.01  sec  98.8 MBytes  82.8 Mbits/sec                  receiver

1.2 i.MX6UL with tx ring size 1024
root@imx6ul7d:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 55670 connected to 192.168.3.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  10.2 MBytes  85.4 Mbits/sec    0    236 KBytes
[  5]   1.00-2.00   sec  10.1 MBytes  84.6 Mbits/sec    0    236 KBytes
[  5]   2.00-3.00   sec  10.2 MBytes  85.5 Mbits/sec    0    249 KBytes
[  5]   3.00-4.00   sec  10.1 MBytes  85.1 Mbits/sec    0    249 KBytes
[  5]   4.00-5.00   sec  10.1 MBytes  84.7 Mbits/sec    0    249 KBytes
[  5]   5.00-6.00   sec  10.0 MBytes  84.1 Mbits/sec    0    249 KBytes
[  5]   6.00-7.00   sec  10.1 MBytes  85.1 Mbits/sec    0    249 KBytes
[  5]   7.00-8.00   sec  10.1 MBytes  84.9 Mbits/sec    0    249 KBytes
[  5]   8.00-9.00   sec  10.3 MBytes  85.9 Mbits/sec    0    249 KBytes
[  5]   9.00-10.00  sec  10.2 MBytes  85.6 Mbits/sec    0    249 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   101 MBytes  85.1 Mbits/sec    0             sender
[  5]   0.00-10.01  sec   101 MBytes  84.5 Mbits/sec                  receiver

2.1 i.MX8ULP with tx ring size 512
root@imx8ulpevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 54342 connected to 192.168.3.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  10.8 MBytes  90.3 Mbits/sec    0   99.0 KBytes
[  5]   1.00-2.00   sec  9.94 MBytes  83.4 Mbits/sec    0   99.0 KBytes
[  5]   2.00-3.00   sec  10.2 MBytes  85.5 Mbits/sec    0   99.0 KBytes
[  5]   3.00-4.00   sec  9.94 MBytes  83.4 Mbits/sec    0   99.0 KBytes
[  5]   4.00-5.00   sec  10.2 MBytes  85.5 Mbits/sec    0   99.0 KBytes
[  5]   5.00-6.00   sec  9.94 MBytes  83.4 Mbits/sec    0   99.0 KBytes
[  5]   6.00-7.00   sec  9.69 MBytes  81.3 Mbits/sec    0   99.0 KBytes
[  5]   7.00-8.00   sec  9.94 MBytes  83.4 Mbits/sec    0   99.0 KBytes
[  5]   8.00-9.00   sec  9.69 MBytes  81.3 Mbits/sec    0   99.0 KBytes
[  5]   9.00-10.00  sec  10.2 MBytes  85.5 Mbits/sec    0   99.0 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   100 MBytes  84.3 Mbits/sec    0             sender
[  5]   0.00-9.90   sec   100 MBytes  84.7 Mbits/sec                  receiver

2.1 i.MX8ULP with tx ring size 1024
root@imx8ulpevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 44770 connected to 192.168.3.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  10.7 MBytes  90.1 Mbits/sec    0   93.3 KBytes
[  5]   1.00-2.00   sec  9.94 MBytes  83.4 Mbits/sec    0   93.3 KBytes
[  5]   2.00-3.00   sec  10.2 MBytes  85.5 Mbits/sec    0   93.3 KBytes
[  5]   3.00-4.00   sec  10.1 MBytes  85.0 Mbits/sec    0   93.3 KBytes
[  5]   4.00-5.00   sec  9.94 MBytes  83.4 Mbits/sec    0    100 KBytes
[  5]   5.00-6.00   sec  10.2 MBytes  85.5 Mbits/sec    0    100 KBytes
[  5]   6.00-7.00   sec  9.69 MBytes  81.3 Mbits/sec    0    100 KBytes
[  5]   7.00-8.00   sec  9.94 MBytes  83.4 Mbits/sec    0    100 KBytes
[  5]   8.00-9.00   sec  10.2 MBytes  85.5 Mbits/sec    0    100 KBytes
[  5]   9.00-10.00  sec  9.69 MBytes  81.3 Mbits/sec    0    100 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   101 MBytes  84.4 Mbits/sec    0             sender
[  5]   0.00-9.92   sec   100 MBytes  84.8 Mbits/sec                  receiver

3.1 i.MX8MM with tx ring size 512
root@imx8mmevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 55734 connected to 192.168.3.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   111 MBytes   934 Mbits/sec    0    577 KBytes
[  5]   1.00-2.00   sec   112 MBytes   937 Mbits/sec    0    577 KBytes
[  5]   2.00-3.00   sec   112 MBytes   942 Mbits/sec    0    609 KBytes
[  5]   3.00-4.00   sec   113 MBytes   945 Mbits/sec    0    638 KBytes
[  5]   4.00-5.00   sec   112 MBytes   941 Mbits/sec    0    638 KBytes
[  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec    0    638 KBytes
[  5]   6.00-7.00   sec   112 MBytes   942 Mbits/sec    0    638 KBytes
[  5]   7.00-8.00   sec   112 MBytes   943 Mbits/sec    0    638 KBytes
[  5]   8.00-9.00   sec   112 MBytes   943 Mbits/sec    0    638 KBytes
[  5]   9.00-10.00  sec   112 MBytes   942 Mbits/sec    0    638 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec                  receiver

3.2 i.MX8MM with tx ring size 1024
root@imx8mmevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 53350 connected to 192.168.3.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   952 Mbits/sec    0    585 KBytes
[  5]   1.00-2.00   sec   112 MBytes   942 Mbits/sec    0    585 KBytes
[  5]   2.00-3.00   sec   113 MBytes   947 Mbits/sec    0    585 KBytes
[  5]   3.00-4.00   sec   112 MBytes   940 Mbits/sec    0    648 KBytes
[  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec    0    648 KBytes
[  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec    0    648 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    648 KBytes
[  5]   7.00-8.00   sec   112 MBytes   944 Mbits/sec    0    648 KBytes
[  5]   8.00-9.00   sec   112 MBytes   944 Mbits/sec    0    648 KBytes
[  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    648 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.09 GBytes   940 Mbits/sec                  receiver

4.1 i.MX8MP with tx ring size 512
root@imx8mpevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 51892 connected to 192.168.3.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   959 Mbits/sec    0    594 KBytes
[  5]   1.00-2.00   sec   112 MBytes   940 Mbits/sec    0    626 KBytes
[  5]   2.00-3.00   sec   113 MBytes   946 Mbits/sec    0    626 KBytes
[  5]   3.00-4.00   sec   112 MBytes   937 Mbits/sec    0    626 KBytes
[  5]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0    626 KBytes
[  5]   5.00-6.00   sec   112 MBytes   940 Mbits/sec    0    626 KBytes
[  5]   6.00-7.00   sec   113 MBytes   946 Mbits/sec    0    626 KBytes
[  5]   7.00-8.00   sec   112 MBytes   939 Mbits/sec    0    626 KBytes
[  5]   8.00-9.00   sec   111 MBytes   935 Mbits/sec    0    626 KBytes
[  5]   9.00-10.00  sec   112 MBytes   943 Mbits/sec    0    626 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.09 GBytes   940 Mbits/sec                  receiver

4.2 i.MX8MP with tx ring size 1024
root@imx8mpevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 37922 connected to 192.168.3.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec    0    608 KBytes
[  5]   1.00-2.00   sec   112 MBytes   937 Mbits/sec    0    608 KBytes
[  5]   2.00-3.00   sec   113 MBytes   947 Mbits/sec    0    608 KBytes
[  5]   3.00-4.00   sec   111 MBytes   934 Mbits/sec    0    608 KBytes
[  5]   4.00-5.00   sec   112 MBytes   942 Mbits/sec    0    608 KBytes
[  5]   5.00-6.00   sec   112 MBytes   939 Mbits/sec    0    608 KBytes
[  5]   6.00-7.00   sec   113 MBytes   949 Mbits/sec    0    608 KBytes
[  5]   7.00-8.00   sec   112 MBytes   942 Mbits/sec    0    608 KBytes
[  5]   8.00-9.00   sec   112 MBytes   936 Mbits/sec    0    608 KBytes
[  5]   9.00-10.00  sec   112 MBytes   942 Mbits/sec    0    608 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   942 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.09 GBytes   939 Mbits/sec                  receiver

5.1 i.MX93 with tx ring size 512
root@imx93evk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 44216 connected to 192.168.3.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   115 MBytes   965 Mbits/sec    0    656 KBytes
[  5]   1.00-2.00   sec   111 MBytes   934 Mbits/sec    0    656 KBytes
[  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]   3.00-4.00   sec   112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec    0    656 KBytes
[  5]   6.00-7.00   sec   112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]   7.00-8.00   sec   112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]   8.00-9.00   sec   112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    656 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   944 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver

5.2 i.MX93 with tx ring size 1024
root@imx93evk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 51058 connected to 192.168.3.6 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   959 Mbits/sec    0    588 KBytes
[  5]   1.00-2.00   sec   112 MBytes   935 Mbits/sec    0    649 KBytes
[  5]   2.00-3.00   sec   112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]   3.00-4.00   sec   112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]   4.00-5.00   sec   112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]   5.00-6.00   sec   112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]   6.00-7.00   sec   111 MBytes   933 Mbits/sec    0    649 KBytes
[  5]   7.00-8.00   sec   112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]   8.00-9.00   sec   112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]   9.00-10.00  sec   112 MBytes   944 Mbits/sec    0    649 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.10 GBytes   940 Mbits/sec                  receiver

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring
  2023-07-05  6:20     ` Wei Fang
@ 2023-07-05 18:11       ` Jakub Kicinski
  2023-07-06  1:44         ` Wei Fang
  2023-07-08 19:03       ` Andrew Lunn
  1 sibling, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2023-07-05 18:11 UTC (permalink / raw)
  To: Wei Fang
  Cc: Andrew Lunn, davem@davemloft.net, edumazet@google.com,
	pabeni@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, Shenwei Wang,
	Clark Wang, netdev@vger.kernel.org, dl-linux-imx,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org

On Wed, 5 Jul 2023 06:20:26 +0000 Wei Fang wrote:
> > > In addtion, this patch also updates the tx_stop_threshold and the
> > > tx_wake_threshold of the tx ring. In previous logic, the value of
> > > tx_stop_threshold is 217, however, the value of tx_wake_threshold is
> > > 147, it does not make sense that tx_wake_threshold is less than
> > > tx_stop_threshold.  
> > 
> > What do these actually mean? I could imagine that as the ring fills you don't
> > want to stop until it is 217/512 full. There is then some hysteresis, such that it
> > has to drop below 147/512 before more can be added?
> >   
> You must have misunderstood, let me explain more clearly, the queue will be
> stopped when the available BDs are less than tx_stop_threshold (217 BDs). And
> the queue will be waked when the available BDs are greater than tx_wake_threshold
> (147 BDs). So in most cases, the available BDs are greater than tx_wake_threshold
> when the queue is stopped, the only effect is to delay packet sending.
> In my opinion, tx_wake_threshold should be greater than tx_stop_threshold, we
> should stop queue when the available BDs are not enough for a skb to be attached.
> And wake the queue when the available BDs are sufficient for a skb.

But you shouldn't restart the queue for a single packet either.
Restarting for a single packet wastes CPU cycles as there will 
be much more stop / start operations. Two large packets seem 
like the absolute minimum reasonable wake threshold.

Setting tx_stop_threshold to MAX_SKB_FRAGS doesn't seem right either,
as you won't be able to accept a full TSO frame.

Please split the change, the netdev_err_once() should be one patch
and then the change to wake thresh a separate one.
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring
  2023-07-05 18:11       ` Jakub Kicinski
@ 2023-07-06  1:44         ` Wei Fang
  2023-07-06  3:11           ` Jakub Kicinski
  0 siblings, 1 reply; 18+ messages in thread
From: Wei Fang @ 2023-07-06  1:44 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, davem@davemloft.net, edumazet@google.com,
	pabeni@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, Shenwei Wang,
	Clark Wang, netdev@vger.kernel.org, dl-linux-imx,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org

> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: 2023年7月6日 2:11
> To: Wei Fang <wei.fang@nxp.com>
> Cc: Andrew Lunn <andrew@lunn.ch>; davem@davemloft.net;
> edumazet@google.com; pabeni@redhat.com; ast@kernel.org;
> daniel@iogearbox.net; hawk@kernel.org; john.fastabend@gmail.com;
> Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
> <xiaoning.wang@nxp.com>; netdev@vger.kernel.org; dl-linux-imx
> <linux-imx@nxp.com>; linux-kernel@vger.kernel.org; bpf@vger.kernel.org
> Subject: Re: [PATCH net 3/3] net: fec: increase the size of tx ring and update
> thresholds of tx ring
> 
> On Wed, 5 Jul 2023 06:20:26 +0000 Wei Fang wrote:
> > > > In addtion, this patch also updates the tx_stop_threshold and the
> > > > tx_wake_threshold of the tx ring. In previous logic, the value of
> > > > tx_stop_threshold is 217, however, the value of tx_wake_threshold
> > > > is 147, it does not make sense that tx_wake_threshold is less than
> > > > tx_stop_threshold.
> > >
> > > What do these actually mean? I could imagine that as the ring fills
> > > you don't want to stop until it is 217/512 full. There is then some
> > > hysteresis, such that it has to drop below 147/512 before more can be
> added?
> > >
> > You must have misunderstood, let me explain more clearly, the queue
> > will be stopped when the available BDs are less than tx_stop_threshold
> > (217 BDs). And the queue will be waked when the available BDs are
> > greater than tx_wake_threshold
> > (147 BDs). So in most cases, the available BDs are greater than
> > tx_wake_threshold when the queue is stopped, the only effect is to delay
> packet sending.
> > In my opinion, tx_wake_threshold should be greater than
> > tx_stop_threshold, we should stop queue when the available BDs are not
> enough for a skb to be attached.
> > And wake the queue when the available BDs are sufficient for a skb.
> 
> But you shouldn't restart the queue for a single packet either.
> Restarting for a single packet wastes CPU cycles as there will be much more
> stop / start operations. Two large packets seem like the absolute minimum
> reasonable wake threshold.
> 
> Setting tx_stop_threshold to MAX_SKB_FRAGS doesn't seem right either, as
> you won't be able to accept a full TSO frame.
> 
Maybe I should keep the tx_stop_threshold unchanged, so that the queue is
to be stopped if the available BDs is not enough for a full TSO frame to be attached.
And then just change tx_wake_threshold to tx_stop_threshold + 1, which I think it's
more reasonable.

> Please split the change, the netdev_err_once() should be one patch and then
> the change to wake thresh a separate one.
Okay, thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring
  2023-07-06  1:44         ` Wei Fang
@ 2023-07-06  3:11           ` Jakub Kicinski
  2023-07-06  6:14             ` Wei Fang
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2023-07-06  3:11 UTC (permalink / raw)
  To: Wei Fang
  Cc: Andrew Lunn, davem@davemloft.net, edumazet@google.com,
	pabeni@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, Shenwei Wang,
	Clark Wang, netdev@vger.kernel.org, dl-linux-imx,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org

On Thu, 6 Jul 2023 01:44:49 +0000 Wei Fang wrote:
> > But you shouldn't restart the queue for a single packet either.
> > Restarting for a single packet wastes CPU cycles as there will be much more
> > stop / start operations. Two large packets seem like the absolute minimum
> > reasonable wake threshold.
> > 
> > Setting tx_stop_threshold to MAX_SKB_FRAGS doesn't seem right either, as
> > you won't be able to accept a full TSO frame.
> >   
> Maybe I should keep the tx_stop_threshold unchanged, so that the queue is
> to be stopped if the available BDs is not enough for a full TSO frame to be attached.
> And then just change tx_wake_threshold to tx_stop_threshold + 1, which I think it's
> more reasonable.

How about at least tx_stop_threshold + 2 * MAX_SKB_FRAGS ?
If a queue of hundreds of entries is overflowing, we should be able 
to apply a hysteresis of a few tens of entries. Do you see a
difference in drops? The packets from the stack should preferably stay
in the qdiscs instead of the driver queue, where AQM and scheduling can
be applied.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring
  2023-07-06  3:11           ` Jakub Kicinski
@ 2023-07-06  6:14             ` Wei Fang
  0 siblings, 0 replies; 18+ messages in thread
From: Wei Fang @ 2023-07-06  6:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, davem@davemloft.net, edumazet@google.com,
	pabeni@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, Shenwei Wang,
	Clark Wang, netdev@vger.kernel.org, dl-linux-imx,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org

> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: 2023年7月6日 11:12
> To: Wei Fang <wei.fang@nxp.com>
> Cc: Andrew Lunn <andrew@lunn.ch>; davem@davemloft.net;
> edumazet@google.com; pabeni@redhat.com; ast@kernel.org;
> daniel@iogearbox.net; hawk@kernel.org; john.fastabend@gmail.com;
> Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
> <xiaoning.wang@nxp.com>; netdev@vger.kernel.org; dl-linux-imx
> <linux-imx@nxp.com>; linux-kernel@vger.kernel.org; bpf@vger.kernel.org
> Subject: Re: [PATCH net 3/3] net: fec: increase the size of tx ring and update
> thresholds of tx ring
> 
> On Thu, 6 Jul 2023 01:44:49 +0000 Wei Fang wrote:
> > > But you shouldn't restart the queue for a single packet either.
> > > Restarting for a single packet wastes CPU cycles as there will be
> > > much more stop / start operations. Two large packets seem like the
> > > absolute minimum reasonable wake threshold.
> > >
> > > Setting tx_stop_threshold to MAX_SKB_FRAGS doesn't seem right
> > > either, as you won't be able to accept a full TSO frame.
> > >
> > Maybe I should keep the tx_stop_threshold unchanged, so that the queue
> > is to be stopped if the available BDs is not enough for a full TSO frame to be
> attached.
> > And then just change tx_wake_threshold to tx_stop_threshold + 1, which
> > I think it's more reasonable.
> 
> How about at least tx_stop_threshold + 2 * MAX_SKB_FRAGS ?
It's okay. The iperf performance is well as before.

> If a queue of hundreds of entries is overflowing, we should be able to apply a
> hysteresis of a few tens of entries. Do you see a difference in drops? The
I didn't see there was any packet loss.

> packets from the stack should preferably stay in the qdiscs instead of the driver
> queue, where AQM and scheduling can be applied.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring
  2023-07-05  6:20     ` Wei Fang
  2023-07-05 18:11       ` Jakub Kicinski
@ 2023-07-08 19:03       ` Andrew Lunn
  2023-07-10  6:16         ` Wei Fang
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Lunn @ 2023-07-08 19:03 UTC (permalink / raw)
  To: Wei Fang
  Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, Shenwei Wang,
	Clark Wang, netdev@vger.kernel.org, dl-linux-imx,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org

> > How does this affect platforms like Vybrid with its fast Ethernet?
> Sorry, I don't have the Vybrid platform, but I think I don't think it has much
> impact, at most it just takes up some more memory.

It has been 6 months since the page pool patches were posted and i
asked about benchmark results for other platforms like Vybrid. Is it
so hard to get hold or reference platforms? Fugang Duan used to have a
test farm of all sorts of boards and reported to me the regressions i
introduced with MDIO changes and PM changes. As somebody who seems to
be an NXP FEC Maintainer i would expect you to have access to a range
of hardware. Especially since XDP and eBPF is a bit of a niche for
embedded processes which NXP produce. You want to be sure your changes
don't regress the main use cases which i guess are plain networking.

> > Does the burst latency go up?
> No, for fec, when a packet is attached to the BDs, the software will immediately
> trigger the hardware to send the packet. In addition, I think it may improve the
> latency, because the size of the tx ring becomes larger, and more packets can be
> attached to the BD ring for burst traffic.

And a bigger burst means more latency. Read about Buffer
bloat.

While you have iperf running saturating the link, try a ping as
well. How does ping latency change with more TX buffers?

Ideally you want enough transmit buffers to keep the link full, but
not more. If the driver is using BQL, the network stack will help with
this.

> Below are the results on i.MX6UL/8MM/8MP/8ULP/93 platforms, i.MX6UL and
> 8ULP only support Fast ethernet. Others support 1G.

Thanks for the benchmark numbers. Please get into the habit of
including them. We like to see justification for any sort of
performance tweaks.

	Andrew

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring
  2023-07-08 19:03       ` Andrew Lunn
@ 2023-07-10  6:16         ` Wei Fang
  2023-07-12  1:16           ` Andrew Lunn
  0 siblings, 1 reply; 18+ messages in thread
From: Wei Fang @ 2023-07-10  6:16 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, Shenwei Wang,
	Clark Wang, netdev@vger.kernel.org, dl-linux-imx,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org

> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: 2023年7月9日 3:04
> To: Wei Fang <wei.fang@nxp.com>
> Cc: davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; ast@kernel.org; daniel@iogearbox.net;
> hawk@kernel.org; john.fastabend@gmail.com; Shenwei Wang
> <shenwei.wang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
> linux-kernel@vger.kernel.org; bpf@vger.kernel.org
> Subject: Re: [PATCH net 3/3] net: fec: increase the size of tx ring and update
> thresholds of tx ring
> 
> > > How does this affect platforms like Vybrid with its fast Ethernet?
> > Sorry, I don't have the Vybrid platform, but I think I don't think it
> > has much impact, at most it just takes up some more memory.
> 
> It has been 6 months since the page pool patches were posted and i asked
> about benchmark results for other platforms like Vybrid. Is it so hard to get
> hold or reference platforms? Fugang Duan used to have a test farm of all sorts
> of boards and reported to me the regressions i introduced with MDIO changes
> and PM changes. As somebody who seems to be an NXP FEC Maintainer i
> would expect you to have access to a range of hardware. Especially since XDP
> and eBPF is a bit of a niche for embedded processes which NXP produce. You
> want to be sure your changes don't regress the main use cases which i guess
> are plain networking.
> 
Sorry, the Vybrid platform is not currently maintained by us, and Vybrid is also not
included in our usual Yocto SDK RC version. Even I find a Vybrid board, I think it
probably cann't run with the latest kernel image, because the latest kernel image
does not match with the old Yocto SDK, and new Yocto SDK does not support Vybrid
platform. I also asked my colleague in test team who is in charge of ethernet testing,
she hadn't even heard of Vybrid platform.

> > > Does the burst latency go up?
> > No, for fec, when a packet is attached to the BDs, the software will
> > immediately trigger the hardware to send the packet. In addition, I
> > think it may improve the latency, because the size of the tx ring
> > becomes larger, and more packets can be attached to the BD ring for burst
> traffic.
> 
> And a bigger burst means more latency. Read about Buffer bloat.
> 
Perhaps, but not necessarily. For example, in some cases that some burst packets
maybe stay in Qdisc instead of hardware queue if ring size is small.

> While you have iperf running saturating the link, try a ping as well. How does
> ping latency change with more TX buffers?
> 
Per your suggestions, I tested on i.MX8ULP, i.MX8MM and i.MX93 platforms. The
results do not appear to be very different.
i.MX93 with ring size 1024:
fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.114 -c 16
PING 10.193.102.114 (10.193.102.114) 56(84) bytes of data.
64 bytes from 10.193.102.114: icmp_seq=1 ttl=63 time=3.78 ms
64 bytes from 10.193.102.114: icmp_seq=2 ttl=63 time=2.16 ms
64 bytes from 10.193.102.114: icmp_seq=3 ttl=63 time=3.31 ms
64 bytes from 10.193.102.114: icmp_seq=4 ttl=63 time=2.11 ms
64 bytes from 10.193.102.114: icmp_seq=5 ttl=63 time=3.43 ms
64 bytes from 10.193.102.114: icmp_seq=6 ttl=63 time=3.20 ms
64 bytes from 10.193.102.114: icmp_seq=7 ttl=63 time=3.20 ms
64 bytes from 10.193.102.114: icmp_seq=8 ttl=63 time=3.75 ms
64 bytes from 10.193.102.114: icmp_seq=9 ttl=63 time=3.21 ms
64 bytes from 10.193.102.114: icmp_seq=10 ttl=63 time=3.76 ms
64 bytes from 10.193.102.114: icmp_seq=11 ttl=63 time=2.16 ms
64 bytes from 10.193.102.114: icmp_seq=12 ttl=63 time=2.67 ms
64 bytes from 10.193.102.114: icmp_seq=13 ttl=63 time=3.59 ms
64 bytes from 10.193.102.114: icmp_seq=14 ttl=63 time=2.55 ms
64 bytes from 10.193.102.114: icmp_seq=15 ttl=63 time=2.54 ms
64 bytes from 10.193.102.114: icmp_seq=16 ttl=63 time=3.88 ms

--- 10.193.102.114 ping statistics ---
16 packets transmitted, 16 received, 0% packet loss, time 15027ms
rtt min/avg/max/mdev = 2.112/3.082/3.875/0.606 ms

i.MX93 with ring size 512:
fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.114 -c 16
PING 10.193.102.114 (10.193.102.114) 56(84) bytes of data.
64 bytes from 10.193.102.114: icmp_seq=1 ttl=63 time=2.74 ms
64 bytes from 10.193.102.114: icmp_seq=2 ttl=63 time=3.32 ms
64 bytes from 10.193.102.114: icmp_seq=3 ttl=63 time=2.72 ms
64 bytes from 10.193.102.114: icmp_seq=4 ttl=63 time=3.36 ms
64 bytes from 10.193.102.114: icmp_seq=5 ttl=63 time=3.41 ms
64 bytes from 10.193.102.114: icmp_seq=6 ttl=63 time=2.67 ms
64 bytes from 10.193.102.114: icmp_seq=7 ttl=63 time=2.77 ms
64 bytes from 10.193.102.114: icmp_seq=8 ttl=63 time=3.38 ms
64 bytes from 10.193.102.114: icmp_seq=9 ttl=63 time=2.54 ms
64 bytes from 10.193.102.114: icmp_seq=10 ttl=63 time=3.36 ms
64 bytes from 10.193.102.114: icmp_seq=11 ttl=63 time=3.44 ms
64 bytes from 10.193.102.114: icmp_seq=12 ttl=63 time=2.80 ms
64 bytes from 10.193.102.114: icmp_seq=13 ttl=63 time=2.86 ms
64 bytes from 10.193.102.114: icmp_seq=14 ttl=63 time=3.90 ms
64 bytes from 10.193.102.114: icmp_seq=15 ttl=63 time=2.50 ms
64 bytes from 10.193.102.114: icmp_seq=16 ttl=63 time=2.89 ms

--- 10.193.102.114 ping statistics ---
16 packets transmitted, 16 received, 0% packet loss, time 15028ms
rtt min/avg/max/mdev = 2.496/3.040/3.898/0.394 ms

i.MX8MM with ring size 1024:
fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.126 -c 16
PING 10.193.102.126 (10.193.102.126) 56(84) bytes of data.
64 bytes from 10.193.102.126: icmp_seq=1 ttl=127 time=1.34 ms
64 bytes from 10.193.102.126: icmp_seq=2 ttl=127 time=2.07 ms
64 bytes from 10.193.102.126: icmp_seq=3 ttl=127 time=2.40 ms
64 bytes from 10.193.102.126: icmp_seq=4 ttl=127 time=1.48 ms
64 bytes from 10.193.102.126: icmp_seq=5 ttl=127 time=1.69 ms
64 bytes from 10.193.102.126: icmp_seq=6 ttl=127 time=1.54 ms
64 bytes from 10.193.102.126: icmp_seq=7 ttl=127 time=2.30 ms
64 bytes from 10.193.102.126: icmp_seq=8 ttl=127 time=1.94 ms
64 bytes from 10.193.102.126: icmp_seq=9 ttl=127 time=4.25 ms
64 bytes from 10.193.102.126: icmp_seq=10 ttl=127 time=1.75 ms
64 bytes from 10.193.102.126: icmp_seq=11 ttl=127 time=1.25 ms
64 bytes from 10.193.102.126: icmp_seq=12 ttl=127 time=2.04 ms
64 bytes from 10.193.102.126: icmp_seq=13 ttl=127 time=2.31 ms
64 bytes from 10.193.102.126: icmp_seq=14 ttl=127 time=2.18 ms
64 bytes from 10.193.102.126: icmp_seq=15 ttl=127 time=2.25 ms
64 bytes from 10.193.102.126: icmp_seq=16 ttl=127 time=1.37 ms

--- 10.193.102.126 ping statistics ---
16 packets transmitted, 16 received, 0% packet loss, time 15026ms
rtt min/avg/max/mdev = 1.248/2.011/4.250/0.686 ms

i.MX8MM with ring size 512:
fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.126 -c 16
PING 10.193.102.126 (10.193.102.126) 56(84) bytes of data.
64 bytes from 10.193.102.126: icmp_seq=1 ttl=63 time=4.82 ms
64 bytes from 10.193.102.126: icmp_seq=2 ttl=63 time=4.67 ms
64 bytes from 10.193.102.126: icmp_seq=3 ttl=63 time=3.74 ms
64 bytes from 10.193.102.126: icmp_seq=4 ttl=63 time=3.87 ms
64 bytes from 10.193.102.126: icmp_seq=5 ttl=63 time=3.30 ms
64 bytes from 10.193.102.126: icmp_seq=6 ttl=63 time=3.79 ms
64 bytes from 10.193.102.126: icmp_seq=7 ttl=127 time=2.12 ms
64 bytes from 10.193.102.126: icmp_seq=8 ttl=127 time=1.99 ms
64 bytes from 10.193.102.126: icmp_seq=9 ttl=127 time=2.15 ms
64 bytes from 10.193.102.126: icmp_seq=10 ttl=127 time=1.82 ms
64 bytes from 10.193.102.126: icmp_seq=11 ttl=127 time=1.92 ms
64 bytes from 10.193.102.126: icmp_seq=12 ttl=127 time=1.23 ms
64 bytes from 10.193.102.126: icmp_seq=13 ttl=127 time=2.00 ms
64 bytes from 10.193.102.126: icmp_seq=14 ttl=127 time=1.66 ms
64 bytes from 10.193.102.126: icmp_seq=15 ttl=127 time=1.75 ms
64 bytes from 10.193.102.126: icmp_seq=16 ttl=127 time=2.24 ms

--- 10.193.102.126 ping statistics ---
16 packets transmitted, 16 received, 0% packet loss, time 15026ms
rtt min/avg/max/mdev = 1.226/2.691/4.820/1.111 ms

i.MX8ULP with ring size 1024:
fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.216 -c 16
PING 10.193.102.216 (10.193.102.216) 56(84) bytes of data.
64 bytes from 10.193.102.216: icmp_seq=1 ttl=63 time=3.40 ms
64 bytes from 10.193.102.216: icmp_seq=2 ttl=63 time=5.46 ms
64 bytes from 10.193.102.216: icmp_seq=3 ttl=63 time=5.55 ms
64 bytes from 10.193.102.216: icmp_seq=4 ttl=63 time=5.97 ms
64 bytes from 10.193.102.216: icmp_seq=5 ttl=63 time=6.26 ms
64 bytes from 10.193.102.216: icmp_seq=6 ttl=63 time=0.963 ms
64 bytes from 10.193.102.216: icmp_seq=7 ttl=63 time=4.10 ms
64 bytes from 10.193.102.216: icmp_seq=8 ttl=63 time=4.55 ms
64 bytes from 10.193.102.216: icmp_seq=9 ttl=63 time=1.24 ms
64 bytes from 10.193.102.216: icmp_seq=10 ttl=63 time=6.96 ms
64 bytes from 10.193.102.216: icmp_seq=11 ttl=63 time=3.27 ms
64 bytes from 10.193.102.216: icmp_seq=12 ttl=63 time=6.57 ms
64 bytes from 10.193.102.216: icmp_seq=13 ttl=63 time=2.99 ms
64 bytes from 10.193.102.216: icmp_seq=14 ttl=63 time=1.70 ms
64 bytes from 10.193.102.216: icmp_seq=15 ttl=63 time=1.79 ms
64 bytes from 10.193.102.216: icmp_seq=16 ttl=63 time=1.42 ms

--- 10.193.102.216 ping statistics ---
16 packets transmitted, 16 received, 0% packet loss, time 15026ms
rtt min/avg/max/mdev = 0.963/3.886/6.955/2.009 ms

i.MX8ULP with ring size 512:
fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.216 -c 16
PING 10.193.102.216 (10.193.102.216) 56(84) bytes of data.
64 bytes from 10.193.102.216: icmp_seq=1 ttl=63 time=5.70 ms
64 bytes from 10.193.102.216: icmp_seq=2 ttl=63 time=5.89 ms
64 bytes from 10.193.102.216: icmp_seq=3 ttl=63 time=3.37 ms
64 bytes from 10.193.102.216: icmp_seq=4 ttl=63 time=5.07 ms
64 bytes from 10.193.102.216: icmp_seq=5 ttl=63 time=1.47 ms
64 bytes from 10.193.102.216: icmp_seq=6 ttl=63 time=3.45 ms
64 bytes from 10.193.102.216: icmp_seq=7 ttl=63 time=1.35 ms
64 bytes from 10.193.102.216: icmp_seq=8 ttl=63 time=6.62 ms
64 bytes from 10.193.102.216: icmp_seq=9 ttl=63 time=1.41 ms
64 bytes from 10.193.102.216: icmp_seq=10 ttl=63 time=6.43 ms
64 bytes from 10.193.102.216: icmp_seq=11 ttl=63 time=1.41 ms
64 bytes from 10.193.102.216: icmp_seq=12 ttl=63 time=6.75 ms
64 bytes from 10.193.102.216: icmp_seq=13 ttl=63 time=4.76 ms
64 bytes from 10.193.102.216: icmp_seq=14 ttl=63 time=3.85 ms
64 bytes from 10.193.102.216: icmp_seq=15 ttl=63 time=3.50 ms
64 bytes from 10.193.102.216: icmp_seq=16 ttl=63 time=1.37 ms

--- 10.193.102.216 ping statistics ---
16 packets transmitted, 16 received, 0% packet loss, time 15027ms
rtt min/avg/max/mdev = 1.349/3.900/6.749/1.985 ms

> Ideally you want enough transmit buffers to keep the link full, but not more. If
> the driver is using BQL, the network stack will help with this.
> 
> > Below are the results on i.MX6UL/8MM/8MP/8ULP/93 platforms, i.MX6UL
> > and 8ULP only support Fast ethernet. Others support 1G.
> 
> Thanks for the benchmark numbers. Please get into the habit of including
> them. We like to see justification for any sort of performance tweaks.
> 
> 	Andrew

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring
  2023-07-10  6:16         ` Wei Fang
@ 2023-07-12  1:16           ` Andrew Lunn
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Lunn @ 2023-07-12  1:16 UTC (permalink / raw)
  To: Wei Fang
  Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, Shenwei Wang,
	Clark Wang, netdev@vger.kernel.org, dl-linux-imx,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org

> Sorry, the Vybrid platform is not currently maintained by us, and Vybrid is also not
> included in our usual Yocto SDK RC version.

Your Yocto SDK version does not matter. We are talking about mainline
here... You are maintaining the mainline driver, and submitting
patches to extend the mainline driver.

> Even I find a Vybrid board, I think it
> probably cann't run with the latest kernel image, because the latest kernel image
> does not match with the old Yocto SDK, and new Yocto SDK does not support Vybrid
> platform. I also asked my colleague in test team who is in charge of ethernet testing,
> she hadn't even heard of Vybrid platform.

Well 6.5-rc1 does run on Vybrid, i have a number of boards with
it. You will also find that many inflight entertainment systems have a
box under each row of seats in a aeroplane with a Vybrid controlling a
Marvell switch. Take a look at arch/arm/boot/dts/vf610-zii-* You will
also find a number of DTS files for imx5i, imx6, imx7 used for ZII
products which are back of the seat displays, and make heavy use of
networking with the FEC.

So i have in interest in the FEC for all these platforms.

> > And a bigger burst means more latency. Read about Buffer bloat.
> > 
> Perhaps, but not necessarily. For example, in some cases that some burst packets
> maybe stay in Qdisc instead of hardware queue if ring size is small.

Which is what you want, so high priority packets can jump to the head
of the queue.

> > While you have iperf running saturating the link, try a ping as well. How does
> > ping latency change with more TX buffers?
> > 
> Per your suggestions, I tested on i.MX8ULP, i.MX8MM and i.MX93 platforms. The
> results do not appear to be very different.

Thanks for the benchmark numbers. They look O.K.

       Andrew

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-07-12  1:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-04  8:29 [PATCH net 0/3] net: fec: fix some issues of ndo_xdp_xmit() wei.fang
2023-07-04  8:29 ` [PATCH net 1/3] net: fec: dynamically set the NETDEV_XDP_ACT_NDO_XMIT feature of XDP wei.fang
2023-07-04 23:41   ` Andrew Lunn
2023-07-04 23:54     ` Toke Høiland-Jørgensen
2023-07-05  6:31       ` Wei Fang
2023-07-04  8:29 ` [PATCH net 2/3] net: fec: recycle pages for transmitted XDP frames wei.fang
2023-07-04 23:48   ` Andrew Lunn
2023-07-05  1:56     ` Wei Fang
2023-07-04  8:29 ` [PATCH net 3/3] net: fec: increase the size of tx ring and update thresholds of tx ring wei.fang
2023-07-05  0:25   ` Andrew Lunn
2023-07-05  6:20     ` Wei Fang
2023-07-05 18:11       ` Jakub Kicinski
2023-07-06  1:44         ` Wei Fang
2023-07-06  3:11           ` Jakub Kicinski
2023-07-06  6:14             ` Wei Fang
2023-07-08 19:03       ` Andrew Lunn
2023-07-10  6:16         ` Wei Fang
2023-07-12  1:16           ` Andrew Lunn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).