* [PATCH net-next v2 0/2] net: mvneta: improve rx performance
From: Jisheng Zhang @ 2017-02-17 10:02 UTC
To: linux-arm-kernel

In hot code path such as mvneta_rx_hwbm() and mvneta_rx_swbm(), we may
access fields of rx_desc. The rx_desc is allocated by
dma_alloc_coherent, it's uncacheable if the device isn't cache
coherent, reading from uncached memory is fairly slow.

patch1 reuses the read out status to avoid getting the status field of
rx_desc again.

patch2 uses cacheable memory to store the rx buffer DMA address.

We get the following performance data on Marvell BG4CT Platforms
(tested with iperf):

before the patch:
receiving 1GB in mvneta_rx_swbm() costs 149265960 ns

after the patch:
receiving 1GB in mvneta_rx_swbm() costs 1421565640 ns

We saved 4.76% time.

RFC: can we do similar modification for tx? If yes, I can prepare a v2.

Basically, these two patches do what Arnd mentioned in [1].

Hi Arnd,

I added "Suggested-by you" tag, I hope you don't mind ;)

Thanks

[1] https://www.spinics.net/lists/netdev/msg405889.html

Since v1:
  - correct the performance data typo

Jisheng Zhang (2):
  net: mvneta: avoid getting status from rx_desc as much as possible
  net: mvneta: Use cacheable memory to store the rx buffer DMA address

 drivers/net/ethernet/marvell/mvneta.c | 36 ++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

-- 
2.11.0
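For reference, the quoted saving can be checked with a few lines of C.
This is only a sketch: demo_saving_percent() is a hypothetical helper
that is not part of the series, and the inputs are assumed to be the two
timings given in the 2/2 commit message (1492659600 ns before,
1421565640 ns after).

```c
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch series: relative time saved,
 * in percent, when a run drops from before_ns to after_ns.
 * The subtraction is done in double so that a "before" value smaller
 * than "after" yields a negative result instead of wrapping around.
 */
static double demo_saving_percent(uint64_t before_ns, uint64_t after_ns)
{
	return 100.0 * ((double)before_ns - (double)after_ns) /
	       (double)before_ns;
}
```

Calling demo_saving_percent(1492659600ULL, 1421565640ULL) gives roughly
4.76. The 149265960 ns figure cannot produce that result, since it is
smaller than the "after" timing.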
* [PATCH net-next v2 1/2] net: mvneta: avoid getting status from rx_desc as much as possible
From: Jisheng Zhang @ 2017-02-17 10:02 UTC
To: linux-arm-kernel

In hot code path mvneta_rx_hwbm(), the rx_desc->status is read twice.
The rx_desc is allocated by dma_alloc_coherent, it's uncacheable if
the device isn't cache-coherent, reading from uncached memory is
fairly slow. So reuse the read out rx_status to avoid the second
reading from uncached memory.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/marvell/mvneta.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 61dd4462411c..06df72b8da85 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -313,8 +313,8 @@
 	 ((addr >= txq->tso_hdrs_phys) && \
 	  (addr < txq->tso_hdrs_phys + txq->size * TSO_HEADER_SIZE))
 
-#define MVNETA_RX_GET_BM_POOL_ID(rxd) \
-	(((rxd)->status & MVNETA_RXD_BM_POOL_MASK) >> MVNETA_RXD_BM_POOL_SHIFT)
+#define MVNETA_RX_GET_BM_POOL_ID(status) \
+	(((status) & MVNETA_RXD_BM_POOL_MASK) >> MVNETA_RXD_BM_POOL_SHIFT)
 
 struct mvneta_statistic {
 	unsigned short offset;
@@ -1900,7 +1900,7 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 		for (i = 0; i < rx_done; i++) {
 			struct mvneta_rx_desc *rx_desc =
 						  mvneta_rxq_next_desc_get(rxq);
-			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
+			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc->status);
 			struct mvneta_bm_pool *bm_pool;
 
 			bm_pool = &pp->bm_priv->bm_pools[pool_id];
@@ -2075,7 +2075,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
 		phys_addr = rx_desc->buf_phys_addr;
-		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
+		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_status);
 		bm_pool = &pp->bm_priv->bm_pools[pool_id];
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
-- 
2.11.0
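The idea behind this patch can be illustrated outside the kernel with a
small, self-contained sketch. Everything prefixed demo_ or DEMO_ is
hypothetical, and the mask and shift values are illustrative assumptions
rather than the driver's definitions. The volatile field stands in for a
descriptor living in coherent (possibly uncached) memory: the status
word is loaded once into a local, and the macro works on that copy
instead of dereferencing the descriptor again.

```c
#include <stdint.h>

/* Illustrative values only; the real definitions live in mvneta.c. */
#define DEMO_RXD_BM_POOL_SHIFT	13
#define DEMO_RXD_BM_POOL_MASK	(3u << DEMO_RXD_BM_POOL_SHIFT)
#define DEMO_RXD_ERR_SUMMARY	(1u << 16)

/* Takes an already-loaded status word, like the reworked macro. */
#define DEMO_RX_GET_BM_POOL_ID(status) \
	(((status) & DEMO_RXD_BM_POOL_MASK) >> DEMO_RXD_BM_POOL_SHIFT)

/* Stand-in for a descriptor in coherent (possibly uncached) memory. */
struct demo_rx_desc {
	volatile uint32_t status;
};

struct demo_rx_info {
	uint8_t pool_id;
	int has_error;
};

/* Loads desc->status exactly once and derives everything from the copy. */
static struct demo_rx_info demo_parse_status(const struct demo_rx_desc *desc)
{
	uint32_t rx_status = desc->status;	/* the only descriptor read */
	struct demo_rx_info info;

	info.pool_id = (uint8_t)DEMO_RX_GET_BM_POOL_ID(rx_status);
	info.has_error = (rx_status & DEMO_RXD_ERR_SUMMARY) != 0;
	return info;
}
```

Passing the value rather than the descriptor pointer makes the number of
descriptor reads visible at the call site, which is the point of the
macro change.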
* [PATCH net-next v2 1/2] net: mvneta: avoid getting status from rx_desc as much as possible
From: Gregory CLEMENT @ 2017-02-17 13:35 UTC
To: linux-arm-kernel

Hi Jisheng,

On ven., févr. 17 2017, Jisheng Zhang <jszhang@marvell.com> wrote:

> In hot code path mvneta_rx_hwbm(), the rx_desc->status is read twice.
> The rx_desc is allocated by dma_alloc_coherent, it's uncacheable if
> the device isn't cache-coherent, reading from uncached memory is
> fairly slow. So reuse the read out rx_status to avoid the second
> reading from uncached memory.
>
> Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
> Suggested-by: Arnd Bergmann <arnd@arndb.de>

This one is OK and I didn't see a regression:

Tested-by: Gregory CLEMENT <gregory.clement@free-electrons.com>

Gregory

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
* [PATCH net-next v2 2/2] net: mvneta: Use cacheable memory to store the rx buffer DMA address
From: Jisheng Zhang @ 2017-02-17 10:02 UTC
To: linux-arm-kernel

In hot code path such as mvneta_rx_hwbm() and mvneta_rx_swbm(), the
buf_phys_addr field of rx_desc is accessed. The rx_desc is allocated by
dma_alloc_coherent, it's uncacheable if the device isn't cache
coherent, reading from uncached memory is fairly slow. This patch uses
cacheable memory to store the rx buffer DMA address. We get the
following performance data on Marvell BG4CT Platforms (tested with
iperf):

before the patch:
receiving 1GB in mvneta_rx_swbm() costs 1492659600 ns

after the patch:
receiving 1GB in mvneta_rx_swbm() costs 1421565640 ns

We saved 4.76% time.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/marvell/mvneta.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 06df72b8da85..e24c3028fe1d 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -580,6 +580,9 @@ struct mvneta_rx_queue {
 	/* Virtual address of the RX buffer */
 	void  **buf_virt_addr;
 
+	/* DMA address of the RX buffer */
+	dma_addr_t *buf_dma_addr;
+
 	/* Virtual address of the RX DMA descriptors array */
 	struct mvneta_rx_desc *descs;
 
@@ -1617,6 +1620,7 @@ static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
 
 	rx_desc->buf_phys_addr = phys_addr;
 	i = rx_desc - rxq->descs;
+	rxq->buf_dma_addr[i] = phys_addr;
 	rxq->buf_virt_addr[i] = virt_addr;
 }
 
@@ -1900,22 +1904,22 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 		for (i = 0; i < rx_done; i++) {
 			struct mvneta_rx_desc *rx_desc =
 						  mvneta_rxq_next_desc_get(rxq);
+			int index = rx_desc - rxq->descs;
 			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc->status);
 			struct mvneta_bm_pool *bm_pool;
 
 			bm_pool = &pp->bm_priv->bm_pools[pool_id];
 			/* Return dropped buffer to the pool */
 			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
-					      rx_desc->buf_phys_addr);
+					      rxq->buf_dma_addr[index]);
 		}
 		return;
 	}
 
 	for (i = 0; i < rxq->size; i++) {
-		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
 		void *data = rxq->buf_virt_addr[i];
 
-		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
+		dma_unmap_single(pp->dev->dev.parent, rxq->buf_dma_addr[i],
 				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
 		mvneta_frag_free(pp->frag_size, data);
 	}
@@ -1953,7 +1957,7 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		index = rx_desc - rxq->descs;
 		data = rxq->buf_virt_addr[index];
-		phys_addr = rx_desc->buf_phys_addr;
+		phys_addr = rxq->buf_dma_addr[index];
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
 		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
@@ -2062,6 +2066,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 	/* Fairness NAPI loop */
 	while (rx_done < rx_todo) {
 		struct mvneta_rx_desc *rx_desc = mvneta_rxq_next_desc_get(rxq);
+		int index = rx_desc - rxq->descs;
 		struct mvneta_bm_pool *bm_pool = NULL;
 		struct sk_buff *skb;
 		unsigned char *data;
@@ -2074,7 +2079,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 		rx_status = rx_desc->status;
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
-		phys_addr = rx_desc->buf_phys_addr;
+		phys_addr = rxq->buf_dma_addr[index];
 		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_status);
 		bm_pool = &pp->bm_priv->bm_pools[pool_id];
 
@@ -2082,8 +2087,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
 err_drop_frame_ret_pool:
 			/* Return the buffer to the pool */
-			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
-					      rx_desc->buf_phys_addr);
+			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool, phys_addr);
 err_drop_frame:
 			dev->stats.rx_errors++;
 			mvneta_rx_error(pp, rx_desc);
@@ -2098,7 +2102,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 				goto err_drop_frame_ret_pool;
 
 			dma_sync_single_range_for_cpu(dev->dev.parent,
-						      rx_desc->buf_phys_addr,
+						      phys_addr,
 						      MVNETA_MH_SIZE + NET_SKB_PAD,
 						      rx_bytes,
 						      DMA_FROM_DEVICE);
@@ -2114,8 +2118,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 			rcvd_bytes += rx_bytes;
 
 			/* Return the buffer to the pool */
-			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
-					      rx_desc->buf_phys_addr);
+			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool, phys_addr);
 
 			/* leave the descriptor and buffer untouched */
 			continue;
@@ -4019,7 +4022,10 @@ static int mvneta_init(struct device *dev, struct mvneta_port *pp)
 		rxq->buf_virt_addr = devm_kmalloc(pp->dev->dev.parent,
 						  rxq->size * sizeof(void *),
 						  GFP_KERNEL);
-		if (!rxq->buf_virt_addr)
+		rxq->buf_dma_addr = devm_kmalloc(pp->dev->dev.parent,
+						 rxq->size * sizeof(dma_addr_t),
+						 GFP_KERNEL);
+		if (!rxq->buf_virt_addr || !rxq->buf_dma_addr)
 			return -ENOMEM;
 	}
 
-- 
2.11.0
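The shadow-array technique used by this patch can be sketched in plain
user-space C. All demo_ names are hypothetical, calloc() stands in for
both dma_alloc_coherent() and devm_kmalloc(), and demo_dma_addr_t is an
assumed stand-in for the kernel's dma_addr_t. The descriptor keeps the
address for the device, while a parallel array in ordinary memory keeps
a copy for the CPU, indexed by the descriptor's position in the ring.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t demo_dma_addr_t;	/* stand-in for dma_addr_t */

struct demo_rx_desc {
	volatile uint32_t buf_phys_addr;	/* field the device consumes */
};

struct demo_rx_queue {
	int size;
	struct demo_rx_desc *descs;	/* would be coherent memory */
	demo_dma_addr_t *buf_dma_addr;	/* cacheable shadow copy */
};

/* Allocates the ring and its shadow array; returns 0 or -1. */
static int demo_rxq_init(struct demo_rx_queue *rxq, int size)
{
	rxq->size = size;
	rxq->descs = calloc((size_t)size, sizeof(*rxq->descs));
	rxq->buf_dma_addr = calloc((size_t)size, sizeof(*rxq->buf_dma_addr));
	if (!rxq->descs || !rxq->buf_dma_addr) {
		free(rxq->descs);
		free(rxq->buf_dma_addr);
		return -1;
	}
	return 0;
}

/* Software refill: write the descriptor and mirror the address. */
static void demo_rx_desc_fill(struct demo_rx_queue *rxq,
			      struct demo_rx_desc *desc,
			      demo_dma_addr_t addr)
{
	long i = desc - rxq->descs;

	desc->buf_phys_addr = (uint32_t)addr;
	rxq->buf_dma_addr[i] = addr;
}

/* Hot path: look the address up without touching the descriptor. */
static demo_dma_addr_t demo_rx_buf_addr(const struct demo_rx_queue *rxq,
					const struct demo_rx_desc *desc)
{
	return rxq->buf_dma_addr[desc - rxq->descs];
}
```

The shadow copy is only trustworthy when software is the sole writer of
the descriptor address. The replies in this thread point out that this
does not hold with HWBM, where the hardware fills in the buffer address.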
* [PATCH net-next v2 2/2] net: mvneta: Use cacheable memory to store the rx buffer DMA address
From: Gregory CLEMENT @ 2017-02-17 13:30 UTC
To: linux-arm-kernel

Hi Jisheng,

On ven., févr. 17 2017, Jisheng Zhang <jszhang@marvell.com> wrote:

> In hot code path such as mvneta_rx_hwbm() and mvneta_rx_swbm, the
> buf_phys_addr field of rx_dec is accessed. The rx_desc is allocated by
> dma_alloc_coherent, it's uncacheable if the device isn't cache
> coherent, reading from uncached memory is fairly slow. This patch uses
> cacheable memory to store the rx buffer DMA address. We get the
> following performance data on Marvell BG4CT Platforms (tested with
> iperf):
>
> before the patch:
> recving 1GB in mvneta_rx_swbm() costs 1492659600 ns
>
> after the patch:
> recving 1GB in mvneta_rx_swbm() costs 1421565640 ns
>
> We saved 4.76% time.

I have just tested it and as I feared, with HWBM enabled, a simple iperf
just doesn't work.

Gregory

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
* [PATCH net-next v2 2/2] net: mvneta: Use cacheable memory to store the rx buffer DMA address
From: Thomas Petazzoni @ 2017-02-17 13:55 UTC
To: linux-arm-kernel

Hello,

On Fri, 17 Feb 2017 14:30:03 +0100, Gregory CLEMENT wrote:

> I have just tested it and as I feared, with HWBM enabled, a simple iperf
> just doesn't work.

And that's expected: the whole point of HWBM is that the buffer into
which a RX packet is placed is allocated by the HW, and its address
stored in the RX descriptor. So the following code:

> >  	rx_desc->buf_phys_addr = phys_addr;
> >  	i = rx_desc - rxq->descs;
> > +	rxq->buf_dma_addr[i] = phys_addr;

Does not make sense, because it's not the SW that refills the RX
descriptors with the address of the RX buffers. It's done by the HW.

With HWBM, I believe you have no choice but to read the physical
address from the RX descriptor. But you can probably optimize things a
little bit by reading it only once, and then storing it into a
cacheable variable.

So maybe:

 - For SWBM, use the strategy proposed by Jisheng
 - For HWBM, at the beginning of the RX completion path, read once the
   rx_desc->buf_phys_addr, and store it in rxq->buf_dma_addr[index]

Of course that's just a very rough proposal. I've been looking mainly
at mvpp2 lately, and I'm not sure I still remember how mvneta works in
the details.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
* [PATCH net-next v2 2/2] net: mvneta: Use cacheable memory to store the rx buffer DMA address
From: Gregory CLEMENT @ 2017-02-17 15:20 UTC
To: linux-arm-kernel

Hi Thomas,

On ven., févr. 17 2017, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote:

> Does not make sense, because it's not the SW that refills the RX
> descriptors with the address of the RX buffers. It's done by the HW.
>
> With HWBM, I believe you have no choice but to read the physical
> address from the RX descriptor. But you can probably optimize things a
> little bit by reading it only once, and then storing it into a
> cacheable variable.
>
> So maybe:
>
>  - For SWBM, use the strategy proposed by Jisheng
>  - For HWBM, at the beginning of the RX completion path, read once the
>    rx_desc->buf_phys_addr, and store it in rxq->buf_dma_addr[index]

For the HWBM path, storing rx_desc->buf_phys_addr in
rxq->buf_dma_addr[index] is not useful, as we only use it in a single
function. But a quick improvement could be to use the phys_addr
variable. Indeed, we store the value of rx_desc->buf_phys_addr in it but
never use it; instead we always read rx_desc->buf_phys_addr.

Gregory

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
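The "read it once into a local" approach discussed for the HWBM path can
be sketched as follows. This is a user-space illustration under stated
assumptions, not driver code: every demo_ name is hypothetical,
demo_sync_for_cpu() and demo_pool_put() merely stand in for the
consumers of the address, and the read counter exists only to make the
number of descriptor accesses observable.

```c
#include <stdint.h>

/* Stand-in for a descriptor whose address field is written by the HW. */
struct demo_rx_desc {
	volatile uint32_t buf_phys_addr;
};

static unsigned int demo_desc_reads;	/* counts descriptor accesses */
static uint32_t demo_last_synced;	/* last address passed to sync */
static uint32_t demo_last_returned;	/* last address given back */

/* Single place where the (slow) descriptor field is read. */
static uint32_t demo_read_buf_addr(const struct demo_rx_desc *desc)
{
	demo_desc_reads++;
	return desc->buf_phys_addr;
}

/* Hypothetical consumers of the buffer address. */
static void demo_sync_for_cpu(uint32_t addr)
{
	demo_last_synced = addr;
}

static void demo_pool_put(uint32_t addr)
{
	demo_last_returned = addr;
}

/* Handles one received descriptor with a single descriptor read. */
static void demo_rx_one(const struct demo_rx_desc *desc)
{
	uint32_t phys_addr = demo_read_buf_addr(desc);	/* read once */

	demo_sync_for_cpu(phys_addr);	/* reuse the local copy */
	demo_pool_put(phys_addr);	/* reuse it again */
}
```

After one call to demo_rx_one(), demo_desc_reads is 1 even though the
address was used twice. No per-queue array is needed here because the
value never leaves the function.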
* [PATCH net-next v2 0/2] net: mvneta: improve rx performance
From: Jisheng Zhang @ 2017-02-17 10:09 UTC
To: linux-arm-kernel

On Fri, 17 Feb 2017 18:02:31 +0800 Jisheng Zhang <jszhang@marvell.com> wrote:

> In hot code path such as mvneta_rx_hwbm() and mvneta_rx_swbm(), we may
> access fields of rx_desc. The rx_desc is allocated by
> dma_alloc_coherent, it's uncacheable if the device isn't cache
> coherent, reading from uncached memory is fairly slow.
>
> patch1 reuses the read out status to getting status field of rx_desc
> again.
>
> patch2 uses cacheable memory to store the rx buffer DMA address.
>
> We get the following performance data on Marvell BG4CT Platforms
> (tested with iperf):
>
> before the patch:
> recving 1GB in mvneta_rx_swbm() costs 149265960 ns

Oops, I still didn't correct the typo here; it should be 1492659600 ns.

Sorry, but I think there will be comments, so I'll fix this typo in v3
when addressing them.
* [PATCH net-next v2 0/2] net: mvneta: improve rx performance
From: Gregory CLEMENT @ 2017-02-17 10:37 UTC
To: linux-arm-kernel

Hi Jisheng,

On ven., févr. 17 2017, Jisheng Zhang <jszhang@marvell.com> wrote:

> In hot code path such as mvneta_rx_hwbm() and mvneta_rx_swbm(), we may
> access fields of rx_desc. The rx_desc is allocated by
> dma_alloc_coherent, it's uncacheable if the device isn't cache
> coherent, reading from uncached memory is fairly slow.

Did you test it with HWBM support?

I am not sure it will work in this case.

Gregory

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
* [PATCH net-next v2 0/2] net: mvneta: improve rx performance
From: Jisheng Zhang @ 2017-02-17 10:44 UTC
To: linux-arm-kernel

On Fri, 17 Feb 2017 11:37:21 +0100 Gregory CLEMENT wrote:

> Hi Jisheng,
>
> On ven., févr. 17 2017, Jisheng Zhang <jszhang@marvell.com> wrote:
>
> > In hot code path such as mvneta_rx_hwbm() and mvneta_rx_swbm(), we may
> > access fields of rx_desc. The rx_desc is allocated by
> > dma_alloc_coherent, it's uncacheable if the device isn't cache
> > coherent, reading from uncached memory is fairly slow.
>
> Did you test it with HWBM support?

No, I didn't test it because I lack such HW, so it would be appreciated
if someone could test with HWBM-capable HW.

> I am not sure it will work in this case.

IMHO, if the mvneta HW doesn't update rx_desc->buf_phys_addr, it can
still work. I don't have a HWBM background, so the above may be wrong.

If this doesn't work for HWBM, I'll submit a v3 that modifies
mvneta_rx_swbm() only.

Thanks,
Jisheng