From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Saeed Mahameed <saeedm@mellanox.com>
Cc: "David S. Miller" <davem@davemloft.net>,
netdev@vger.kernel.org, Or Gerlitz <ogerlitz@mellanox.com>,
Eran Ben Elisha <eranbe@mellanox.com>,
Tal Alon <talal@mellanox.com>, Tariq Toukan <tariqt@mellanox.com>,
Achiad Shochat <achiad@mellanox.com>,
brouer@redhat.com
Subject: Re: [PATCH net-next 06/13] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Date: Mon, 14 Mar 2016 22:33:44 +0100
Message-ID: <20160314223344.48621fa7@redhat.com>
In-Reply-To: <1457703594-9482-7-git-send-email-saeedm@mellanox.com>

On Fri, 11 Mar 2016 15:39:47 +0200 Saeed Mahameed <saeedm@mellanox.com> wrote:
> From: Tariq Toukan <tariqt@mellanox.com>
>
> Introduce the feature of multi-packet WQE (RX Work Queue Element),
> referred to as MPWQE or Striding RQ, in which WQEs are larger
> and serve multiple packets each.
>
> Every WQE consists of many strides of the same size; every received
> packet is aligned to the beginning of a stride and is written to
> consecutive strides within a WQE.
I really like this HW support! :-)
I noticed the "Multi-Packet WQE" send format, but I could not find the
receive part in the programmer's reference doc, until I started
searching for "stride".
> In the regular approach, each WQE is big enough to serve one
> received packet of any size (up to MTU, or 64K when device LRO is
> enabled), making it very wasteful when dealing with small packets
> or when device LRO is enabled.
>
> Thanks to its flexibility, MPWQE allows better memory utilization
> (implying improvements in CPU utilization and packet rate), as
> packets consume strides according to their size, preserving the rest
> of the WQE available for other packets.
It does allow significantly better memory utilization (even if Eric
cannot see it, I can).

One issue with this approach is that we can no longer use the packet
data as the skb->data pointer (AFAIK because we can no longer
dma_unmap the buffer per packet, and instead have to dma_sync it).
Thus, for every single packet you are now allocating a new memory area
for skb->data.
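
To illustrate what I mean, a minimal sketch of the two RX models (not
the actual mlx5 code; rx_buf, buf_size, page_dma and offset are
hypothetical placeholders):

	/* Classic model: unmap per packet, the packet memory becomes
	 * skb->data directly, no copy and no extra allocation. */
	dma_unmap_single(dev, rx_buf->dma, buf_size, DMA_FROM_DEVICE);
	skb = build_skb(rx_buf->data, buf_size);

	/* Striding RQ model: the big page must stay mapped, so we can
	 * only dma_sync the consumed strides, allocate a fresh
	 * skb->data per packet, and copy (at least) the headers. */
	dma_sync_single_for_cpu(dev, page_dma + offset, len,
				DMA_FROM_DEVICE);
	skb = netdev_alloc_skb(netdev, headlen);
	skb_copy_to_linear_data(skb, page_address(page) + offset,
				headlen);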
> MPWQE default configuration:
> NUM WQEs = 16
> Strides Per WQE = 1024
> Stride Size = 128
> Performance tested on ConnectX-4 Lx 50G.
>
> * Netperf single TCP stream:
> - message size = 1024, bw raised from ~12300 mbps to 14900 mbps (+20%)
> - message size = 65536, bw raised from ~21800 mbps to 33500 mbps (+50%)
> - with other message sizes we saw some gain or no degradation.
>
> * Netperf multi TCP stream:
> - No degradation, line rate reached.
>
> * Pktgen: packet loss in bursts of N small messages (64 byte),
>   single stream
>   | num packets | packet loss before | packet loss after
>   | 2K          | ~1K                | 0
>   | 16K         | ~13K               | 0
>   | 32K         | ~29K               | 14K
>
> As expected, since the driver can now receive as many small packets
> (<=128 bytes) as the total number of strides in the ring (default =
> 1024 * 16), vs. 1024 (the default ring size, regardless of packet
> size) before this feature.
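
Spelling out the arithmetic behind the pktgen table: before, a burst
could at most fill the 1024 WQEs = 1024 packets; now it can fill
16 WQEs * 1024 strides = 16384 small packets, which matches the
zero-loss 16K row above (and at 32K, the ~14K loss is roughly the
burst size minus that ~16K capacity).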
>
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> Signed-off-by: Achiad Shochat <achiad@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en.h | 71 +++++++++++-
> .../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 15 ++-
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 109 +++++++++++++----
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 126 ++++++++++++++++++--
> include/linux/mlx5/device.h | 39 ++++++-
> include/linux/mlx5/mlx5_ifc.h | 13 ++-
> 6 files changed, 327 insertions(+), 46 deletions(-)
>
[...]
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -76,6 +76,33 @@ err_free_skb:
> return -ENOMEM;
> }
>
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +{
> + struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> + int ret = 0;
> +
> + wi->dma_info.page = alloc_pages(GFP_ATOMIC | __GFP_COMP | __GFP_COLD,
> + MLX5_MPWRQ_WQE_PAGE_ORDER);
An order-5 page is 131072 bytes, and we only allocate 16 of them (one
per WQE).
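For readers following along, the arithmetic (assuming 4 KiB base
pages):

	stride size * strides per WQE = 128 B * 1024 = 131072 B = 2^17 B
	get_order(131072) = 17 - 12 = 5  -> one order-5 compound page per WQE
	16 WQEs * 128 KiB = 2 MiB of RX buffer per ring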
> + if (unlikely(!wi->dma_info.page))
> + return -ENOMEM;
> +
> + wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
> + rq->wqe_sz, PCI_DMA_FROMDEVICE);
Mapping the entire page once, instead of mapping each packet buffer
individually, is going to make PowerPC owners happy, as IOMMU
map/unmap operations are expensive there.
> + if (dma_mapping_error(rq->pdev, wi->dma_info.addr)) {
> + ret = -ENOMEM;
> + goto err_put_page;
> + }
> +
> + wi->consumed_strides = 0;
> + wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
> +
> + return 0;
> +
> +err_put_page:
> + put_page(wi->dma_info.page);
> + return ret;
> +}
> +
[...]
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +{
> + u16 cstrides = mpwrq_get_cqe_consumed_strides(cqe);
> + u16 stride_ix = mpwrq_get_cqe_stride_index(cqe);
> + u32 consumed_bytes = cstrides * MLX5_MPWRQ_STRIDE_SIZE;
> + u32 stride_offset = stride_ix * MLX5_MPWRQ_STRIDE_SIZE;
> + u16 wqe_id = be16_to_cpu(cqe->wqe_id);
> + struct mlx5e_mpw_info *wi = &rq->wqe_info[wqe_id];
> + struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_id);
> + struct sk_buff *skb;
> + u16 byte_cnt;
> + u16 cqe_bcnt;
> + u16 headlen;
> +
> + wi->consumed_strides += cstrides;
OK, accumulating the strides this CQE consumed, for the completion
check at the end of the function.
> +
> + if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
> + rq->stats.wqe_err++;
> + goto mpwrq_cqe_out;
> + }
> +
> + if (mpwrq_is_filler_cqe(cqe)) {
> + rq->stats.mpwqe_filler++;
> + goto mpwrq_cqe_out;
> + }
> +
> + skb = netdev_alloc_skb(rq->netdev, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD);
> + if (unlikely(!skb))
> + goto mpwrq_cqe_out;
> +
> + dma_sync_single_for_cpu(rq->pdev, wi->dma_info.addr + stride_offset,
> + consumed_bytes, DMA_FROM_DEVICE);
> +
> + cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
> + headlen = min_t(u16, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD, cqe_bcnt);
> + skb_copy_to_linear_data(skb,
> + page_address(wi->dma_info.page) + stride_offset,
> + headlen);
> + skb_put(skb, headlen);
> +
> + byte_cnt = cqe_bcnt - headlen;
> + if (byte_cnt) {
> + skb_frag_t *f0 = &skb_shinfo(skb)->frags[0];
> +
> + skb_shinfo(skb)->nr_frags = 1;
> +
> + skb->data_len = byte_cnt;
> + skb->len += byte_cnt;
> + skb->truesize = SKB_TRUESIZE(skb->len);
> +
> + get_page(wi->dma_info.page);
> + skb_frag_set_page(skb, 0, wi->dma_info.page);
> + skb_frag_size_set(f0, skb->data_len);
> + f0->page_offset = stride_offset + headlen;
> + }
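So this is a classic copy-break plus header-split: up to
MLX5_MPWRQ_SMALL_PACKET_THRESHOLD bytes are copied into the skb linear
area, and any remaining payload is attached as a page fragment, with
get_page() making the skb hold its own reference on the big page.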
> +
> + mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
> +
> +mpwrq_cqe_out:
> + if (likely(wi->consumed_strides < MLX5_MPWRQ_NUM_STRIDES))
> + return;
Due to the return statement, we keep working on the same big page,
only dma_sync'ing what we need.
> +
> + dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
> + PCI_DMA_FROMDEVICE);
The page is only fully dma_unmap'ed after all stride entries have been
processed/consumed.
> + put_page(wi->dma_info.page);
> + mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
> +}
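
So, as I read it, the lifecycle of the big page across the WQE
lifetime is roughly (my own condensed sketch, not the driver code):

	alloc_pages(order 5)            /* refcount = 1, dma_map once */
	for each packet with a payload frag:
		get_page(page)          /* skb frag holds its own ref */
	when consumed_strides == 1024:
		dma_unmap_page(...)
		put_page(page)          /* drop the RQ's ref; the page
					 * is freed once the last skb
					 * frag reference is released */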
> +
> int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
> {
> struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer