netdev.vger.kernel.org archive mirror
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Saeed Mahameed <saeedm@mellanox.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	netdev@vger.kernel.org, Or Gerlitz <ogerlitz@mellanox.com>,
	Eran Ben Elisha <eranbe@mellanox.com>,
	Tal Alon <talal@mellanox.com>, Tariq Toukan <tariqt@mellanox.com>,
	Achiad Shochat <achiad@mellanox.com>,
	brouer@redhat.com
Subject: Re: [PATCH net-next 06/13] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Date: Mon, 14 Mar 2016 22:33:44 +0100	[thread overview]
Message-ID: <20160314223344.48621fa7@redhat.com> (raw)
In-Reply-To: <1457703594-9482-7-git-send-email-saeedm@mellanox.com>


On Fri, 11 Mar 2016 15:39:47 +0200 Saeed Mahameed <saeedm@mellanox.com> wrote:

> From: Tariq Toukan <tariqt@mellanox.com>
> 
> Introduce the feature of multi-packet WQE (RX Work Queue Element)
> referred to as (MPWQE or Striding RQ), in which WQEs are larger
> and serve multiple packets each.
> 
> Every WQE consists of many strides of the same size, every received
> packet is aligned to a beginning of a stride and is written to
> consecutive strides within a WQE.

I really like this HW support! :-)

I noticed the "Multi-Packet WQE" send format, but I could not find the
receive part in the programmers ref doc, until I started looking after
"stride".


> In the regular approach, each WQE is big enough to serve one received
> packet of any size up to MTU, or 64K when device LRO is enabled,
> which is very wasteful when dealing with small packets or when LRO
> is enabled.
> 
> For its flexibility, MPWQE allows a better memory utilization (implying
> improvements in CPU utilization and packet rate) as packets consume
> strides according to their size, preserving the rest of the WQE to be
> available for other packets.

It does allow significantly better memory utilization (even if Eric
cannot see it, I can).

One issue with this approach is that we can no longer use the
packet data as the skb->data pointer.  (AFAIK because we cannot use
dma_unmap per packet any longer, and instead we need to use dma_sync).

Thus, for every single packet you are now allocating a new memory area
for skb->data.


> MPWQE default configuration:
> 	NUM WQEs = 16
> 	Strides Per WQE = 1024
> 	Stride Size = 128

> Performance tested on ConnectX4-Lx 50G.
> 
> * Netperf single TCP stream:
> - message size = 1024,  bw raised from ~12300 mbps to 14900 mbps (+20%)
> - message size = 65536, bw raised from ~21800 mbps to 33500 mbps (+50%)
> - with other message sizes we saw some gain or no degradation.
> 
> * Netperf multi TCP stream:
> - No degradation, line rate reached.
> 
> * Pktgen: packet loss in bursts of N small messages (64byte), single
> stream
> - | num packets | packet loss before | packet loss after |
>   |     2K      |        ~1K         |         0         |
>   |    16K      |       ~13K         |         0         |
>   |    32K      |       ~29K         |        14K        |
> 
> As expected, as the driver can receive as many small packets (<=128
> bytes) as the total number of strides in the ring (default = 1024 *
> 16), vs. 1024 (the default ring size regardless of packet size)
> before this feature.
> 
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> Signed-off-by: Achiad Shochat <achiad@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h       |   71 +++++++++++-
>  .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   15 ++-
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  109 +++++++++++++----
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |  126 ++++++++++++++++++--
>  include/linux/mlx5/device.h                        |   39 ++++++-
>  include/linux/mlx5/mlx5_ifc.h                      |   13 ++-
>  6 files changed, 327 insertions(+), 46 deletions(-)
> 
[...]
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -76,6 +76,33 @@ err_free_skb:
>  	return -ENOMEM;
>  }
>  
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +{
> +	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> +	int ret = 0;
> +
> +	wi->dma_info.page = alloc_pages(GFP_ATOMIC | __GFP_COMP | __GFP_COLD,
> +					MLX5_MPWRQ_WQE_PAGE_ORDER);

Order 5 page = 131072 bytes, but we only alloc 16 of them.

> +	if (unlikely(!wi->dma_info.page))
> +		return -ENOMEM;
> +
> +	wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
> +					 rq->wqe_sz, PCI_DMA_FROMDEVICE);

Mapping the entire page is going to make PowerPC owners happy.

> +	if (dma_mapping_error(rq->pdev, wi->dma_info.addr)) {
> +		ret = -ENOMEM;
> +		goto err_put_page;
> +	}
> +
> +	wi->consumed_strides = 0;
> +	wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
> +
> +	return 0;
> +
> +err_put_page:
> +	put_page(wi->dma_info.page);
> +	return ret;
> +}
> +
[...]
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +{
> +	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
> +	u16 stride_ix      = mpwrq_get_cqe_stride_index(cqe);
> +	u32 consumed_bytes = cstrides  * MLX5_MPWRQ_STRIDE_SIZE;
> +	u32 stride_offset  = stride_ix * MLX5_MPWRQ_STRIDE_SIZE;
> +	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
> +	struct mlx5e_mpw_info *wi = &rq->wqe_info[wqe_id];
> +	struct mlx5e_rx_wqe  *wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_id);
> +	struct sk_buff *skb;
> +	u16 byte_cnt;
> +	u16 cqe_bcnt;
> +	u16 headlen;
> +
> +	wi->consumed_strides += cstrides;

Ok, moving N strides, for next round.

> +
> +	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
> +		rq->stats.wqe_err++;
> +		goto mpwrq_cqe_out;
> +	}
> +
> +	if (mpwrq_is_filler_cqe(cqe)) {
> +		rq->stats.mpwqe_filler++;
> +		goto mpwrq_cqe_out;
> +	}
> +
> +	skb = netdev_alloc_skb(rq->netdev, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD);
> +	if (unlikely(!skb))
> +		goto mpwrq_cqe_out;
> +
> +	dma_sync_single_for_cpu(rq->pdev, wi->dma_info.addr + stride_offset,
> +				consumed_bytes, DMA_FROM_DEVICE);
> +
> +	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
> +	headlen = min_t(u16, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD, cqe_bcnt);
> +	skb_copy_to_linear_data(skb,
> +				page_address(wi->dma_info.page) + stride_offset,
> +				headlen);
> +	skb_put(skb, headlen);
> +
> +	byte_cnt = cqe_bcnt - headlen;
> +	if (byte_cnt) {
> +		skb_frag_t *f0 = &skb_shinfo(skb)->frags[0];
> +
> +		skb_shinfo(skb)->nr_frags = 1;
> +
> +		skb->data_len  = byte_cnt;
> +		skb->len      += byte_cnt;
> +		skb->truesize  = SKB_TRUESIZE(skb->len);
> +
> +		get_page(wi->dma_info.page);
> +		skb_frag_set_page(skb, 0, wi->dma_info.page);
> +		skb_frag_size_set(f0, skb->data_len);
> +		f0->page_offset = stride_offset + headlen;
> +	}
> +
> +	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
> +
> +mpwrq_cqe_out:
> +	if (likely(wi->consumed_strides < MLX5_MPWRQ_NUM_STRIDES))
> +		return;

Due to return statement, we keep working on the same big page, only
dma_sync'ing what we need.

> +
> +	dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
> +		       PCI_DMA_FROMDEVICE);

Page is first fully dma_unmap'ed after all stride-entries have been
processed/consumed.

> +	put_page(wi->dma_info.page);
> +	mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
> +}
> +
>  int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>  {
>  	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

