From: Jesper Dangaard Brouer <brouer@redhat.com>
To: netdev@vger.kernel.org
Cc: kafai@fb.com, daniel@iogearbox.net, tom@herbertland.com,
	bblanco@plumgrid.com, john.fastabend@gmail.com,
	gerlitz.or@gmail.com, hannes@stressinduktion.org,
	rana.shahot@gmail.com, tgraf@suug.ch,
	"David S. Miller" <davem@davemloft.net>,
	as754m@att.com, brouer@redhat.com, saeedm@mellanox.com,
	amira@mellanox.com, tzahio@mellanox.com,
	Eric Dumazet <eric.dumazet@gmail.com>
Subject: Re: [net-next PATCH RFC] mlx4: RX prefetch loop
Date: Mon, 11 Jul 2016 13:09:22 +0200	[thread overview]
Message-ID: <20160711130922.636ee4e6@redhat.com> (raw)
In-Reply-To: <20160708160135.27507.77711.stgit@firesoul>

On Fri, 08 Jul 2016 18:02:20 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> This patch is about prefetching without being opportunistic.
> The idea is only to start prefetching on packets that are marked as
> ready/completed in the RX ring.
> 
> This is achieved by splitting the napi_poll call mlx4_en_process_rx_cq()
> loop into two.  The first loop extracts completed CQEs and starts
> prefetching on data and RX descriptors.  The second loop processes the
> real packets.
> 
> Details: The batching of CQEs is limited to 8 in order to avoid
> stressing the LFB (Line Fill Buffer) and cache usage.
> 
> I've left some opportunities for prefetching CQE descriptors.
> 
> 
> The performance improvements on my platform are huge, as I tested this
> on a CPU without DDIO.  The performance for XDP is the same as with
> Brenden's prefetch hack.
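
To restate the quoted description in condensed form, here is a rough
sketch of the two-pass loop structure.  The helper names and types
(rx_poll_two_pass, next_completed_cqe, rx_desc_of, packet_data_of,
process_packet) are made up for illustration only; the real mlx4 code
is in the diff further down.

/* Sketch only -- not the driver code.  Types and helpers are
 * illustrative stand-ins for the mlx4 structures in the patch below.
 * prefetch() is the kernel helper from <linux/prefetch.h>.
 */
#define PREFETCH_BATCH 8

static int rx_poll_two_pass(struct rx_ring *ring, int budget)
{
	struct cqe *cqe_array[PREFETCH_BATCH];
	int i, n, polled = 0;

	do {
		/* Pass 1: collect up to 8 completed CQEs and issue
		 * prefetches for their RX descriptor and packet data,
		 * so the cache-line loads are in flight early. */
		for (n = 0; n < PREFETCH_BATCH && polled < budget; n++) {
			struct cqe *cqe = next_completed_cqe(ring);

			if (!cqe)
				break;
			cqe_array[n] = cqe;
			prefetch(rx_desc_of(ring, cqe));
			prefetch(packet_data_of(ring, cqe));
			polled++;
		}

		/* Pass 2: the real per-packet work; by now the
		 * prefetched lines have (hopefully) arrived. */
		for (i = 0; i < n; i++)
			process_packet(ring, cqe_array[i]);

		/* Only loop again if the batch was full and budget
		 * remains, matching the cqe_more logic in the patch. */
	} while (n == PREFETCH_BATCH && polled < budget);

	return polled;
}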

This patch is based on top of Brenden's patch 11/12, and is meant to
replace patch 12/12.

Prefetching is very important for XDP, especially when using a CPU
without DDIO (here an i7-4790K CPU @ 4.00GHz).

Program xdp1: touches packet data and drops packets:
 * 11,363,925 pkt/s == no-prefetch
 * 21,031,096 pkt/s == brenden's-prefetch
 * 21,062,728 pkt/s == this-prefetch-patch

Program xdp2: writes data (swaps src/dst MAC) and TX-bounces out the same interface:
 *  6,726,482 pkt/s == no-prefetch
 * 10,378,163 pkt/s == brenden's-prefetch
 * 10,622,350 pkt/s == this-prefetch-patch
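
(For reference, the xdp1/xdp2 samples from Brenden's patchset take the
ifindex of the interface as argument; an invocation looks roughly like
the following, with paths assumed:)

 # ./samples/bpf/xdp1 $(cat /sys/class/net/mlx4p1/ifindex)
 # ./samples/bpf/xdp2 $(cat /sys/class/net/mlx4p1/ifindex)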

This patch also benefits the normal network stack (the XDP-specific
prefetch patch does not).

Dropping packets in iptables -t raw:
 * 4,432,519 pps drop == no-prefetch
 * 5,919,690 pps drop == this-prefetch-patch

Dropping packets in iptables -t filter:
 * 2,768,053 pps drop == no-prefetch
 * 4,038,247 pps drop == this-prefetch-patch
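
(Such drop tests need only a single rule per table; the rules below are
an assumed example, the exact match I used is not shown here:)

 # iptables -t raw -I PREROUTING -p udp --dport 9 -j DROP
 # iptables -t filter -I INPUT -p udp --dport 9 -j DROP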


To please Eric, I also ran many different variations of netperf and
didn't see any regressions, only small improvements.  The variation
between netperf runs is too high for those improvements to be
statistically significant.

The worst-case test for this patchset should be netperf TCP_RR, as it
should only have a single packet in the queue.  When running 32 parallel
TCP_RR streams (the netserver sink has 8 cores), I actually saw a small
2% improvement (again with high variation, as this also exercises the
CPU scheduler).
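
(A parallel TCP_RR run like this can be reproduced with a plain shell
loop along these lines; the options are an assumed example, not my
exact command line:)

 $ for i in $(seq 32); do netperf -H $SERVER_IP -t TCP_RR -l 30 & done; wait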

I investigated the TCP_RR case further, as the patch is constructed
not to affect the case of a single packet in the RX queue.  Using my
recent tracepoint change, we can see that with 32 parallel TCP_RR
streams we do have situations where napi_poll had several packets in
the RX ring:

 # perf record -a -e napi:napi_poll sleep 3
 # perf script | awk '{print $5,$14,$15,$16,$17,$18}' | sort -k3n | uniq -c

 521655 napi:napi_poll: mlx4p1 work 0 budget 64
1477872 napi:napi_poll: mlx4p1 work 1 budget 64
 189081 napi:napi_poll: mlx4p1 work 2 budget 64
  12552 napi:napi_poll: mlx4p1 work 3 budget 64
    464 napi:napi_poll: mlx4p1 work 4 budget 64
     16 napi:napi_poll: mlx4p1 work 5 budget 64
      4 napi:napi_poll: mlx4p1 work 6 budget 64

I do find the "work 0" case a little strange... what causes that?


 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   70 +++++++++++++++++++++++++---
>  1 file changed, 62 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 41c76fe00a7f..c5efe03e31ce 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -782,7 +782,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	int doorbell_pending;
>  	struct sk_buff *skb;
>  	int tx_index;
> -	int index;
> +	int index, saved_index, i;
>  	int nr;
>  	unsigned int length;
>  	int polled = 0;
> @@ -790,6 +790,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	int factor = priv->cqe_factor;
>  	u64 timestamp;
>  	bool l2_tunnel;
> +#define PREFETCH_BATCH 8
> +	struct mlx4_cqe *cqe_array[PREFETCH_BATCH];
> +	int cqe_idx;
> +	bool cqe_more;
>  
>  	if (!priv->port_up)
>  		return 0;
> @@ -801,24 +805,75 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	doorbell_pending = 0;
>  	tx_index = (priv->tx_ring_num - priv->rsv_tx_rings) + cq->ring;
>  
> +next_prefetch_batch:
> +	cqe_idx = 0;
> +	cqe_more = false;
> +
>  	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>  	 * descriptor offset can be deduced from the CQE index instead of
>  	 * reading 'cqe->index' */
>  	index = cq->mcq.cons_index & ring->size_mask;
> +	saved_index = index;
>  	cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
>  
> -	/* Process all completed CQEs */
> +	/* Extract and prefetch completed CQEs */
>  	while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
>  		    cq->mcq.cons_index & cq->size)) {
> +		void *data;
>  
>  		frags = ring->rx_info + (index << priv->log_rx_info);
>  		rx_desc = ring->buf + (index << ring->log_stride);
> +		prefetch(rx_desc);
>  
>  		/*
>  		 * make sure we read the CQE after we read the ownership bit
>  		 */
>  		dma_rmb();
>  
> +		cqe_array[cqe_idx++] = cqe;
> +
> +		/* Base error handling here, free handled in next loop */
> +		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> +			     MLX4_CQE_OPCODE_ERROR))
> +			goto skip;
> +
> +		data = page_address(frags[0].page) + frags[0].page_offset;
> +		prefetch(data);
> +	skip:
> +		++cq->mcq.cons_index;
> +		index = (cq->mcq.cons_index) & ring->size_mask;
> +		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
> +		/* likely too slow prefetching CQE here ... do look-a-head ? */
> +		//prefetch(cqe + priv->cqe_size * 3);
> +
> +		if (++polled == budget) {
> +			cqe_more = false;
> +			break;
> +		}
> +		if (cqe_idx == PREFETCH_BATCH) {
> +			cqe_more = true;
> +			// IDEA: Opportunistic prefetch CQEs for next_prefetch_batch?
> +			//for (i = 0; i < PREFETCH_BATCH; i++) {
> +			//	prefetch(cqe + priv->cqe_size * i);
> +			//}
> +			break;
> +		}
> +	}
> +	/* Hint: The cqe_idx will be number of packets, it can be used
> +	 * for bulk allocating SKBs
> +	 */
> +
> +	/* Now, index function as index for rx_desc */
> +	index = saved_index;
> +
> +	/* Process completed CQEs in cqe_array */
> +	for (i = 0; i < cqe_idx; i++) {
> +
> +		cqe = cqe_array[i];
> +
> +		frags = ring->rx_info + (index << priv->log_rx_info);
> +		rx_desc = ring->buf + (index << ring->log_stride);
> +
>  		/* Drop packet on bad receive or bad checksum */
>  		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
>  						MLX4_CQE_OPCODE_ERROR)) {
> @@ -1065,14 +1120,13 @@ next:
>  			mlx4_en_free_frag(priv, frags, nr);
>  
>  consumed:
> -		++cq->mcq.cons_index;
> -		index = (cq->mcq.cons_index) & ring->size_mask;
> -		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
> -		if (++polled == budget)
> -			goto out;
> +		++index;
> +		index = index & ring->size_mask;
>  	}
> +	/* Check for more completed CQEs */
> +	if (cqe_more)
> +		goto next_prefetch_batch;
>  
> -out:
>  	if (doorbell_pending)
>  		mlx4_en_xmit_doorbell(priv->tx_ring[tx_index]);
>  
> 

P.S. For achieving the 21Mpps drop numbers, mlx4_core needs parameter tuning:
 /etc/modprobe.d/mlx4.conf
 options mlx4_core log_num_mgm_entry_size=-2
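
After reloading mlx4_core, the active value can normally be checked via
sysfs (assuming the parameter is exported read-only there):

 # cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size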

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
