Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

From: Tariq Toukan <ttoukan.linux@gmail.com>
To: Eric Dumazet <edumazet@google.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>,
	"David S . Miller" <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>,
	Tariq Toukan <tariqt@mellanox.com>,
	Martin KaFai Lau <kafai@fb.com>,
	Saeed Mahameed <saeedm@mellanox.com>,
	Willem de Bruijn <willemb@google.com>,
	Brenden Blanco <bblanco@plumgrid.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
Date: Tue, 14 Feb 2017 16:56:49 +0200
Message-ID: <cd4f3d91-252b-4796-2bd2-3030c18d9ee6@gmail.com>
In-Reply-To: <CANn89i+udp6Y42D9wqmz7U6LGn1mtDRXpQGHAOAeX25eD0dGnQ@mail.gmail.com>



On 14/02/2017 3:45 PM, Eric Dumazet wrote:
> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>
>> It is important to understand that there are two cases for the cost of
>> an atomic op, which depend on the cache-coherency state of the
>> cacheline.
>>
>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>>
>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>>
> Okay, it seems you guys really want a patch that I said was not giving
> good results
>
> Let me publish the numbers I get, with and without the last (and not
> official) patch.
>
> If I _force_ the user space process to run on the other node,
> then the results are not the ones Alex or you are expecting.
>
> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> and 3.5 Mpps without it.
>
> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95      0.00      0.00      4.50
>
> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>
>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
>      31.07%  [kernel]      [k] skb_release_data
>       4.24%  [kernel]      [k] skb_copy_datagram_iter
>       1.35%  [kernel]      [k] copy_page_to_iter
>       0.98%  [kernel]      [k] _raw_spin_lock
>       0.90%  [kernel]      [k] skb_release_head_state
>       0.60%  [kernel]      [k] tcp_transmit_skb
>       0.51%  [kernel]      [k] mlx4_en_xmit
>       0.33%  [kernel]      [k] ___cache_free
>       0.28%  [kernel]      [k] tcp_rcv_established
>
> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>
>
>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
>      12.92%  [kernel]          [k] napi_gro_frags
>       7.28%  [kernel]          [k] inet_gro_receive
>       7.17%  [kernel]          [k] tcp_gro_receive
>       5.10%  [kernel]          [k] dev_gro_receive
>       4.87%  [kernel]          [k] skb_gro_receive
>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>       2.04%  [kernel]          [k] __build_skb
>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
>       1.01%  [kernel]          [k] tcp4_gro_receive
>       0.65%  [kernel]          [k] kmem_cache_alloc
>       0.45%  [kernel]          [k] _raw_spin_lock
>
> Without the latest patch (the exact patch series v3 I submitted),
> thus with this atomic_inc() in mlx4_en_process_rx_cq instead of only reads.
>
> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51      0.00      0.00      4.50
>
> Profiles of the two cpus :
>
>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
>       6.42%  [kernel]      [k] skb_release_data
>       5.65%  [kernel]      [k] skb_copy_datagram_iter
>       1.83%  [kernel]      [k] copy_page_to_iter
>       1.59%  [kernel]      [k] _raw_spin_lock
>       1.48%  [kernel]      [k] skb_release_head_state
>       0.72%  [kernel]      [k] tcp_transmit_skb
>       0.68%  [kernel]      [k] mlx4_en_xmit
>       0.43%  [kernel]      [k] page_frag_free
>       0.38%  [kernel]      [k] ___cache_free
>       0.37%  [kernel]      [k] tcp_established_options
>       0.37%  [kernel]      [k] __ip_local_out
>
>
>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
>      26.47%  [kernel]          [k] napi_gro_frags
>       7.02%  [kernel]          [k] inet_gro_receive
>       5.89%  [kernel]          [k] tcp_gro_receive
>       5.17%  [kernel]          [k] dev_gro_receive
>       4.80%  [kernel]          [k] skb_gro_receive
>       2.61%  [kernel]          [k] __build_skb
>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
>       0.95%  [kernel]          [k] tcp4_gro_receive
>       0.51%  [kernel]          [k] kmem_cache_alloc
>       0.42%  [kernel]          [k] __inet_lookup_established
>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
>
>
> So probably this will need further analysis, outside of the scope of
> this patch series.
>
> Could we now please Ack this v3 and merge it ?
>
> Thanks.

Thanks, Eric.

As the previous series caused hangs, we must run functional regression
tests over this series as well.
The run has already started, and results will be available tomorrow
morning.

In general, I really like this series. The refactoring looks more
elegant and, functionally, more correct.

However, performance-wise, we fear that the numbers will be drastically
lower with this transition to order-0 pages, because the page allocator
and DMA operations then become critical bottlenecks, especially on
systems where DMA operations are costly, such as ARM or iommu=on.

We already have this exact issue in mlx5, where we moved to order-0
allocations with a fixed-size cache, but that was not enough.
Customers of mlx5 have already complained about the performance
degradation, and it is currently hurting our business.
Our performance regression team gave a clear NACK to doing the same
in mlx4.
So the question is: can we live with this degradation until those
bottlenecks are addressed?
Given our perf experts' feedback, I cannot simply Ack. We need a clear
plan to close the perf gap or reduce the impact.

Internally, I have already implemented "dynamic page-cache" and
"page-reuse" mechanisms in the driver, and together they fully bridge
the performance gap.
That is why I would like to hear from Jesper about the status of his
page_pool API. It is promising and could solve these issues entirely.

Regards,
Tariq

>
>
>
>> Notice the huge difference. And in case 2, it is enough that the remote
>> CPU reads the cacheline and brings it into "Shared" (MESI) state, and
>> the local CPU then does the atomic op.
>>
>> One of the key ideas behind the page_pool is that remote CPUs
>> read/detect refcnt==1 (Shared-state), and store the page in a small
>> per-CPU array. When the array is full, it gets bulk returned to the
>> shared-ptr-ring pool. When the "local" CPU needs new pages from the
>> shared-ptr-ring, it uses prefetchw during its bulk refill, to
>> latency-hide the MESI transitions needed.
>>
>> --
>> Best regards,
>>    Jesper Dangaard Brouer
>>    MSc.CS, Principal Kernel Engineer at Red Hat
>>    LinkedIn: http://www.linkedin.com/in/brouer
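
The recycling scheme Jesper describes above (remote CPUs detect
refcnt==1 with a plain read, park the page in a small per-CPU array,
bulk-return the array to a shared ring when it fills, and the local CPU
refills in bulk with a write-intent prefetch) can be sketched roughly as
follows. This is only an illustrative userspace model, not the actual
page_pool API: every name here (page_stub, cpu_cache, shared_ring,
remote_recycle, local_refill) and both sizes are hypothetical, and it is
single-threaded with no locking, whereas a real shared ring would need a
concurrent structure such as the kernel's ptr_ring.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 4    /* hypothetical size of the small per-CPU array */
#define RING_SLOTS  16   /* hypothetical capacity of the shared ring */

struct page_stub { int refcnt; };   /* stand-in for struct page */

struct cpu_cache   { struct page_stub *slot[CACHE_SLOTS]; size_t n; };
struct shared_ring { struct page_stub *slot[RING_SLOTS];  size_t n; };

/* Bulk-return: move cached pages to the shared ring in one pass. */
static size_t cache_flush(struct cpu_cache *c, struct shared_ring *r)
{
    size_t moved = 0;

    while (c->n > 0 && r->n < RING_SLOTS) {
        r->slot[r->n++] = c->slot[--c->n];
        moved++;
    }
    return moved;
}

/*
 * Remote CPU side: recycle a page only when a plain read shows we hold
 * the last reference. No atomic op is issued on this path.
 * Returns 1 if the page was kept for recycling, 0 if the caller should
 * fall back to a normal free.
 */
static int remote_recycle(struct cpu_cache *c, struct shared_ring *r,
                          struct page_stub *p)
{
    assert(p != NULL);

    if (p->refcnt != 1)
        return 0;                 /* page still has other users */
    if (c->n == CACHE_SLOTS)
        return 0;                 /* cache stuck full: ring had no room */

    c->slot[c->n++] = p;
    if (c->n == CACHE_SLOTS)
        cache_flush(c, r);        /* array full: bulk return to the ring */
    return 1;
}

/*
 * Local CPU side: bulk refill from the shared ring, prefetching each
 * page with write intent to hide the cache-coherency transition.
 * __builtin_prefetch is a GCC/Clang builtin; rw=1 requests write intent.
 */
static size_t local_refill(struct shared_ring *r, struct page_stub **out,
                           size_t want)
{
    size_t got = 0;

    while (got < want && r->n > 0) {
        struct page_stub *p = r->slot[--r->n];

        __builtin_prefetch(p, 1);
        out[got++] = p;
    }
    return got;
}
```

The point of batching in both directions is that the shared ring is
touched once per CACHE_SLOTS pages rather than once per page, so the
cross-CPU cacheline traffic is amortized over the batch.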

