* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
[not found] ` <CAKgT0UeB_e_Z7LM1_r=en8JJdgLhoYFstWpCDQN6iawLYZJKDA@mail.gmail.com>
@ 2017-02-14 12:12 ` Jesper Dangaard Brouer
2017-02-14 13:45 ` Eric Dumazet
0 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-14 12:12 UTC (permalink / raw)
To: Alexander Duyck
Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, brouer,
linux-mm
On Mon, 13 Feb 2017 15:16:35 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:
[...]
> ... As I'm sure Jesper will probably point out the atomic op for
> get_page/page_ref_inc can be pretty expensive if I recall correctly.
It is important to understand that there are two cases for the cost of
an atomic op, which depend on the cache-coherency state of the
cacheline.
Measured on a Skylake i7-6700K CPU @ 4.00GHz:
(1) Local CPU atomic op : 27 cycles(tsc) 6.776 ns
(2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
Notice the huge difference. And for case 2 to occur, it is enough that the remote
CPU reads the cacheline and brings it into the "Shared" (MESI) state before
the local CPU does the atomic op.
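These numbers come from a kernel-space micro-benchmark. A rough userspace
approximation of the same effect (illustrative only, not the actual benchmark
module) is to pin a reader thread on another CPU so it keeps pulling the
counter's cacheline out of the Exclusive/Modified state:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

#define LOOPS 10000000UL

/* counter on its own cacheline, so only the intended sharing is measured */
static _Alignas(64) atomic_long counter;
static atomic_int stop;

static void pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* the "remote" CPU just keeps reading the cacheline */
static void *reader(void *arg)
{
        pin_to_cpu((int)(long)arg);
        while (!atomic_load(&stop))
                (void)atomic_load(&counter);
        return NULL;
}

static void bench(const char *name)
{
        unsigned long long t0 = __rdtsc();
        unsigned long i;

        for (i = 0; i < LOOPS; i++)
                atomic_fetch_add(&counter, 1);          /* lock xadd */
        printf("%s: %.1f cycles/op\n", name, (double)(__rdtsc() - t0) / LOOPS);
}

int main(void)
{
        pthread_t thr;

        pin_to_cpu(0);
        bench("cacheline touched only locally");        /* ~case (1) */

        pthread_create(&thr, NULL, reader, (void *)1L); /* ideally other node */
        usleep(10000);                                  /* let the reader start */
        bench("cacheline read by a remote CPU");        /* ~case (2) */

        atomic_store(&stop, 1);
        pthread_join(&thr, NULL);
        return 0;
}

Build with gcc -O2 -pthread; putting the reader on a CPU of the other NUMA
node gets you closest to case (2).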
One key idea behind the page_pool is that remote CPUs read/detect
refcnt==1 (Shared state) and store the page in a small per-CPU array.
When the array is full, it gets bulk-returned to the shared ptr-ring pool.
When the "local" CPU needs new pages from the shared ptr-ring, it prefetchw's
them during its bulk refill, to latency-hide the MESI transitions needed.
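A minimal sketch of that recycle path (hypothetical names, not the actual
page_pool RFC code) could look like this:

/* assumes NAPI/softirq context, so this_cpu_ptr() use is safe */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/prefetch.h>
#include <linux/ptr_ring.h>

#define PP_CPU_CACHE 32

struct pp_cpu_cache {
        unsigned int count;
        struct page *pages[PP_CPU_CACHE];
};

static DEFINE_PER_CPU(struct pp_cpu_cache, pp_cache);
static struct ptr_ring pp_ring;   /* shared pool, initialized elsewhere */

/* called where the page would otherwise be freed (e.g. skb release) */
static void pp_recycle(struct page *page)
{
        struct pp_cpu_cache *c = this_cpu_ptr(&pp_cache);

        if (page_ref_count(page) != 1) {        /* still referenced elsewhere */
                put_page(page);
                return;
        }

        c->pages[c->count++] = page;
        if (c->count == PP_CPU_CACHE) {         /* bulk return to shared ring */
                while (c->count) {
                        struct page *p = c->pages[--c->count];

                        if (ptr_ring_produce(&pp_ring, p))
                                put_page(p);    /* ring full */
                }
        }
}

/* RX-side refill: prefetchw to latency-hide the MESI transition */
static struct page *pp_get_page(void)
{
        struct page *page = ptr_ring_consume(&pp_ring);

        if (page)
                prefetchw(page);
        else
                page = alloc_page(GFP_ATOMIC);  /* fallback to page allocator */
        return page;
}

The refcnt==1 test is what keeps the fast path free of extra atomics;
anything still referenced elsewhere just falls back to put_page().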
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 12:12 ` [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX Jesper Dangaard Brouer
@ 2017-02-14 13:45 ` Eric Dumazet
2017-02-14 14:12 ` Eric Dumazet
2017-02-14 14:56 ` Tariq Toukan
0 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2017-02-14 13:45 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Alexander Duyck, David S . Miller, netdev, Tariq Toukan,
Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm
On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> It is important to understand that there are two cases for the cost of
> an atomic op, which depend on the cache-coherency state of the
> cacheline.
>
> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>
> (1) Local CPU atomic op : 27 cycles(tsc) 6.776 ns
> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>
Okay, it seems you guys really want a patch that I said was not giving
good results.
Let me publish the numbers I get, with and without the last (and not
official) patch.
If I _force_ the user space process to run on the other node,
then the results are not the ones Alex or you are expecting.
With this patch I get about 2.7 Mpps for this silly single TCP flow,
and 3.5 Mpps without it.
lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
Average: eth0 2699243.20 16663.70 1354783.36 1079.95
0.00 0.00 4.50
Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
54.73% [kernel] [k] copy_user_enhanced_fast_string
31.07% [kernel] [k] skb_release_data
4.24% [kernel] [k] skb_copy_datagram_iter
1.35% [kernel] [k] copy_page_to_iter
0.98% [kernel] [k] _raw_spin_lock
0.90% [kernel] [k] skb_release_head_state
0.60% [kernel] [k] tcp_transmit_skb
0.51% [kernel] [k] mlx4_en_xmit
0.33% [kernel] [k] ___cache_free
0.28% [kernel] [k] tcp_rcv_established
Profile of cpu handling mlx4 softirqs (NUMA node 0)
48.00% [kernel] [k] mlx4_en_process_rx_cq
12.92% [kernel] [k] napi_gro_frags
7.28% [kernel] [k] inet_gro_receive
7.17% [kernel] [k] tcp_gro_receive
5.10% [kernel] [k] dev_gro_receive
4.87% [kernel] [k] skb_gro_receive
2.45% [kernel] [k] mlx4_en_prepare_rx_desc
2.04% [kernel] [k] __build_skb
1.02% [kernel] [k] napi_reuse_skb.isra.95
1.01% [kernel] [k] tcp4_gro_receive
0.65% [kernel] [k] kmem_cache_alloc
0.45% [kernel] [k] _raw_spin_lock
Without the latest patch (the exact patch series v3 I submitted),
thus with this atomic_inc() in mlx4_en_process_rx_cq instead of only reads.
lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
Average: eth0 3566768.50 25638.60 1790345.69 1663.51
0.00 0.00 4.50
Profiles of the two cpus :
74.85% [kernel] [k] copy_user_enhanced_fast_string
6.42% [kernel] [k] skb_release_data
5.65% [kernel] [k] skb_copy_datagram_iter
1.83% [kernel] [k] copy_page_to_iter
1.59% [kernel] [k] _raw_spin_lock
1.48% [kernel] [k] skb_release_head_state
0.72% [kernel] [k] tcp_transmit_skb
0.68% [kernel] [k] mlx4_en_xmit
0.43% [kernel] [k] page_frag_free
0.38% [kernel] [k] ___cache_free
0.37% [kernel] [k] tcp_established_options
0.37% [kernel] [k] __ip_local_out
37.98% [kernel] [k] mlx4_en_process_rx_cq
26.47% [kernel] [k] napi_gro_frags
7.02% [kernel] [k] inet_gro_receive
5.89% [kernel] [k] tcp_gro_receive
5.17% [kernel] [k] dev_gro_receive
4.80% [kernel] [k] skb_gro_receive
2.61% [kernel] [k] __build_skb
2.45% [kernel] [k] mlx4_en_prepare_rx_desc
1.59% [kernel] [k] napi_reuse_skb.isra.95
0.95% [kernel] [k] tcp4_gro_receive
0.51% [kernel] [k] kmem_cache_alloc
0.42% [kernel] [k] __inet_lookup_established
0.34% [kernel] [k] swiotlb_sync_single_for_cpu
So probably this will need further analysis, outside of the scope of
this patch series.
Could we now please Ack this v3 and merge it?
Thanks.
> Notice the huge difference. And in case 2, it is enough that the remote
> CPU reads the cacheline and brings it into "Shared" (MESI) state, and
> the local CPU then does the atomic op.
>
> One key ideas behind the page_pool, is that remote CPUs read/detect
> refcnt==1 (Shared-state), and store the page in a small per-CPU array.
> When array is full, it gets bulk returned to the shared-ptr-ring pool.
> When "local" CPU need new pages, from the shared-ptr-ring it prefetchw
> during it's bulk refill, to latency-hide the MESI transitions needed.
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 13:45 ` Eric Dumazet
@ 2017-02-14 14:12 ` Eric Dumazet
2017-02-14 14:56 ` Tariq Toukan
1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2017-02-14 14:12 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Alexander Duyck, David S . Miller, netdev, Tariq Toukan,
Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm
On Tue, Feb 14, 2017 at 5:45 AM, Eric Dumazet <edumazet@google.com> wrote:
>
> Could we now please Ack this v3 and merge it ?
>
BTW I found the limitation on the sender side.
After doing:
lpaa23:~# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: on TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 400000
pkt-rate-high: 450000
rx-usecs: 16
rx-frames: 44
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 16
tx-frames: 16
tx-usecs-irq: 0
tx-frames-irq: 256
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 128
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
lpaa23:~# ethtool -C eth0 tx-usecs 8 tx-frames 8
-> 36 Gbit on a single TCP flow; this is the highest number I ever got.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 13:45 ` Eric Dumazet
2017-02-14 14:12 ` Eric Dumazet
@ 2017-02-14 14:56 ` Tariq Toukan
2017-02-14 15:51 ` Eric Dumazet
` (2 more replies)
1 sibling, 3 replies; 28+ messages in thread
From: Tariq Toukan @ 2017-02-14 14:56 UTC (permalink / raw)
To: Eric Dumazet, Jesper Dangaard Brouer
Cc: Alexander Duyck, David S . Miller, netdev, Tariq Toukan,
Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm
On 14/02/2017 3:45 PM, Eric Dumazet wrote:
> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>
>> It is important to understand that there are two cases for the cost of
>> an atomic op, which depend on the cache-coherency state of the
>> cacheline.
>>
>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>>
>> (1) Local CPU atomic op : 27 cycles(tsc) 6.776 ns
>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>>
> Okay, it seems you guys really want a patch that I said was not giving
> good results
>
> Let me publish the numbers I get , adding or not the last (and not
> official) patch.
>
> If I _force_ the user space process to run on the other node,
> then the results are not the ones Alex or you are expecting.
>
> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> and 3.5 Mpps without it.
>
> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> Average: eth0 2699243.20 16663.70 1354783.36 1079.95
> 0.00 0.00 4.50
>
> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>
> 54.73% [kernel] [k] copy_user_enhanced_fast_string
> 31.07% [kernel] [k] skb_release_data
> 4.24% [kernel] [k] skb_copy_datagram_iter
> 1.35% [kernel] [k] copy_page_to_iter
> 0.98% [kernel] [k] _raw_spin_lock
> 0.90% [kernel] [k] skb_release_head_state
> 0.60% [kernel] [k] tcp_transmit_skb
> 0.51% [kernel] [k] mlx4_en_xmit
> 0.33% [kernel] [k] ___cache_free
> 0.28% [kernel] [k] tcp_rcv_established
>
> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>
>
> 48.00% [kernel] [k] mlx4_en_process_rx_cq
> 12.92% [kernel] [k] napi_gro_frags
> 7.28% [kernel] [k] inet_gro_receive
> 7.17% [kernel] [k] tcp_gro_receive
> 5.10% [kernel] [k] dev_gro_receive
> 4.87% [kernel] [k] skb_gro_receive
> 2.45% [kernel] [k] mlx4_en_prepare_rx_desc
> 2.04% [kernel] [k] __build_skb
> 1.02% [kernel] [k] napi_reuse_skb.isra.95
> 1.01% [kernel] [k] tcp4_gro_receive
> 0.65% [kernel] [k] kmem_cache_alloc
> 0.45% [kernel] [k] _raw_spin_lock
>
> Without the latest patch (the exact patch series v3 I submitted),
> thus with this atomic_inc() in mlx4_en_process_rx_cq instead of only reads.
>
> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> Average: eth0 3566768.50 25638.60 1790345.69 1663.51
> 0.00 0.00 4.50
>
> Profiles of the two cpus :
>
> 74.85% [kernel] [k] copy_user_enhanced_fast_string
> 6.42% [kernel] [k] skb_release_data
> 5.65% [kernel] [k] skb_copy_datagram_iter
> 1.83% [kernel] [k] copy_page_to_iter
> 1.59% [kernel] [k] _raw_spin_lock
> 1.48% [kernel] [k] skb_release_head_state
> 0.72% [kernel] [k] tcp_transmit_skb
> 0.68% [kernel] [k] mlx4_en_xmit
> 0.43% [kernel] [k] page_frag_free
> 0.38% [kernel] [k] ___cache_free
> 0.37% [kernel] [k] tcp_established_options
> 0.37% [kernel] [k] __ip_local_out
>
>
> 37.98% [kernel] [k] mlx4_en_process_rx_cq
> 26.47% [kernel] [k] napi_gro_frags
> 7.02% [kernel] [k] inet_gro_receive
> 5.89% [kernel] [k] tcp_gro_receive
> 5.17% [kernel] [k] dev_gro_receive
> 4.80% [kernel] [k] skb_gro_receive
> 2.61% [kernel] [k] __build_skb
> 2.45% [kernel] [k] mlx4_en_prepare_rx_desc
> 1.59% [kernel] [k] napi_reuse_skb.isra.95
> 0.95% [kernel] [k] tcp4_gro_receive
> 0.51% [kernel] [k] kmem_cache_alloc
> 0.42% [kernel] [k] __inet_lookup_established
> 0.34% [kernel] [k] swiotlb_sync_single_for_cpu
>
>
> So probably this will need further analysis, outside of the scope of
> this patch series.
>
> Could we now please Ack this v3 and merge it ?
>
> Thanks.
Thanks Eric.
As the previous series caused hangs, we must run functional regression
tests over this series as well.
The run has already started, and results will be available tomorrow morning.
In general, I really like this series. The refactoring looks more
elegant and more correct, functionally.
However, performance-wise we fear that the numbers will be drastically
lower with this transition to order-0 pages, because of the (increasingly
critical) page allocator and DMA operation bottlenecks, especially on
systems with costly DMA operations, such as ARM, iommu=on, etc.
We already have this exact issue in mlx5, where we moved to order-0
allocations with a fixed size cache, but that was not enough.
Customers of mlx5 have already complained about the performance
degradation, and currently this is hurting our business.
We get a clear nack from our performance regression team regarding doing
the same in mlx4.
So, the question is, can we live with this degradation until those
bottleneck challenges are addressed?
Following our perf experts' feedback, I cannot just simply Ack. We need
to have a clear plan to close the perf gap or reduce the impact.
Internally, I already implemented "dynamic page-cache" and "page-reuse"
mechanisms in the driver, and together they totally bridge the
performance gap.
That's why I would like to hear from Jesper what the status of his
page_pool API is; it is promising and could totally solve these issues.
Regards,
Tariq
>
>
>
>> Notice the huge difference. And in case 2, it is enough that the remote
>> CPU reads the cacheline and brings it into "Shared" (MESI) state, and
>> the local CPU then does the atomic op.
>>
>> One key ideas behind the page_pool, is that remote CPUs read/detect
>> refcnt==1 (Shared-state), and store the page in a small per-CPU array.
>> When array is full, it gets bulk returned to the shared-ptr-ring pool.
>> When "local" CPU need new pages, from the shared-ptr-ring it prefetchw
>> during it's bulk refill, to latency-hide the MESI transitions needed.
>>
>> --
>> Best regards,
>> Jesper Dangaard Brouer
>> MSc.CS, Principal Kernel Engineer at Red Hat
>> LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 14:56 ` Tariq Toukan
@ 2017-02-14 15:51 ` Eric Dumazet
2017-02-14 16:03 ` Eric Dumazet
2017-02-14 17:29 ` Tom Herbert
2017-02-14 17:04 ` David Miller
2017-02-14 17:29 ` Alexander Duyck
2 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2017-02-14 15:51 UTC (permalink / raw)
To: Tariq Toukan
Cc: Eric Dumazet, Jesper Dangaard Brouer, Alexander Duyck,
David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
Alexei Starovoitov, linux-mm
On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:
> As the previous series caused hangs, we must run functional regression
> tests over this series as well.
> Run has already started, and results will be available tomorrow morning.
>
> In general, I really like this series. The re-factorization looks more
> elegant and more correct, functionally.
>
> However, performance wise: we fear that the numbers will be drastically
> lower with this transition to order-0 pages,
> because of the (becoming critical) page allocator and dma operations
> bottlenecks, especially on systems with costly
> dma operations, such as ARM, iommu=on, etc...
>
So, again, performance after this patch series is higher,
once you have sensible RX queue parameters for the expected workload.
Only in pathological cases might you have some regression.
The old scheme was _maybe_ better _when_ memory is not fragmented.
When you run hosts for months, memory _is_ fragmented.
You never see that on benchmarks, unless you force memory to be
fragmented.
> We already have this exact issue in mlx5, where we moved to order-0
> allocations with a fixed size cache, but that was not enough.
> Customers of mlx5 have already complained about the performance
> degradation, and currently this is hurting our business.
> We get a clear nack from our performance regression team regarding doing
> the same in mlx4.
> So, the question is, can we live with this degradation until those
> bottleneck challenges are addressed?
Again, there is no degradation.
We have been using order-0 pages for years at Google.
Only when we made the mistake of rebasing onto the upstream driver and
its order-3 pages did we get horrible regressions, causing production outages.
I was silly to believe that the mm layer had gotten better.
> Following our perf experts feedback, I cannot just simply Ack. We need
> to have a clear plan to close the perf gap or reduce the impact.
Your perf experts need to talk to me, or any experts at Google and
Facebook, really.
Anything _relying_ on order-3 pages being available to impress
friends/customers is a lie.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 15:51 ` Eric Dumazet
@ 2017-02-14 16:03 ` Eric Dumazet
2017-02-14 17:29 ` Tom Herbert
1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2017-02-14 16:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: Tariq Toukan, Jesper Dangaard Brouer, Alexander Duyck,
David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
Alexei Starovoitov, linux-mm
> Anything _relying_ on order-3 pages being available to impress
> friends/customers is a lie.
>
BTW, you do understand that on PowerPC right now (64KB pages, and an
order-3 allocation is 8 pages), an Ethernet frame holds 65536*8 = half
a MByte, right?
So any PowerPC host using a mlx4 NIC can easily be brought down by
using a few TCP flows and sending out-of-order packets.
That is in itself quite scary.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 14:56 ` Tariq Toukan
2017-02-14 15:51 ` Eric Dumazet
@ 2017-02-14 17:04 ` David Miller
2017-02-14 17:17 ` David Laight
2017-02-14 19:38 ` Jesper Dangaard Brouer
2017-02-14 17:29 ` Alexander Duyck
2 siblings, 2 replies; 28+ messages in thread
From: David Miller @ 2017-02-14 17:04 UTC (permalink / raw)
To: ttoukan.linux
Cc: edumazet, brouer, alexander.duyck, netdev, tariqt, kafai, saeedm,
willemb, bblanco, ast, eric.dumazet, linux-mm
From: Tariq Toukan <ttoukan.linux@gmail.com>
Date: Tue, 14 Feb 2017 16:56:49 +0200
> Internally, I already implemented "dynamic page-cache" and
> "page-reuse" mechanisms in the driver, and together they totally
> bridge the performance gap.
I worry about a dynamically growing page cache inside of drivers
because it is invisible to the rest of the kernel.
It responds only to local needs.
The price of the real page allocator comes partly because it can
respond to global needs.
If a driver consumes some unreasonable percentage of system memory, it
is keeping that memory from being used by other parts of the system,
even if it would be better for networking to be slightly slower with
less cache because that other thing that needs memory is more
important.
I think this is one of the primary reasons that the MM guys severely
chastise us when we build special purpose local caches into networking
facilities.
And the more I think about it the more I think they are right.
One path I see around all of this is full integration, meaning that
we can free pages into the page allocator which are still DMA mapped,
and future allocations from that device are prioritized to take
still-DMA-mapped objects.
Yes, we still need to make the page allocator faster, but this kind of
work helps everyone not just 100GB ethernet NICs.
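Purely to illustrate that "still DMA mapped" direction (none of these
functions exist today, the names are made up), the interface might look
roughly like:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* hypothetical: metadata the allocator would keep for a still-mapped page */
struct dma_page_meta {
        struct device *dev;
        dma_addr_t dma;
        enum dma_data_direction dir;
};

/* hypothetical: free a page without unmapping it; the allocator remembers
 * the mapping and which device owns it
 */
void free_page_keep_mapping(struct page *page, const struct dma_page_meta *meta);

/* hypothetical: allocate, preferring pages already mapped for @dev; *dma is
 * filled from the remembered mapping, or from a fresh dma_map_page() call
 */
struct page *alloc_page_prefer_mapped(struct device *dev, gfp_t gfp,
                                      dma_addr_t *dma);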
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 17:04 ` David Miller
@ 2017-02-14 17:17 ` David Laight
2017-02-14 17:22 ` David Miller
2017-02-14 19:38 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 28+ messages in thread
From: David Laight @ 2017-02-14 17:17 UTC (permalink / raw)
To: 'David Miller', ttoukan.linux@gmail.com
Cc: edumazet@google.com, brouer@redhat.com, alexander.duyck@gmail.com,
netdev@vger.kernel.org, tariqt@mellanox.com, kafai@fb.com,
saeedm@mellanox.com, willemb@google.com, bblanco@plumgrid.com,
ast@kernel.org, eric.dumazet@gmail.com, linux-mm@kvack.org
From: David Miller
> Sent: 14 February 2017 17:04
...
> One path I see around all of this is full integration. Meaning that
> we can free pages into the page allocator which are still DMA mapped.
> And future allocations from that device are prioritized to take still
> DMA mapped objects.
...
For systems with an 'expensive' IOMMU, has anyone tried separating the
allocation of IOMMU resources (e.g. page table slots) from their
assignment to physical pages?
Provided the page sizes all match, setting up a receive buffer might
be as simple as writing the physical address into the iommu slot
that matches the ring entry.
Or am I thinking about hardware that is much simpler than real life?
David
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 17:17 ` David Laight
@ 2017-02-14 17:22 ` David Miller
0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2017-02-14 17:22 UTC (permalink / raw)
To: David.Laight
Cc: ttoukan.linux, edumazet, brouer, alexander.duyck, netdev, tariqt,
kafai, saeedm, willemb, bblanco, ast, eric.dumazet, linux-mm
From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 14 Feb 2017 17:17:22 +0000
> From: David Miller
>> Sent: 14 February 2017 17:04
> ...
>> One path I see around all of this is full integration. Meaning that
>> we can free pages into the page allocator which are still DMA mapped.
>> And future allocations from that device are prioritized to take still
>> DMA mapped objects.
> ...
>
> For systems with 'expensive' iommu has anyone tried separating the
> allocation of iommu resource (eg page table slots) from their
> assignment to physical pages?
>
> Provided the page sizes all match, setting up a receive buffer might
> be as simple as writing the physical address into the iommu slot
> that matches the ring entry.
>
> Or am I thinking about hardware that is much simpler than real life?
You still will eat an expensive MMIO or hypervisor call to set up the
mapping.
IOMMU is expensive because of two operations, the slot allocation
(which takes locks) and the modification of the IOMMU PTE to setup
or teardown the mapping.
This is why attempts to preallocate slots (which people have looked
into) never really take off. You really have to eliminate the
entire operation to get worthwhile gains.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 15:51 ` Eric Dumazet
2017-02-14 16:03 ` Eric Dumazet
@ 2017-02-14 17:29 ` Tom Herbert
2017-02-15 16:42 ` Tariq Toukan
1 sibling, 1 reply; 28+ messages in thread
From: Tom Herbert @ 2017-02-14 17:29 UTC (permalink / raw)
To: Eric Dumazet
Cc: Tariq Toukan, Eric Dumazet, Jesper Dangaard Brouer,
Alexander Duyck, David S . Miller, netdev, Tariq Toukan,
Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, linux-mm
On Tue, Feb 14, 2017 at 7:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:
>
>> As the previous series caused hangs, we must run functional regression
>> tests over this series as well.
>> Run has already started, and results will be available tomorrow morning.
>>
>> In general, I really like this series. The re-factorization looks more
>> elegant and more correct, functionally.
>>
>> However, performance wise: we fear that the numbers will be drastically
>> lower with this transition to order-0 pages,
>> because of the (becoming critical) page allocator and dma operations
>> bottlenecks, especially on systems with costly
>> dma operations, such as ARM, iommu=on, etc...
>>
>
> So, again, performance after this patch series his higher,
> once you have sensible RX queues parameters, for the expected workload.
>
> Only in pathological cases, you might have some regression.
>
> The old schem was _maybe_ better _when_ memory is not fragmented.
>
> When you run hosts for months, memory _is_ fragmented.
>
> You never see that on benchmarks, unless you force memory being
> fragmented.
>
>
>
>> We already have this exact issue in mlx5, where we moved to order-0
>> allocations with a fixed size cache, but that was not enough.
>> Customers of mlx5 have already complained about the performance
>> degradation, and currently this is hurting our business.
>> We get a clear nack from our performance regression team regarding doing
>> the same in mlx4.
>> So, the question is, can we live with this degradation until those
>> bottleneck challenges are addressed?
>
> Again, there is no degradation.
>
> We have been using order-0 pages for years at Google.
>
> Only when we made the mistake to rebase from the upstream driver and
> order-3 pages we got horrible regressions, causing production outages.
>
> I was silly to believe that mm layer got better.
>
>> Following our perf experts feedback, I cannot just simply Ack. We need
>> to have a clear plan to close the perf gap or reduce the impact.
>
> Your perf experts need to talk to me, or any experts at Google and
> Facebook, really.
>
I agree with this 100%! To be blunt, power users like this are testing
your drivers far beyond what Mellanox is doing and understand how
performance gains in benchmarks translate to possible gains in real
production way more than your perf experts can. Listen to Eric!
Tom
> Anything _relying_ on order-3 pages being available to impress
> friends/customers is a lie.
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 14:56 ` Tariq Toukan
2017-02-14 15:51 ` Eric Dumazet
2017-02-14 17:04 ` David Miller
@ 2017-02-14 17:29 ` Alexander Duyck
2017-02-14 18:46 ` Jesper Dangaard Brouer
2 siblings, 1 reply; 28+ messages in thread
From: Alexander Duyck @ 2017-02-14 17:29 UTC (permalink / raw)
To: Tariq Toukan
Cc: Eric Dumazet, Jesper Dangaard Brouer, David S . Miller, netdev,
Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm
On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
> On 14/02/2017 3:45 PM, Eric Dumazet wrote:
>>
>> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>
>>> It is important to understand that there are two cases for the cost of
>>> an atomic op, which depend on the cache-coherency state of the
>>> cacheline.
>>>
>>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>>>
>>> (1) Local CPU atomic op : 27 cycles(tsc) 6.776 ns
>>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>>>
>> Okay, it seems you guys really want a patch that I said was not giving
>> good results
>>
>> Let me publish the numbers I get , adding or not the last (and not
>> official) patch.
>>
>> If I _force_ the user space process to run on the other node,
>> then the results are not the ones Alex or you are expecting.
>>
>> I have with this patch about 2.7 Mpps of this silly single TCP flow,
>> and 3.5 Mpps without it.
>>
>> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
>> Average: eth0 2699243.20 16663.70 1354783.36 1079.95
>> 0.00 0.00 4.50
>>
>> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>>
>> 54.73% [kernel] [k] copy_user_enhanced_fast_string
>> 31.07% [kernel] [k] skb_release_data
>> 4.24% [kernel] [k] skb_copy_datagram_iter
>> 1.35% [kernel] [k] copy_page_to_iter
>> 0.98% [kernel] [k] _raw_spin_lock
>> 0.90% [kernel] [k] skb_release_head_state
>> 0.60% [kernel] [k] tcp_transmit_skb
>> 0.51% [kernel] [k] mlx4_en_xmit
>> 0.33% [kernel] [k] ___cache_free
>> 0.28% [kernel] [k] tcp_rcv_established
>>
>> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>>
>>
>> 48.00% [kernel] [k] mlx4_en_process_rx_cq
>> 12.92% [kernel] [k] napi_gro_frags
>> 7.28% [kernel] [k] inet_gro_receive
>> 7.17% [kernel] [k] tcp_gro_receive
>> 5.10% [kernel] [k] dev_gro_receive
>> 4.87% [kernel] [k] skb_gro_receive
>> 2.45% [kernel] [k] mlx4_en_prepare_rx_desc
>> 2.04% [kernel] [k] __build_skb
>> 1.02% [kernel] [k] napi_reuse_skb.isra.95
>> 1.01% [kernel] [k] tcp4_gro_receive
>> 0.65% [kernel] [k] kmem_cache_alloc
>> 0.45% [kernel] [k] _raw_spin_lock
>>
>> Without the latest patch (the exact patch series v3 I submitted),
>> thus with this atomic_inc() in mlx4_en_process_rx_cq instead of only
>> reads.
>>
>> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
>> Average: eth0 3566768.50 25638.60 1790345.69 1663.51
>> 0.00 0.00 4.50
>>
>> Profiles of the two cpus :
>>
>> 74.85% [kernel] [k] copy_user_enhanced_fast_string
>> 6.42% [kernel] [k] skb_release_data
>> 5.65% [kernel] [k] skb_copy_datagram_iter
>> 1.83% [kernel] [k] copy_page_to_iter
>> 1.59% [kernel] [k] _raw_spin_lock
>> 1.48% [kernel] [k] skb_release_head_state
>> 0.72% [kernel] [k] tcp_transmit_skb
>> 0.68% [kernel] [k] mlx4_en_xmit
>> 0.43% [kernel] [k] page_frag_free
>> 0.38% [kernel] [k] ___cache_free
>> 0.37% [kernel] [k] tcp_established_options
>> 0.37% [kernel] [k] __ip_local_out
>>
>>
>> 37.98% [kernel] [k] mlx4_en_process_rx_cq
>> 26.47% [kernel] [k] napi_gro_frags
>> 7.02% [kernel] [k] inet_gro_receive
>> 5.89% [kernel] [k] tcp_gro_receive
>> 5.17% [kernel] [k] dev_gro_receive
>> 4.80% [kernel] [k] skb_gro_receive
>> 2.61% [kernel] [k] __build_skb
>> 2.45% [kernel] [k] mlx4_en_prepare_rx_desc
>> 1.59% [kernel] [k] napi_reuse_skb.isra.95
>> 0.95% [kernel] [k] tcp4_gro_receive
>> 0.51% [kernel] [k] kmem_cache_alloc
>> 0.42% [kernel] [k] __inet_lookup_established
>> 0.34% [kernel] [k] swiotlb_sync_single_for_cpu
>>
>>
>> So probably this will need further analysis, outside of the scope of
>> this patch series.
>>
>> Could we now please Ack this v3 and merge it ?
>>
>> Thanks.
>
> Thanks Eric.
>
> As the previous series caused hangs, we must run functional regression tests
> over this series as well.
> Run has already started, and results will be available tomorrow morning.
>
> In general, I really like this series. The re-factorization looks more
> elegant and more correct, functionally.
>
> However, performance wise: we fear that the numbers will be drastically
> lower with this transition to order-0 pages,
> because of the (becoming critical) page allocator and dma operations
> bottlenecks, especially on systems with costly
> dma operations, such as ARM, iommu=on, etc...
So to give you an idea, I originally came up with the approach used in
the Intel drivers when I was dealing with a PowerPC system where the
IOMMU was a requirement. With this set up correctly, the map/unmap
calls should be almost non-existent. Basically the only time we
should have to allocate or free a page is if something is sitting on
the pages for an excessively long time, or if the interrupt is bouncing
between memory nodes, which forces us to free the memory since it is
sitting on the wrong node.
> We already have this exact issue in mlx5, where we moved to order-0
> allocations with a fixed size cache, but that was not enough.
> Customers of mlx5 have already complained about the performance degradation,
> and currently this is hurting our business.
> We get a clear nack from our performance regression team regarding doing the
> same in mlx4.
What form of recycling were you doing? If you were doing the offset-based
setup then that obviously only gets you so much. The advantage
of using the page count is that you get almost a Mobius-strip effect
for your buffers and descriptor rings, where you just keep flipping the
page offset back and forth via an XOR, and the only cost is
dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
do the full allocation, mapping, and unmapping.
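A simplified sketch of that reuse step (made-up names, not the actual
ixgbe/igb or mlx4 code):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

struct rx_buffer {                       /* hypothetical per-descriptor state */
        struct page *page;
        dma_addr_t dma;                  /* mapping set up once at alloc time */
        unsigned int page_offset;        /* 0 or frag_size, i.e. which half */
};

/* Called at RX completion, after dma_sync_single_range_for_cpu() on the
 * half that was just filled and before that half is attached to the skb.
 */
static bool rx_buffer_flip_and_reuse(struct device *dev, struct rx_buffer *buf,
                                     unsigned int frag_size)
{
        /* the other half is still held by the stack: cannot reuse this page */
        if (page_ref_count(buf->page) != 1)
                return false;

        /* flip to the other half of the page (the "mobius strip") */
        buf->page_offset ^= frag_size;

        /* this reference covers the half we are handing to the stack */
        get_page(buf->page);

        /* give the new half back to the device; no map/unmap in the hot path */
        dma_sync_single_range_for_device(dev, buf->dma, buf->page_offset,
                                         frag_size, DMA_FROM_DEVICE);
        return true;
}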
> So, the question is, can we live with this degradation until those
> bottleneck challenges are addressed?
> Following our perf experts feedback, I cannot just simply Ack. We need to
> have a clear plan to close the perf gap or reduce the impact.
I think we need to define what degradation you are expecting to
see. Using the page-count-based approach should actually be better in
some cases than the order-3 page and frag-walking approach, since
the theoretical reuse is infinite instead of fixed.
> Internally, I already implemented "dynamic page-cache" and "page-reuse"
> mechanisms in the driver,
> and together they totally bridge the performance gap.
> That's why I would like to hear from Jesper what is the status of his
> page_pool API, it is promising and could totally solve these issues.
>
> Regards,
> Tariq
The page pool may provide gains but we have to actually see it before
we can guarantee it. If worse comes to worst I might just resort to
standardizing the logic used for the Intel driver page-count-based
approach. Then if nothing else we would have a standard path for all
the drivers to use if we start going down this route.
- Alex
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 17:29 ` Alexander Duyck
@ 2017-02-14 18:46 ` Jesper Dangaard Brouer
2017-02-14 19:02 ` Eric Dumazet
2017-02-14 19:06 ` Alexander Duyck
0 siblings, 2 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-14 18:46 UTC (permalink / raw)
To: Alexander Duyck
Cc: Tariq Toukan, Eric Dumazet, David S . Miller, netdev,
Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm,
brouer
On Tue, 14 Feb 2017 09:29:54 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:
> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:
> >
> >
> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:
> >>
> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> >> <brouer@redhat.com> wrote:
> >>
> >>> It is important to understand that there are two cases for the cost of
> >>> an atomic op, which depend on the cache-coherency state of the
> >>> cacheline.
> >>>
> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
> >>>
> >>> (1) Local CPU atomic op : 27 cycles(tsc) 6.776 ns
> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
> >>>
> >> Okay, it seems you guys really want a patch that I said was not giving
> >> good results
> >>
> >> Let me publish the numbers I get , adding or not the last (and not
> >> official) patch.
> >>
> >> If I _force_ the user space process to run on the other node,
> >> then the results are not the ones Alex or you are expecting.
> >>
> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> >> and 3.5 Mpps without it.
> >>
> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> >> Average: eth0 2699243.20 16663.70 1354783.36 1079.95
> >> 0.00 0.00 4.50
> >>
> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
> >>
> >> 54.73% [kernel] [k] copy_user_enhanced_fast_string
> >> 31.07% [kernel] [k] skb_release_data
> >> 4.24% [kernel] [k] skb_copy_datagram_iter
> >> 1.35% [kernel] [k] copy_page_to_iter
> >> 0.98% [kernel] [k] _raw_spin_lock
> >> 0.90% [kernel] [k] skb_release_head_state
> >> 0.60% [kernel] [k] tcp_transmit_skb
> >> 0.51% [kernel] [k] mlx4_en_xmit
> >> 0.33% [kernel] [k] ___cache_free
> >> 0.28% [kernel] [k] tcp_rcv_established
> >>
> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
> >>
> >>
> >> 48.00% [kernel] [k] mlx4_en_process_rx_cq
> >> 12.92% [kernel] [k] napi_gro_frags
> >> 7.28% [kernel] [k] inet_gro_receive
> >> 7.17% [kernel] [k] tcp_gro_receive
> >> 5.10% [kernel] [k] dev_gro_receive
> >> 4.87% [kernel] [k] skb_gro_receive
> >> 2.45% [kernel] [k] mlx4_en_prepare_rx_desc
> >> 2.04% [kernel] [k] __build_skb
> >> 1.02% [kernel] [k] napi_reuse_skb.isra.95
> >> 1.01% [kernel] [k] tcp4_gro_receive
> >> 0.65% [kernel] [k] kmem_cache_alloc
> >> 0.45% [kernel] [k] _raw_spin_lock
> >>
> >> Without the latest patch (the exact patch series v3 I submitted),
> >> thus with this atomic_inc() in mlx4_en_process_rx_cq instead of only
> >> reads.
> >>
> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> >> Average: eth0 3566768.50 25638.60 1790345.69 1663.51
> >> 0.00 0.00 4.50
> >>
> >> Profiles of the two cpus :
> >>
> >> 74.85% [kernel] [k] copy_user_enhanced_fast_string
> >> 6.42% [kernel] [k] skb_release_data
> >> 5.65% [kernel] [k] skb_copy_datagram_iter
> >> 1.83% [kernel] [k] copy_page_to_iter
> >> 1.59% [kernel] [k] _raw_spin_lock
> >> 1.48% [kernel] [k] skb_release_head_state
> >> 0.72% [kernel] [k] tcp_transmit_skb
> >> 0.68% [kernel] [k] mlx4_en_xmit
> >> 0.43% [kernel] [k] page_frag_free
> >> 0.38% [kernel] [k] ___cache_free
> >> 0.37% [kernel] [k] tcp_established_options
> >> 0.37% [kernel] [k] __ip_local_out
> >>
> >>
> >> 37.98% [kernel] [k] mlx4_en_process_rx_cq
> >> 26.47% [kernel] [k] napi_gro_frags
> >> 7.02% [kernel] [k] inet_gro_receive
> >> 5.89% [kernel] [k] tcp_gro_receive
> >> 5.17% [kernel] [k] dev_gro_receive
> >> 4.80% [kernel] [k] skb_gro_receive
> >> 2.61% [kernel] [k] __build_skb
> >> 2.45% [kernel] [k] mlx4_en_prepare_rx_desc
> >> 1.59% [kernel] [k] napi_reuse_skb.isra.95
> >> 0.95% [kernel] [k] tcp4_gro_receive
> >> 0.51% [kernel] [k] kmem_cache_alloc
> >> 0.42% [kernel] [k] __inet_lookup_established
> >> 0.34% [kernel] [k] swiotlb_sync_single_for_cpu
> >>
> >>
> >> So probably this will need further analysis, outside of the scope of
> >> this patch series.
> >>
> >> Could we now please Ack this v3 and merge it ?
> >>
> >> Thanks.
> >
> > Thanks Eric.
> >
> > As the previous series caused hangs, we must run functional regression tests
> > over this series as well.
> > Run has already started, and results will be available tomorrow morning.
> >
> > In general, I really like this series. The re-factorization looks more
> > elegant and more correct, functionally.
> >
> > However, performance wise: we fear that the numbers will be drastically
> > lower with this transition to order-0 pages,
> > because of the (becoming critical) page allocator and dma operations
> > bottlenecks, especially on systems with costly
> > dma operations, such as ARM, iommu=on, etc...
>
> So to give you an idea I originally came up with the approach used in
> the Intel drivers when I was dealing with a PowerPC system where the
> IOMMU was a requirement. With this setup correctly the map/unmap
> calls should be almost non-existent. Basically the only time we
> should have to allocate or free a page is if something is sitting on
> the pages for an excessively long time or if the interrupt is bouncing
> between memory nodes which forces us to free the memory since it is
> sitting on the wrong node.
>
> > We already have this exact issue in mlx5, where we moved to order-0
> > allocations with a fixed size cache, but that was not enough.
> > Customers of mlx5 have already complained about the performance degradation,
> > and currently this is hurting our business.
> > We get a clear nack from our performance regression team regarding doing the
> > same in mlx4.
>
> What form of recycling were you doing? If you were doing the offset
> based setup then that obviously only gets you so much. The advantage
> to using the page count is that you get almost a mobius strip effect
> for your buffers and descriptor rings where you just keep flipping the
> page offset back and forth via an XOR and the only cost is
> dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
> do the full allocation, mapping, and unmapping.
>
> > So, the question is, can we live with this degradation until those
> > bottleneck challenges are addressed?
> > Following our perf experts feedback, I cannot just simply Ack. We need to
> > have a clear plan to close the perf gap or reduce the impact.
>
> I think we need to define what is the degradation you are expecting to
> see. Using the page count based approach should actually be better in
> some cases than the order 3 page and just walking frags approach since
> the theoretical reuse is infinite instead of fixed.
>
> > Internally, I already implemented "dynamic page-cache" and "page-reuse"
> > mechanisms in the driver,
> > and together they totally bridge the performance gap.
> > That's why I would like to hear from Jesper what is the status of his
> > page_pool API, it is promising and could totally solve these issues.
> >
> > Regards,
> > Tariq
>
> The page pool may provide gains but we have to actually see it before
> we can guarantee it. If worse comes to worse I might just resort to
> standardizing the logic used for the Intel driver page count based
> approach. Then if nothing else we would have a standard path for all
> the drivers to use if we start going down this route.
With this Intel-driver page-count-based recycle approach, the recycle
size is tied to the size of the RX ring, as Eric and Tariq discovered.
And for other performance reasons (memory footprint of walking the RX ring
data structures), we don't want to increase the RX ring sizes. Thus, it
creates two opposing performance needs. That is why I think a more
explicit approach with a pool is more attractive.
How is this approach going to work for XDP?
(XDP doesn't "share" the page, and in general we don't want the extra
atomic.)
We absolutely need recycling with XDP when transmitting out another
device, and the other device's DMA-TX completion needs some way of
returning this page.
What is basically needed is a standardized callback to allow the remote
driver to return the page to the originating driver. As we don't have
an NDO for XDP-forward/transmit yet, we could pass this callback as a
parameter along with the packet-page to send?
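Something like this, entirely hypothetical, is what I have in mind:

#include <linux/mm.h>

/* hypothetical: travels with the packet-page on an XDP forward */
struct xdp_page_return {
        void (*recycle)(struct page *page, void *priv); /* back to RX pool */
        void *priv;                                     /* e.g. the RX ring */
};

/* hypothetical: remote driver, at DMA-TX completion time */
static void xdp_tx_page_done(struct page *page, struct xdp_page_return *ret)
{
        if (ret && ret->recycle)
                ret->recycle(page, ret->priv);  /* return to originating driver */
        else
                put_page(page);                 /* normal release */
}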
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 18:46 ` Jesper Dangaard Brouer
@ 2017-02-14 19:02 ` Eric Dumazet
2017-02-14 20:02 ` Jesper Dangaard Brouer
2017-02-14 19:06 ` Alexander Duyck
1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2017-02-14 19:02 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Alexander Duyck, Tariq Toukan, David S . Miller, netdev,
Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm
On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
>
> With this Intel driver page count based recycle approach, the recycle
> size is tied to the size of the RX ring. As Eric and Tariq discovered.
> And for other performance reasons (memory footprint of walking RX ring
> data-structures), don't want to increase the RX ring sizes. Thus, it
> create two opposite performance needs. That is why I think a more
> explicit approach with a pool is more attractive.
>
> How is this approach doing to work for XDP?
> (XDP doesn't "share" the page, and in-general we don't want the extra
> atomic.)
>
> We absolutely need recycling with XDP, when transmitting out another
> device, and the other devices DMA-TX completion need some way of
> returning this page.
> What is basically needed is a standardized callback to allow the remote
> driver to return the page to the originating driver. As we don't have
> a NDP for XDP-forward/transmit yet, we could pass this callback as a
> parameter along with the packet-page to send?
>
>
mlx4 already has a cache for XDP.
I believe I did not change this part; it should still work.
commit d576acf0a22890cf3f8f7a9b035f1558077f6770
Author: Brenden Blanco <bblanco@plumgrid.com>
Date: Tue Jul 19 12:16:52 2016 -0700
net/mlx4_en: add page recycle to prepare rx ring for tx support
I have not checked whether Tom's recent work added core infra for this cache.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 18:46 ` Jesper Dangaard Brouer
2017-02-14 19:02 ` Eric Dumazet
@ 2017-02-14 19:06 ` Alexander Duyck
2017-02-14 19:50 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 28+ messages in thread
From: Alexander Duyck @ 2017-02-14 19:06 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Tariq Toukan, Eric Dumazet, David S . Miller, netdev,
Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm,
John Fastabend
On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Tue, 14 Feb 2017 09:29:54 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>> >
>> >
>> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:
>> >>
>> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
>> >> <brouer@redhat.com> wrote:
>> >>
>> >>> It is important to understand that there are two cases for the cost of
>> >>> an atomic op, which depend on the cache-coherency state of the
>> >>> cacheline.
>> >>>
>> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>> >>>
>> >>> (1) Local CPU atomic op : 27 cycles(tsc) 6.776 ns
>> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>> >>>
>> >> Okay, it seems you guys really want a patch that I said was not giving
>> >> good results
>> >>
>> >> Let me publish the numbers I get , adding or not the last (and not
>> >> official) patch.
>> >>
>> >> If I _force_ the user space process to run on the other node,
>> >> then the results are not the ones Alex or you are expecting.
>> >>
>> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
>> >> and 3.5 Mpps without it.
>> >>
>> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
>> >> Average: eth0 2699243.20 16663.70 1354783.36 1079.95
>> >> 0.00 0.00 4.50
>> >>
>> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>> >>
>> >> 54.73% [kernel] [k] copy_user_enhanced_fast_string
>> >> 31.07% [kernel] [k] skb_release_data
>> >> 4.24% [kernel] [k] skb_copy_datagram_iter
>> >> 1.35% [kernel] [k] copy_page_to_iter
>> >> 0.98% [kernel] [k] _raw_spin_lock
>> >> 0.90% [kernel] [k] skb_release_head_state
>> >> 0.60% [kernel] [k] tcp_transmit_skb
>> >> 0.51% [kernel] [k] mlx4_en_xmit
>> >> 0.33% [kernel] [k] ___cache_free
>> >> 0.28% [kernel] [k] tcp_rcv_established
>> >>
>> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>> >>
>> >>
>> >> 48.00% [kernel] [k] mlx4_en_process_rx_cq
>> >> 12.92% [kernel] [k] napi_gro_frags
>> >> 7.28% [kernel] [k] inet_gro_receive
>> >> 7.17% [kernel] [k] tcp_gro_receive
>> >> 5.10% [kernel] [k] dev_gro_receive
>> >> 4.87% [kernel] [k] skb_gro_receive
>> >> 2.45% [kernel] [k] mlx4_en_prepare_rx_desc
>> >> 2.04% [kernel] [k] __build_skb
>> >> 1.02% [kernel] [k] napi_reuse_skb.isra.95
>> >> 1.01% [kernel] [k] tcp4_gro_receive
>> >> 0.65% [kernel] [k] kmem_cache_alloc
>> >> 0.45% [kernel] [k] _raw_spin_lock
>> >>
>> >> Without the latest patch (the exact patch series v3 I submitted),
>> >> thus with this atomic_inc() in mlx4_en_process_rx_cq instead of only
>> >> reads.
>> >>
>> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
>> >> Average: eth0 3566768.50 25638.60 1790345.69 1663.51
>> >> 0.00 0.00 4.50
>> >>
>> >> Profiles of the two cpus :
>> >>
>> >> 74.85% [kernel] [k] copy_user_enhanced_fast_string
>> >> 6.42% [kernel] [k] skb_release_data
>> >> 5.65% [kernel] [k] skb_copy_datagram_iter
>> >> 1.83% [kernel] [k] copy_page_to_iter
>> >> 1.59% [kernel] [k] _raw_spin_lock
>> >> 1.48% [kernel] [k] skb_release_head_state
>> >> 0.72% [kernel] [k] tcp_transmit_skb
>> >> 0.68% [kernel] [k] mlx4_en_xmit
>> >> 0.43% [kernel] [k] page_frag_free
>> >> 0.38% [kernel] [k] ___cache_free
>> >> 0.37% [kernel] [k] tcp_established_options
>> >> 0.37% [kernel] [k] __ip_local_out
>> >>
>> >>
>> >> 37.98% [kernel] [k] mlx4_en_process_rx_cq
>> >> 26.47% [kernel] [k] napi_gro_frags
>> >> 7.02% [kernel] [k] inet_gro_receive
>> >> 5.89% [kernel] [k] tcp_gro_receive
>> >> 5.17% [kernel] [k] dev_gro_receive
>> >> 4.80% [kernel] [k] skb_gro_receive
>> >> 2.61% [kernel] [k] __build_skb
>> >> 2.45% [kernel] [k] mlx4_en_prepare_rx_desc
>> >> 1.59% [kernel] [k] napi_reuse_skb.isra.95
>> >> 0.95% [kernel] [k] tcp4_gro_receive
>> >> 0.51% [kernel] [k] kmem_cache_alloc
>> >> 0.42% [kernel] [k] __inet_lookup_established
>> >> 0.34% [kernel] [k] swiotlb_sync_single_for_cpu
>> >>
>> >>
>> >> So probably this will need further analysis, outside of the scope of
>> >> this patch series.
>> >>
>> >> Could we now please Ack this v3 and merge it ?
>> >>
>> >> Thanks.
>> >
>> > Thanks Eric.
>> >
>> > As the previous series caused hangs, we must run functional regression tests
>> > over this series as well.
>> > Run has already started, and results will be available tomorrow morning.
>> >
>> > In general, I really like this series. The re-factorization looks more
>> > elegant and more correct, functionally.
>> >
>> > However, performance wise: we fear that the numbers will be drastically
>> > lower with this transition to order-0 pages,
>> > because of the (becoming critical) page allocator and dma operations
>> > bottlenecks, especially on systems with costly
>> > dma operations, such as ARM, iommu=on, etc...
>>
>> So to give you an idea I originally came up with the approach used in
>> the Intel drivers when I was dealing with a PowerPC system where the
>> IOMMU was a requirement. With this setup correctly the map/unmap
>> calls should be almost non-existent. Basically the only time we
>> should have to allocate or free a page is if something is sitting on
>> the pages for an excessively long time or if the interrupt is bouncing
>> between memory nodes which forces us to free the memory since it is
>> sitting on the wrong node.
>>
>> > We already have this exact issue in mlx5, where we moved to order-0
>> > allocations with a fixed size cache, but that was not enough.
>> > Customers of mlx5 have already complained about the performance degradation,
>> > and currently this is hurting our business.
>> > We get a clear nack from our performance regression team regarding doing the
>> > same in mlx4.
>>
>> What form of recycling were you doing? If you were doing the offset
>> based setup then that obviously only gets you so much. The advantage
>> to using the page count is that you get almost a mobius strip effect
>> for your buffers and descriptor rings where you just keep flipping the
>> page offset back and forth via an XOR and the only cost is
>> dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
>> do the full allocation, mapping, and unmapping.
>>
>> > So, the question is, can we live with this degradation until those
>> > bottleneck challenges are addressed?
>> > Following our perf experts feedback, I cannot just simply Ack. We need to
>> > have a clear plan to close the perf gap or reduce the impact.
>>
>> I think we need to define what is the degradation you are expecting to
>> see. Using the page count based approach should actually be better in
>> some cases than the order 3 page and just walking frags approach since
>> the theoretical reuse is infinite instead of fixed.
>>
>> > Internally, I already implemented "dynamic page-cache" and "page-reuse"
>> > mechanisms in the driver,
>> > and together they totally bridge the performance gap.
>> > That's why I would like to hear from Jesper what is the status of his
>> > page_pool API, it is promising and could totally solve these issues.
>> >
>> > Regards,
>> > Tariq
>>
>> The page pool may provide gains but we have to actually see it before
>> we can guarantee it. If worse comes to worse I might just resort to
>> standardizing the logic used for the Intel driver page count based
>> approach. Then if nothing else we would have a standard path for all
>> the drivers to use if we start going down this route.
>
> With this Intel driver page count based recycle approach, the recycle
> size is tied to the size of the RX ring. As Eric and Tariq discovered.
> And for other performance reasons (memory footprint of walking RX ring
> data-structures), don't want to increase the RX ring sizes. Thus, it
> create two opposite performance needs. That is why I think a more
> explicit approach with a pool is more attractive.
>
> How is this approach doing to work for XDP?
> (XDP doesn't "share" the page, and in-general we don't want the extra
> atomic.)
The extra atomic is moot since we can get rid of it by doing
bulk page count updates.
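For illustration, a sketch of such refcount batching (made-up names, not the
Intel drivers' actual code): take a big batch of references up front, hand
them out for free, and only touch the atomic again when re-arming or
releasing the page.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/page_ref.h>

#define PAGE_REF_BATCH  65536           /* one atomic op buys this many refs */

struct rx_page {                        /* hypothetical per-buffer state */
        struct page *page;
        unsigned int pagecnt_bias;      /* references we still own locally */
};

static struct page *rx_page_alloc(struct rx_page *rp)
{
        rp->page = alloc_page(GFP_ATOMIC);
        if (!rp->page)
                return NULL;
        page_ref_add(rp->page, PAGE_REF_BATCH - 1);     /* the one bulk atomic */
        rp->pagecnt_bias = PAGE_REF_BATCH;
        return rp->page;
}

/* hand one of our pre-taken references to an skb frag: no atomic here */
static void rx_page_hand_out(struct rx_page *rp)
{
        rp->pagecnt_bias--;
}

/* at refill time: re-arm the bias if everyone else is done with the page,
 * otherwise drop our remaining references and let the last user free it
 */
static bool rx_page_recycle(struct rx_page *rp)
{
        if (page_ref_count(rp->page) == rp->pagecnt_bias) {
                page_ref_add(rp->page, PAGE_REF_BATCH - rp->pagecnt_bias);
                rp->pagecnt_bias = PAGE_REF_BATCH;
                return true;
        }
        if (page_ref_sub_and_test(rp->page, rp->pagecnt_bias))
                __free_page(rp->page);
        return false;
}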
Why can't XDP share a page? You have brought up the user space aspect
of things repeatedly, but the AF_PACKET patches John did more or less
demonstrated that isn't the case. What you end up with is that you have
to provide either pure user-space pages or pure kernel pages.
You can't have pages being swapped between the two without introducing
security issues. So even with your pool approach it doesn't matter
which way things are run. Your pool is either device-to-user-space
or device-to-kernel-space, and it can't be both without creating
sharing concerns.
> We absolutely need recycling with XDP, when transmitting out another
> device, and the other devices DMA-TX completion need some way of
> returning this page.
I fully agree here. However, the easiest way of handling this is via
the page count, in my opinion.
> What is basically needed is a standardized callback to allow the remote
> driver to return the page to the originating driver. As we don't have
> a NDP for XDP-forward/transmit yet, we could pass this callback as a
> parameter along with the packet-page to send?
No. This assumes way too much. Most packets aren't going to be
device-to-device routing. We have seen this type of thing and
rejected it multiple times. Don't think driver to driver. This is
driver to network stack, socket, device, virtual machine, storage,
etc. The fact is there are many spots where a frame might get
terminated. This is why the bulk alloc/free of skbs never went
anywhere: we never really addressed the fact that there are many more
spots where a frame is terminated.
This is why I said before what we need to do is have a page destructor
to handle this sort of thing. The idea is you want to have this work
everywhere. Having just drivers do this would make it a nice toy but
completely useless since not too many people are doing
routing/bridging between interfaces. Using the page destructor it is
easy to create a pool of "free" pages that you can then pull your DMA
pages out of or that you can release back into the page allocator.
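(To make the destructor idea concrete - purely a sketch, no such hook
exists today and all of the names below are invented:)

	struct page_destructor {
		/* called when the last reference is dropped, instead of
		 * handing the page straight back to the buddy allocator
		 */
		void (*free)(struct page *page, void *priv);
		void *priv;
	};

	/* a driver (or a generic pool) could then park still-DMA-mapped
	 * pages on a free list and pull RX refills from there
	 */
	static void rx_page_dtor(struct page *page, void *priv)
	{
		struct rx_page_cache *cache = priv;

		if (!rx_page_cache_put(cache, page))
			__free_page(page);	/* cache full: unmap and really free */
	}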
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 17:04 ` David Miller
2017-02-14 17:17 ` David Laight
@ 2017-02-14 19:38 ` Jesper Dangaard Brouer
2017-02-14 19:59 ` David Miller
1 sibling, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-14 19:38 UTC (permalink / raw)
To: David Miller
Cc: ttoukan.linux, edumazet, alexander.duyck, netdev, tariqt, kafai,
saeedm, willemb, bblanco, ast, eric.dumazet, linux-mm, brouer
On Tue, 14 Feb 2017 12:04:26 -0500 (EST)
David Miller <davem@davemloft.net> wrote:
> From: Tariq Toukan <ttoukan.linux@gmail.com>
> Date: Tue, 14 Feb 2017 16:56:49 +0200
>
> > Internally, I already implemented "dynamic page-cache" and
> > "page-reuse" mechanisms in the driver, and together they totally
> > bridge the performance gap.
It sounds like you basically implemented a page_pool scheme...
> I worry about a dynamically growing page cache inside of drivers
> because it is invisible to the rest of the kernel.
Exactly, that is why I wanted a separate standardized thing, which I
call the page_pool, that is part of the MM-tree and interacts with the
page allocator. E.g. it must implement/support a way for the page
allocator to reclaim pages from it (I admit I didn't implement this in
the RFC patches).
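(For example, the reclaim hook could be an ordinary shrinker - only a
sketch; the page_pool fields and the page_pool_evict_one() helper below
are assumptions, not what the RFC implements:)

	static unsigned long pp_shrink_count(struct shrinker *s,
					     struct shrink_control *sc)
	{
		struct page_pool *pool = container_of(s, struct page_pool, shrinker);

		return pool->nr_idle;	/* pages sitting unused in the pool */
	}

	static unsigned long pp_shrink_scan(struct shrinker *s,
					    struct shrink_control *sc)
	{
		struct page_pool *pool = container_of(s, struct page_pool, shrinker);
		unsigned long freed = 0;

		while (freed < sc->nr_to_scan && page_pool_evict_one(pool))
			freed++;	/* unmaps + returns a page to the buddy allocator */

		return freed;
	}

	/* at pool creation time: */
	pool->shrinker.count_objects = pp_shrink_count;
	pool->shrinker.scan_objects  = pp_shrink_scan;
	pool->shrinker.seeks         = DEFAULT_SEEKS;
	register_shrinker(&pool->shrinker);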
> It responds only to local needs.
Generally true, but a side effect of recycling these pages is less
fragmentation of the page allocator/buddy system.
> The price of the real page allocator comes partly because it can
> respond to global needs.
>
> If a driver consumes some unreasonable percentage of system memory, it
> is keeping that memory from being used from other parts of the system
> even if it would be better for networking to be slightly slower with
> less cache because that other thing that needs memory is more
> important.
(That is why I want to have OOM protection at the device level; with the
recycle feedback from the page pool we have this knowledge, and further I
wanted to allow blocking a specific RX queue [1].)
[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html#userspace-delivery-and-oom
> I think this is one of the primary reasons that the MM guys severely
> chastise us when we build special purpose local caches into networking
> facilities.
>
> And the more I think about it the more I think they are right.
+1
> One path I see around all of this is full integration. Meaning that
> we can free pages into the page allocator which are still DMA mapped.
> And future allocations from that device are prioritized to take still
> DMA mapped objects.
I like this idea. Are you saying that this should be done per DMA
engine or per device?
If this is per device, it is almost the page_pool idea.
> Yes, we still need to make the page allocator faster, but this kind of
> work helps everyone not just 100GB ethernet NICs.
True. And Mel already has some generic improvements to the page
allocator queued for the next merge window. And I have the
responsibility to get the bulking API into shape.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 19:06 ` Alexander Duyck
@ 2017-02-14 19:50 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-14 19:50 UTC (permalink / raw)
To: Alexander Duyck
Cc: Tariq Toukan, Eric Dumazet, David S . Miller, netdev,
Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm,
John Fastabend, brouer
On Tue, 14 Feb 2017 11:06:25 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:
> On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Tue, 14 Feb 2017 09:29:54 -0800
> > Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >
> >> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:
> >> >
> >> >
> >> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:
> >> >>
> >> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> >> >> <brouer@redhat.com> wrote:
> >> >>
> >> >>> It is important to understand that there are two cases for the cost of
> >> >>> an atomic op, which depend on the cache-coherency state of the
> >> >>> cacheline.
> >> >>>
> >> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
> >> >>>
> >> >>> (1) Local CPU atomic op : 27 cycles(tsc) 6.776 ns
> >> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
> >> >>>
> >> >> Okay, it seems you guys really want a patch that I said was not giving
> >> >> good results
> >> >>
> >> >> Let me publish the numbers I get , adding or not the last (and not
> >> >> official) patch.
> >> >>
> >> >> If I _force_ the user space process to run on the other node,
> >> >> then the results are not the ones Alex or you are expecting.
> >> >>
> >> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> >> >> and 3.5 Mpps without it.
> >> >>
> >> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> >> >> Average: eth0 2699243.20 16663.70 1354783.36 1079.95
> >> >> 0.00 0.00 4.50
> >> >>
> >> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
> >> >>
> >> >> 54.73% [kernel] [k] copy_user_enhanced_fast_string
> >> >> 31.07% [kernel] [k] skb_release_data
> >> >> 4.24% [kernel] [k] skb_copy_datagram_iter
> >> >> 1.35% [kernel] [k] copy_page_to_iter
> >> >> 0.98% [kernel] [k] _raw_spin_lock
> >> >> 0.90% [kernel] [k] skb_release_head_state
> >> >> 0.60% [kernel] [k] tcp_transmit_skb
> >> >> 0.51% [kernel] [k] mlx4_en_xmit
> >> >> 0.33% [kernel] [k] ___cache_free
> >> >> 0.28% [kernel] [k] tcp_rcv_established
> >> >>
> >> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
> >> >>
> >> >>
> >> >> 48.00% [kernel] [k] mlx4_en_process_rx_cq
> >> >> 12.92% [kernel] [k] napi_gro_frags
> >> >> 7.28% [kernel] [k] inet_gro_receive
> >> >> 7.17% [kernel] [k] tcp_gro_receive
> >> >> 5.10% [kernel] [k] dev_gro_receive
> >> >> 4.87% [kernel] [k] skb_gro_receive
> >> >> 2.45% [kernel] [k] mlx4_en_prepare_rx_desc
> >> >> 2.04% [kernel] [k] __build_skb
> >> >> 1.02% [kernel] [k] napi_reuse_skb.isra.95
> >> >> 1.01% [kernel] [k] tcp4_gro_receive
> >> >> 0.65% [kernel] [k] kmem_cache_alloc
> >> >> 0.45% [kernel] [k] _raw_spin_lock
> >> >>
> >> >> Without the latest patch (the exact patch series v3 I submitted),
> >> >> thus with this atomic_inc() in mlx4_en_process_rx_cq instead of only
> >> >> reads.
> >> >>
> >> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> >> >> Average: eth0 3566768.50 25638.60 1790345.69 1663.51
> >> >> 0.00 0.00 4.50
> >> >>
> >> >> Profiles of the two cpus :
> >> >>
> >> >> 74.85% [kernel] [k] copy_user_enhanced_fast_string
> >> >> 6.42% [kernel] [k] skb_release_data
> >> >> 5.65% [kernel] [k] skb_copy_datagram_iter
> >> >> 1.83% [kernel] [k] copy_page_to_iter
> >> >> 1.59% [kernel] [k] _raw_spin_lock
> >> >> 1.48% [kernel] [k] skb_release_head_state
> >> >> 0.72% [kernel] [k] tcp_transmit_skb
> >> >> 0.68% [kernel] [k] mlx4_en_xmit
> >> >> 0.43% [kernel] [k] page_frag_free
> >> >> 0.38% [kernel] [k] ___cache_free
> >> >> 0.37% [kernel] [k] tcp_established_options
> >> >> 0.37% [kernel] [k] __ip_local_out
> >> >>
> >> >>
> >> >> 37.98% [kernel] [k] mlx4_en_process_rx_cq
> >> >> 26.47% [kernel] [k] napi_gro_frags
> >> >> 7.02% [kernel] [k] inet_gro_receive
> >> >> 5.89% [kernel] [k] tcp_gro_receive
> >> >> 5.17% [kernel] [k] dev_gro_receive
> >> >> 4.80% [kernel] [k] skb_gro_receive
> >> >> 2.61% [kernel] [k] __build_skb
> >> >> 2.45% [kernel] [k] mlx4_en_prepare_rx_desc
> >> >> 1.59% [kernel] [k] napi_reuse_skb.isra.95
> >> >> 0.95% [kernel] [k] tcp4_gro_receive
> >> >> 0.51% [kernel] [k] kmem_cache_alloc
> >> >> 0.42% [kernel] [k] __inet_lookup_established
> >> >> 0.34% [kernel] [k] swiotlb_sync_single_for_cpu
> >> >>
> >> >>
> >> >> So probably this will need further analysis, outside of the scope of
> >> >> this patch series.
> >> >>
> >> >> Could we now please Ack this v3 and merge it ?
> >> >>
> >> >> Thanks.
> >> >
> >> > Thanks Eric.
> >> >
> >> > As the previous series caused hangs, we must run functional regression tests
> >> > over this series as well.
> >> > Run has already started, and results will be available tomorrow morning.
> >> >
> >> > In general, I really like this series. The re-factorization looks more
> >> > elegant and more correct, functionally.
> >> >
> >> > However, performance wise: we fear that the numbers will be drastically
> >> > lower with this transition to order-0 pages,
> >> > because of the (becoming critical) page allocator and dma operations
> >> > bottlenecks, especially on systems with costly
> >> > dma operations, such as ARM, iommu=on, etc...
> >>
> >> So to give you an idea I originally came up with the approach used in
> >> the Intel drivers when I was dealing with a PowerPC system where the
> >> IOMMU was a requirement. With this set up correctly, the map/unmap
> >> calls should be almost non-existent. Basically the only time we
> >> should have to allocate or free a page is if something is sitting on
> >> the pages for an excessively long time or if the interrupt is bouncing
> >> between memory nodes which forces us to free the memory since it is
> >> sitting on the wrong node.
> >>
> >> > We already have this exact issue in mlx5, where we moved to order-0
> >> > allocations with a fixed size cache, but that was not enough.
> >> > Customers of mlx5 have already complained about the performance degradation,
> >> > and currently this is hurting our business.
> >> > We get a clear nack from our performance regression team regarding doing the
> >> > same in mlx4.
> >>
> >> What form of recycling were you doing? If you were doing the offset
> >> based setup then that obviously only gets you so much. The advantage
> >> to using the page count is that you get almost a mobius strip effect
> >> for your buffers and descriptor rings where you just keep flipping the
> >> page offset back and forth via an XOR and the only cost is
> >> dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
> >> do the full allocation, mapping, and unmapping.
> >>
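(For readers following along, the flip described above is roughly the
following - illustrative pseudo-code only, not the actual Intel source;
rx_buf, size and frag_size are placeholders:)

	dma_sync_single_range_for_cpu(dev, rx_buf->dma, rx_buf->page_offset,
				      frag_size, DMA_FROM_DEVICE);
	/* hand the current half of the page to the stack */
	skb_add_rx_frag(skb, 0, rx_buf->page, rx_buf->page_offset,
			size, frag_size);
	get_page(rx_buf->page);			/* keep our own reference */
	rx_buf->page_offset ^= frag_size;	/* flip halves, e.g. 0 <-> 2048 */
	dma_sync_single_range_for_device(dev, rx_buf->dma, rx_buf->page_offset,
					 frag_size, DMA_FROM_DEVICE);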
> >> > So, the question is, can we live with this degradation until those
> >> > bottleneck challenges are addressed?
> >> > Following our perf experts feedback, I cannot just simply Ack. We need to
> >> > have a clear plan to close the perf gap or reduce the impact.
> >>
> >> I think we need to define what is the degradation you are expecting to
> >> see. Using the page count based approach should actually be better in
> >> some cases than the order 3 page and just walking frags approach since
> >> the theoretical reuse is infinite instead of fixed.
> >>
> >> > Internally, I already implemented "dynamic page-cache" and "page-reuse"
> >> > mechanisms in the driver,
> >> > and together they totally bridge the performance gap.
> >> > That's why I would like to hear from Jesper what is the status of his
> >> > page_pool API, it is promising and could totally solve these issues.
> >> >
> >> > Regards,
> >> > Tariq
> >>
> >> The page pool may provide gains but we have to actually see it before
> >> we can guarantee it. If worse comes to worse I might just resort to
> >> standardizing the logic used for the Intel driver page count based
> >> approach. Then if nothing else we would have a standard path for all
> >> the drivers to use if we start going down this route.
> >
> > With this Intel driver page count based recycle approach, the recycle
> > size is tied to the size of the RX ring. As Eric and Tariq discovered.
> > And for other performance reasons (memory footprint of walking RX ring
> > data-structures), don't want to increase the RX ring sizes. Thus, it
> > create two opposite performance needs. That is why I think a more
> > explicit approach with a pool is more attractive.
> >
> > How is this approach doing to work for XDP?
> > (XDP doesn't "share" the page, and in-general we don't want the extra
> > atomic.)
>
> The extra atomic is moot since it is we can get rid of that by doing
> bulk page count updates.
>
> Why can't XDP share a page? You have brought up the user space aspect
> of things repeatedly but the AF_PACKET patches John did more or less
> demonstrated that wasn't the case. What you end up with is you have
> to either be providing pure user space pages or pure kernel pages.
> You can't have pages being swapped between the two without introducing
> security issues. So even with your pool approach it doesn't matter
> which way things are run. Your pool is either device to user space,
> or device to kernel space and it can't be both without creating
> sharing concerns.
(I think we are talking past each other here.)
> > We absolutely need recycling with XDP, when transmitting out another
> > device, and the other devices DMA-TX completion need some way of
> > returning this page.
>
> I fully agree here. However the easiest way of handling this is via
> the page count in my opinion.
>
> > What is basically needed is a standardized callback to allow the remote
> > driver to return the page to the originating driver. As we don't have
> > a NDP for XDP-forward/transmit yet, we could pass this callback as a
> > parameter along with the packet-page to send?
>
> No. This assumes way too much. Most packets aren't going to be
> device to device routing. We have seen this type of thing and
> rejected it multiple times. Don't think driver to driver. This is
> driver to network stack, socket, device, virtual machine, storage,
> etc. The fact is there are many spots where a frame might get
> terminated. [...]
I fully agree that XDP needs to think further than driver to driver.
> This is why I said before what we need to do is have a page destructor
> to handle this sort of thing. The idea is you want to have this work
> everywhere. Having just drivers do this would make it a nice toy but
> completely useless since not too many people are doing
> routing/bridging between interfaces. Using the page destructor it is
> easy to create a pool of "free" pages that you can then pull your DMA
> pages out of or that you can release back into the page allocator.
How is this page destructor different from my page_pool RFC
implementation? (It basically functions as a page destructor...)
Help me understand what I'm missing?
Pointers to page_pool discussions
* page_pool RFC patchset:
- http://lkml.kernel.org/r/20161220132444.18788.50875.stgit@firesoul
- http://lkml.kernel.org/r/20161220132812.18788.20431.stgit@firesoul
- http://lkml.kernel.org/r/20161220132817.18788.64726.stgit@firesoul
- http://lkml.kernel.org/r/20161220132822.18788.19768.stgit@firesoul
- http://lkml.kernel.org/r/20161220132827.18788.8658.stgit@firesoul
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 19:38 ` Jesper Dangaard Brouer
@ 2017-02-14 19:59 ` David Miller
0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2017-02-14 19:59 UTC (permalink / raw)
To: brouer
Cc: ttoukan.linux, edumazet, alexander.duyck, netdev, tariqt, kafai,
saeedm, willemb, bblanco, ast, eric.dumazet, linux-mm
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Tue, 14 Feb 2017 20:38:22 +0100
> On Tue, 14 Feb 2017 12:04:26 -0500 (EST)
> David Miller <davem@davemloft.net> wrote:
>
>> One path I see around all of this is full integration. Meaning that
>> we can free pages into the page allocator which are still DMA mapped.
>> And future allocations from that device are prioritized to take still
>> DMA mapped objects.
>
> I like this idea. Are you saying that this should be done per DMA
> engine or per device?
>
> If this is per device, it is almost the page_pool idea.
Per-device is simplest, at least at first.
Maybe later down the road we can try to pool by "mapping entity" be
that a parent IOMMU or something else like a hypervisor managed
page table.
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 19:02 ` Eric Dumazet
@ 2017-02-14 20:02 ` Jesper Dangaard Brouer
2017-02-14 21:56 ` Eric Dumazet
0 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-14 20:02 UTC (permalink / raw)
To: Eric Dumazet
Cc: Alexander Duyck, Tariq Toukan, David S . Miller, netdev,
Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm,
brouer
On Tue, 14 Feb 2017 11:02:01 -0800
Eric Dumazet <edumazet@google.com> wrote:
> On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
>
> >
> > With this Intel driver page count based recycle approach, the recycle
> > size is tied to the size of the RX ring. As Eric and Tariq discovered.
> > And for other performance reasons (memory footprint of walking RX ring
> > data-structures), don't want to increase the RX ring sizes. Thus, it
> > create two opposite performance needs. That is why I think a more
> > explicit approach with a pool is more attractive.
> >
> > How is this approach doing to work for XDP?
> > (XDP doesn't "share" the page, and in-general we don't want the extra
> > atomic.)
> >
> > We absolutely need recycling with XDP, when transmitting out another
> > device, and the other devices DMA-TX completion need some way of
> > returning this page.
> > What is basically needed is a standardized callback to allow the remote
> > driver to return the page to the originating driver. As we don't have
> > a NDP for XDP-forward/transmit yet, we could pass this callback as a
> > parameter along with the packet-page to send?
> >
> >
>
>
> mlx4 already has a cache for XDP.
> I believe I did not change this part, it still should work.
>
> commit d576acf0a22890cf3f8f7a9b035f1558077f6770
> Author: Brenden Blanco <bblanco@plumgrid.com>
> Date: Tue Jul 19 12:16:52 2016 -0700
>
> net/mlx4_en: add page recycle to prepare rx ring for tx support
This obviously does not work for the case I'm talking about
(transmitting out another device with XDP).
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 20:02 ` Jesper Dangaard Brouer
@ 2017-02-14 21:56 ` Eric Dumazet
0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2017-02-14 21:56 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Alexander Duyck, Tariq Toukan, David S . Miller, netdev,
Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm
>
> This obviously does not work for the case I'm talking about
> (transmitting out another device with XDP).
>
XDP_TX does not handle this yet.
When XDP_TX was added, it was very clear that the transmit _had_ to be
done on the same port.
Since all this discussion happened in this thread (mlx4: use order-0
pages for RX), I was kind of assuming all the comments were relevant to
the current code or patch, not future development ;)
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-14 17:29 ` Tom Herbert
@ 2017-02-15 16:42 ` Tariq Toukan
2017-02-15 16:57 ` Eric Dumazet
0 siblings, 1 reply; 28+ messages in thread
From: Tariq Toukan @ 2017-02-15 16:42 UTC (permalink / raw)
To: Tom Herbert, Eric Dumazet
Cc: Eric Dumazet, Jesper Dangaard Brouer, Alexander Duyck,
David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
Alexei Starovoitov, linux-mm
On 14/02/2017 7:29 PM, Tom Herbert wrote:
> On Tue, Feb 14, 2017 at 7:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:
>>
>>> As the previous series caused hangs, we must run functional regression
>>> tests over this series as well.
>>> Run has already started, and results will be available tomorrow morning.
>>>
>>> In general, I really like this series. The re-factorization looks more
>>> elegant and more correct, functionally.
>>>
>>> However, performance wise: we fear that the numbers will be drastically
>>> lower with this transition to order-0 pages,
>>> because of the (becoming critical) page allocator and dma operations
>>> bottlenecks, especially on systems with costly
>>> dma operations, such as ARM, iommu=on, etc...
>>>
>> So, again, performance after this patch series is higher,
>> once you have sensible RX queue parameters for the expected workload.
>>
>> Only in pathological cases, you might have some regression.
>>
>> The old scheme was _maybe_ better _when_ memory is not fragmented.
>>
>> When you run hosts for months, memory _is_ fragmented.
>>
>> You never see that on benchmarks, unless you force memory being
>> fragmented.
>>
>>
>>
>>> We already have this exact issue in mlx5, where we moved to order-0
>>> allocations with a fixed size cache, but that was not enough.
>>> Customers of mlx5 have already complained about the performance
>>> degradation, and currently this is hurting our business.
>>> We get a clear nack from our performance regression team regarding doing
>>> the same in mlx4.
>>> So, the question is, can we live with this degradation until those
>>> bottleneck challenges are addressed?
>> Again, there is no degradation.
>>
>> We have been using order-0 pages for years at Google.
>>
>> Only when we made the mistake to rebase from the upstream driver and
>> order-3 pages we got horrible regressions, causing production outages.
>>
>> I was silly to believe that mm layer got better.
>>
>>> Following our perf experts feedback, I cannot just simply Ack. We need
>>> to have a clear plan to close the perf gap or reduce the impact.
>> Your perf experts need to talk to me, or any experts at Google and
>> Facebook, really.
>>
> I agree with this 100%! To be blunt, power users like this are testing
> your drivers far beyond what Mellanox is doing and understand how
> performance gains in benchmarks translate to possible gains in real
> production way more than your perf experts can. Listen to Eric!
>
> Tom
>
>
>> Anything _relying_ on order-3 pages being available to impress
>> friends/customers is a lie.
Isn't it the same principle in page_frag_alloc()?
It is called from __netdev_alloc_skb()/__napi_alloc_skb().
Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
By using netdev/napi_alloc_skb, you get an SKB whose linear data is a
frag of a huge page, and it is not going to be freed before the other
non-linear frags.
Can't this cause the same threats (memory pinning and so on)?
Currently, mlx4 doesn't use this generic API, while most other drivers do.
Similar claims are true for TX:
https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81
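(For reference, the generic path I mean is simply something like the
following, assuming the usual RX completion context with a napi_struct
embedded in the cq, as in mlx4:)

	skb = napi_alloc_skb(&cq->napi, packet_len);
	/* for MTU-sized lengths, skb->head is typically carved out of the
	 * shared napi page-frag cache (order-3 on x86,
	 * PAGE_FRAG_CACHE_MAX_ORDER)
	 */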
>>
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-15 16:42 ` Tariq Toukan
@ 2017-02-15 16:57 ` Eric Dumazet
2017-02-16 13:08 ` Tariq Toukan
0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2017-02-15 16:57 UTC (permalink / raw)
To: Tariq Toukan
Cc: Tom Herbert, Eric Dumazet, Jesper Dangaard Brouer,
Alexander Duyck, David S . Miller, netdev, Martin KaFai Lau,
Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
Alexei Starovoitov, linux-mm
On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>
>
> Isn't it the same principle in page_frag_alloc() ?
> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>
> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
This is not ok.
This is a very well known problem; we already mentioned it here in the past,
but at least the core networking stack uses order-0 pages on PowerPC.
mlx4 driver suffers from this problem 100% more than other drivers ;)
One problem at a time Tariq. Right now, only mlx4 has this big problem
compared to other NIC.
Then, if we _still_ hit major issues, we might also need to force
napi_get_frags()
to allocate skb->head using kmalloc() instead of a page frag.
That is a very simple fix.
Remember that skb->truesize is an approximation; it will never be
completely accurate, but we need to make it better.
The mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
is obviously wrong when the host is under memory pressure
(2 frags per page -> truesize should be 2048).
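(I.e., with two frags sharing one order-0 page, the accounting should be
along these lines:)

	truesize = PAGE_SIZE / 2;	/* 2048 bytes pinned per frag, not 1536 */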
> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is a
> frag of a huge page,
> and it is not going to be freed before the other non-linear frags.
> Cannot this cause the same threats (memory pinning and so...)?
>
> Currently, mlx4 doesn't use this generic API, while most other drivers do.
>
> Similar claims are true for TX:
> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81
We do not have such a problem on TX. GFP_KERNEL allocations do not have
the same issues.
Tasks are usually not malicious in our DC, and most serious
applications use memcg or such memory control.
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-15 16:57 ` Eric Dumazet
@ 2017-02-16 13:08 ` Tariq Toukan
2017-02-16 15:47 ` Eric Dumazet
2017-02-16 17:05 ` Tom Herbert
0 siblings, 2 replies; 28+ messages in thread
From: Tariq Toukan @ 2017-02-16 13:08 UTC (permalink / raw)
To: Eric Dumazet, Tariq Toukan, Jesper Dangaard Brouer
Cc: Tom Herbert, Eric Dumazet, Alexander Duyck, David S . Miller,
netdev, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, linux-mm
On 15/02/2017 6:57 PM, Eric Dumazet wrote:
> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>> Isn't it the same principle in page_frag_alloc() ?
>> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>>
>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
> This is not ok.
>
> This is a very well known problem, we already mentioned that here in the past,
> but at least core networking stack uses order-0 pages on PowerPC.
You're right, we should have done this as well in mlx4 on PPC.
> mlx4 driver suffers from this problem 100% more than other drivers ;)
>
> One problem at a time Tariq. Right now, only mlx4 has this big problem
> compared to other NIC.
We _do_ agree that the series improves the driver's quality, stability,
and performance in a fragmented system.
But due to the late rc we're in, and the fact that we know what benchmarks
our customers are going to run, we cannot Ack the series and get it
as-is into kernel 4.11.
We are interested in getting your series merged along with another perf
improvement we are preparing for the next rc1. This way we will gain the
desired stability without breaking existing benchmarks.
I think this is the right thing to do at this point in time.
The idea behind the perf improvement, suggested by Jesper, is to split
the mlx4_en_process_rx_cq() loop called from napi_poll into two.
The first loop extracts completed CQEs and starts prefetching the data
and RX descriptors. The second loop processes the actual packets.
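Roughly along these lines (a sketch only; the helper names are made up
and this is not the actual patch):

	/* pass 1: pull completed CQEs and warm the caches */
	for (polled = 0; polled < budget; polled++) {
		cqe = mlx4_en_next_cqe(cq);		/* invented helper */
		if (!cqe)
			break;
		cqes[polled] = cqe;
		prefetch(rx_desc_for(ring, cqe));	/* RX descriptor  */
		prefetch(frame_data_for(ring, cqe));	/* packet headers */
	}

	/* pass 2: the usual per-packet work, now on warm cachelines */
	for (i = 0; i < polled; i++)
		mlx4_en_process_one_cqe(priv, ring, cqes[i]);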
>
> Then, if we _still_ hit major issues, we might also need to force
> napi_get_frags()
> to allocate skb->head using kmalloc() instead of a page frag.
>
> That is a very simple fix.
>
> Remember that we have skb->truesize that is an approximation, it will
> never be completely accurate,
> but we need to make it better.
>
> mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
> is obviously wrong when host is under memory pressure
> (2 frags per page -> truesize should be 2048)
>
>
>> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is a
>> frag of a huge page,
>> and it is not going to be freed before the other non-linear frags.
>> Cannot this cause the same threats (memory pinning and so...)?
>>
>> Currently, mlx4 doesn't use this generic API, while most other drivers do.
>>
>> Similar claims are true for TX:
>> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81
> We do not have such problem on TX. GFP_KERNEL allocations do not have
> the same issues.
>
> Tasks are usually not malicious in our DC, and most serious
> applications use memcg or such memory control.
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-16 13:08 ` Tariq Toukan
@ 2017-02-16 15:47 ` Eric Dumazet
2017-02-16 17:05 ` Tom Herbert
1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2017-02-16 15:47 UTC (permalink / raw)
To: Tariq Toukan
Cc: Eric Dumazet, Jesper Dangaard Brouer, Tom Herbert,
Alexander Duyck, David S . Miller, netdev, Martin KaFai Lau,
Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
Alexei Starovoitov, linux-mm
On Thu, 2017-02-16 at 15:08 +0200, Tariq Toukan wrote:
> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
> > On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
> >> Isn't it the same principle in page_frag_alloc() ?
> >> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
> >>
> >> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
> > This is not ok.
> >
> > This is a very well known problem, we already mentioned that here in the past,
> > but at least core networking stack uses order-0 pages on PowerPC.
> You're right, we should have done this as well in mlx4 on PPC.
> > mlx4 driver suffers from this problem 100% more than other drivers ;)
> >
> > One problem at a time Tariq. Right now, only mlx4 has this big problem
> > compared to other NIC.
> We _do_ agree that the series improves the driver's quality, stability,
> and performance in a fragmented system.
>
> But due to the late rc we're in, and the fact that we know what benchmarks
> our customers are going to run, we cannot Ack the series and get it
> as is inside kernel 4.11.
>
> We are interested to get your series merged along another perf improvement
> we are preparing for next rc1. This way we will earn the desired stability
> without breaking existing benchmarks.
> I think this is the right thing to do at this point of time.
>
>
> The idea behind the perf improvement, suggested by Jesper, is to split
> the napi_poll call mlx4_en_process_rx_cq() loop into two.
> The first loop extracts completed CQEs and starts prefetching on data
> and RX descriptors. The second loop process the real packets.
Make sure to resubmit my patches before anything new.
We need to backport them to stable versions, without XDP, without
anything fancy.
And submit what is needed for 4.11, since the current mlx4 driver in
net-next is broken, in case you missed it.
Thanks.
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-16 13:08 ` Tariq Toukan
2017-02-16 15:47 ` Eric Dumazet
@ 2017-02-16 17:05 ` Tom Herbert
2017-02-16 17:11 ` Eric Dumazet
2017-02-16 19:03 ` David Miller
1 sibling, 2 replies; 28+ messages in thread
From: Tom Herbert @ 2017-02-16 17:05 UTC (permalink / raw)
To: Tariq Toukan
Cc: Eric Dumazet, Jesper Dangaard Brouer, Eric Dumazet,
Alexander Duyck, David S . Miller, netdev, Martin KaFai Lau,
Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
Alexei Starovoitov, linux-mm
On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>
> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>
>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>
>>> Isn't it the same principle in page_frag_alloc() ?
>>> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>>>
>>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>
>> This is not ok.
>>
>> This is a very well known problem, we already mentioned that here in the
>> past,
>> but at least core networking stack uses order-0 pages on PowerPC.
>
> You're right, we should have done this as well in mlx4 on PPC.
>>
>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>
>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>> compared to other NIC.
>
> We _do_ agree that the series improves the driver's quality, stability,
> and performance in a fragmented system.
>
> But due to the late rc we're in, and the fact that we know what benchmarks
> our customers are going to run, we cannot Ack the series and get it
> as is inside kernel 4.11.
>
You're admitting that Eric's patches improve driver quality,
stability, and performance but you're not allowing this in the kernel
because "we know what benchmarks our customers are going to run".
Sorry, but that is a weak explanation.
> We are interested to get your series merged along another perf improvement
> we are preparing for next rc1. This way we will earn the desired stability
> without breaking existing benchmarks.
> I think this is the right thing to do at this point of time.
>
>
> The idea behind the perf improvement, suggested by Jesper, is to split
> the napi_poll call mlx4_en_process_rx_cq() loop into two.
> The first loop extracts completed CQEs and starts prefetching on data
> and RX descriptors. The second loop process the real packets.
>
>
>>
>> Then, if we _still_ hit major issues, we might also need to force
>> napi_get_frags()
>> to allocate skb->head using kmalloc() instead of a page frag.
>>
>> That is a very simple fix.
>>
>> Remember that we have skb->truesize that is an approximation, it will
>> never be completely accurate,
>> but we need to make it better.
>>
>> mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
>> is obviously wrong when host is under memory pressure
>> (2 frags per page -> truesize should be 2048)
>>
>>
>>> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is
>>> a
>>> frag of a huge page,
>>> and it is not going to be freed before the other non-linear frags.
>>> Cannot this cause the same threats (memory pinning and so...)?
>>>
>>> Currently, mlx4 doesn't use this generic API, while most other drivers
>>> do.
>>>
>>> Similar claims are true for TX:
>>>
>>> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81
>>
>> We do not have such problem on TX. GFP_KERNEL allocations do not have
>> the same issues.
>>
>> Tasks are usually not malicious in our DC, and most serious
>> applications use memcg or such memory control.
>
>
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-16 17:05 ` Tom Herbert
@ 2017-02-16 17:11 ` Eric Dumazet
2017-02-16 20:49 ` Saeed Mahameed
2017-02-16 19:03 ` David Miller
1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2017-02-16 17:11 UTC (permalink / raw)
To: Tom Herbert
Cc: Tariq Toukan, Jesper Dangaard Brouer, Eric Dumazet,
Alexander Duyck, David S . Miller, netdev, Martin KaFai Lau,
Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
Alexei Starovoitov, linux-mm
> You're admitting that Eric's patches improve driver quality,
> stability, and performance but you're not allowing this in the kernel
> because "we know what benchmarks our customers are going to run".
Note that I do not particularly care if these patches go in 4.11 or 4.12 really.
I already backported them into our 4.3 based kernel.
I guess that we could at least propose the trivial patch for stable releases,
since PowerPC arches really need it.
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index cec59bc264c9ac197048fd7c98bcd5cf25de0efd..0f6d2f3b7d54f51de359d4ccde21f4585e6b7852 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -102,7 +102,8 @@
 /* Use the maximum between 16384 and a single page */
 #define MLX4_EN_ALLOC_SIZE	PAGE_ALIGN(16384)
 
-#define MLX4_EN_ALLOC_PREFER_ORDER	PAGE_ALLOC_COSTLY_ORDER
+#define MLX4_EN_ALLOC_PREFER_ORDER	min_t(int, get_order(32768), \
+					      PAGE_ALLOC_COSTLY_ORDER)
 
 /* Receive fragment sizes; we use at most 3 fragments (for 9600 byte MTU
  * and 4K allocations) */
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-16 17:05 ` Tom Herbert
2017-02-16 17:11 ` Eric Dumazet
@ 2017-02-16 19:03 ` David Miller
2017-02-16 21:06 ` Saeed Mahameed
1 sibling, 1 reply; 28+ messages in thread
From: David Miller @ 2017-02-16 19:03 UTC (permalink / raw)
To: tom
Cc: tariqt, edumazet, brouer, eric.dumazet, alexander.duyck, netdev,
kafai, saeedm, willemb, bblanco, ast, linux-mm
From: Tom Herbert <tom@herbertland.com>
Date: Thu, 16 Feb 2017 09:05:26 -0800
> On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>
>> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>>
>>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>>
>>>> Isn't it the same principle in page_frag_alloc() ?
>>>> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>>>>
>>>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>>
>>> This is not ok.
>>>
>>> This is a very well known problem, we already mentioned that here in the
>>> past,
>>> but at least core networking stack uses order-0 pages on PowerPC.
>>
>> You're right, we should have done this as well in mlx4 on PPC.
>>>
>>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>>
>>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>>> compared to other NIC.
>>
>> We _do_ agree that the series improves the driver's quality, stability,
>> and performance in a fragmented system.
>>
>> But due to the late rc we're in, and the fact that we know what benchmarks
>> our customers are going to run, we cannot Ack the series and get it
>> as is inside kernel 4.11.
>>
> You're admitting that Eric's patches improve driver quality,
> stability, and performance but you're not allowing this in the kernel
> because "we know what benchmarks our customers are going to run".
> Sorry, but that is a weak explanation.
I have to agree with Tom and Eric.
If your customers have gotten into the habit of using metrics which
actually do not represent real life performance, that is a completely
inappropriate reason to not include Eric's changes as-is.
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-16 17:11 ` Eric Dumazet
@ 2017-02-16 20:49 ` Saeed Mahameed
0 siblings, 0 replies; 28+ messages in thread
From: Saeed Mahameed @ 2017-02-16 20:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: Tom Herbert, Tariq Toukan, Jesper Dangaard Brouer, Eric Dumazet,
Alexander Duyck, David S . Miller, netdev, Martin KaFai Lau,
Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
Alexei Starovoitov, linux-mm
On Thu, Feb 16, 2017 at 7:11 PM, Eric Dumazet <edumazet@google.com> wrote:
>> You're admitting that Eric's patches improve driver quality,
>> stability, and performance but you're not allowing this in the kernel
>> because "we know what benchmarks our customers are going to run".
>
> Note that I do not particularly care if these patches go in 4.11 or 4.12 really.
>
> I already backported them into our 4.3 based kernel.
>
> I guess that we could at least propose the trivial patch for stable releases,
> since PowerPC arches really need it.
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index cec59bc264c9ac197048fd7c98bcd5cf25de0efd..0f6d2f3b7d54f51de359d4ccde21f4585e6b7852 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -102,7 +102,8 @@
>  /* Use the maximum between 16384 and a single page */
>  #define MLX4_EN_ALLOC_SIZE	PAGE_ALIGN(16384)
> 
> -#define MLX4_EN_ALLOC_PREFER_ORDER	PAGE_ALLOC_COSTLY_ORDER
> +#define MLX4_EN_ALLOC_PREFER_ORDER	min_t(int, get_order(32768), \
> +					      PAGE_ALLOC_COSTLY_ORDER)
> 
>  /* Receive fragment sizes; we use at most 3 fragments (for 9600 byte MTU
>  * and 4K allocations) */
+1
* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
2017-02-16 19:03 ` David Miller
@ 2017-02-16 21:06 ` Saeed Mahameed
0 siblings, 0 replies; 28+ messages in thread
From: Saeed Mahameed @ 2017-02-16 21:06 UTC (permalink / raw)
To: David Miller
Cc: Tom Herbert, Tariq Toukan, Eric Dumazet, Jesper Dangaard Brouer,
Eric Dumazet, Alexander Duyck, Linux Netdev List,
Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
Brenden Blanco, Alexei Starovoitov, linux-mm
On Thu, Feb 16, 2017 at 9:03 PM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Thu, 16 Feb 2017 09:05:26 -0800
>
>> On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>
>>> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>>>
>>>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>>>
>>>>> Isn't it the same principle in page_frag_alloc() ?
>>>>> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>>>>>
>>>>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>>>
>>>> This is not ok.
>>>>
>>>> This is a very well known problem, we already mentioned that here in the
>>>> past,
>>>> but at least core networking stack uses order-0 pages on PowerPC.
>>>
>>> You're right, we should have done this as well in mlx4 on PPC.
>>>>
>>>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>>>
>>>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>>>> compared to other NIC.
>>>
>>> We _do_ agree that the series improves the driver's quality, stability,
>>> and performance in a fragmented system.
>>>
>>> But due to the late rc we're in, and the fact that we know what benchmarks
>>> our customers are going to run, we cannot Ack the series and get it
>>> as is inside kernel 4.11.
>>>
>> You're admitting that Eric's patches improve driver quality,
>> stability, and performance but you're not allowing this in the kernel
>> because "we know what benchmarks our customers are going to run".
>> Sorry, but that is a weak explanation.
>
> I have to agree with Tom and Eric.
>
> If your customers have gotten into the habit of using metrics which
> actually do not represent real life performance, that is a completely
> inappropriate reason to not include Eric's changes as-is.
Guys, it is not about customers, benchmarks, and great performance numbers.
We agree with you that those patches are good and needed for the long term,
but as we are already at rc8, and although this change is correct, I think
it is a little bit too late to make such a huge change in the core RX engine
of the driver. (The code has already been like this for more kernel releases
than one could count; it will hurt no one to keep it like this for two more
weeks.)
All we ask is a little bit more time - one or two weeks - to test the patches
and evaluate the impact.
As Eric stated, we don't care whether they make it into 4.11 or 4.12; the
idea is to have them in ASAP.
So why not wait for 4.11 (just two more weeks)? Tariq already said that he
will accept the series as-is, and by then we will be smarter and have a
clear plan of the gaps and how to close them.
For the PowerPC page-order issue, Eric already has a simpler suggestion
that I support, and it can easily be sent to net and -stable.
Thread overview: 28+ messages:
[not found] <20170213195858.5215-1-edumazet@google.com>
[not found] ` <20170213195858.5215-9-edumazet@google.com>
[not found] ` <CAKgT0Ufx0Y=9kjLax36Gx4e7Y-A7sKZDNYxgJ9wbCT4_vxHhGA@mail.gmail.com>
[not found] ` <CANn89iLkPB_Dx1L2dFfwOoeXOmPhu_C3OO2yqZi8+Rvjr=-EtA@mail.gmail.com>
[not found] ` <CAKgT0UeB_e_Z7LM1_r=en8JJdgLhoYFstWpCDQN6iawLYZJKDA@mail.gmail.com>
2017-02-14 12:12 ` [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX Jesper Dangaard Brouer
2017-02-14 13:45 ` Eric Dumazet
2017-02-14 14:12 ` Eric Dumazet
2017-02-14 14:56 ` Tariq Toukan
2017-02-14 15:51 ` Eric Dumazet
2017-02-14 16:03 ` Eric Dumazet
2017-02-14 17:29 ` Tom Herbert
2017-02-15 16:42 ` Tariq Toukan
2017-02-15 16:57 ` Eric Dumazet
2017-02-16 13:08 ` Tariq Toukan
2017-02-16 15:47 ` Eric Dumazet
2017-02-16 17:05 ` Tom Herbert
2017-02-16 17:11 ` Eric Dumazet
2017-02-16 20:49 ` Saeed Mahameed
2017-02-16 19:03 ` David Miller
2017-02-16 21:06 ` Saeed Mahameed
2017-02-14 17:04 ` David Miller
2017-02-14 17:17 ` David Laight
2017-02-14 17:22 ` David Miller
2017-02-14 19:38 ` Jesper Dangaard Brouer
2017-02-14 19:59 ` David Miller
2017-02-14 17:29 ` Alexander Duyck
2017-02-14 18:46 ` Jesper Dangaard Brouer
2017-02-14 19:02 ` Eric Dumazet
2017-02-14 20:02 ` Jesper Dangaard Brouer
2017-02-14 21:56 ` Eric Dumazet
2017-02-14 19:06 ` Alexander Duyck
2017-02-14 19:50 ` Jesper Dangaard Brouer