* Major network performance regression in 3.7
@ 2013-01-05 21:49 Willy Tarreau
2013-01-05 23:18 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-05 21:49 UTC (permalink / raw)
To: netdev, linux-kernel
Hi,
I'm observing multiple apparently unrelated network performance
issues in 3.7, to the point that I doubt the problem even comes from
the network stack.
My setup involves 3 machines connected point-to-point with myri
10GE NICs (the middle machine has 2 NICs). The middle machine
normally runs haproxy, the other two run either an HTTP load
generator or a dummy web server :
[ client ] <--------> [ haproxy ] <--------> [ server ]
Usually transferring HTTP objects from the server to the client
via haproxy causes no problem at 10 Gbps for moderately large
objects.
This time I observed that it was not possible to go beyond 6.8 Gbps,
with the whole chain idling a lot. I tried changing the IRQ rate, CPU
affinity, tcp_rmem/tcp_wmem, disabling flow control, etc... the usual
knobs, but nothing managed to go beyond that.
So I removed haproxy from the equation, and simply started the client
on the middle machine. Same issue. I thought about concurrency issues,
so I reduced to a single connection, and nothing changed (usually I
achieve 10G even with a single connection with large enough TCP windows).
I tried to start tcpdump and the transfer immediately stalled and did not
come back after I stopped tcpdump. This was reproducible several times
but not always.
So I first thought about an issue in the myri10ge driver and wanted to
confirm that everything was OK on the middle machine.
I started the server on it and aimed the client at it via the loopback.
The transfer rate was even worse : randomly oscillating between 10 and
100 MB/s ! Normally on the loopback, I get several GB/s here.
Running tcpdump on the loopback showed me several very concerning issues :
1) lots of packets are lost before reaching tcpdump. The trace shows that
these segments are ACKed so they're correctly received, but tcpdump
does not get them. Tcpdump stats at the end report impressive numbers,
around 90% of packets dropped from the capture!
2) ACKs seem to be delivered immediately but do not trigger sending; the
system seems to be running with delayed ACKs, as it waits 40 or 200ms
before restarting. This is visible even in the first round trips :
- connection setup :
18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 2036886615:2036886615(0) win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 126397113:126397113(0) ack 2036886616 win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16
- GET /?s=1g HTTP/1.0
18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 2036886616:2036886738(122) ack 126397114 win 16
- HTTP/1.1 200 OK with the beginning of the response :
18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126397114:126401210(4096) ack 2036886738 win 16
18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 250
==> 200ms pause here
18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 126401210:126463006(61796) ack 2036886738 win 16
==> 40ms pause here
18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 256
18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126463006:126527006(64000) ack 2036886738 win 16
... and so on
My server is using splice() with the SPLICE_F_MORE flag to send data.
I noticed that when not using splice and relying on send(MSG_MORE)
instead, I don't get the issue.
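For reference, the send path boils down to splicing from a pipe into the
socket with the "more data follows" hint set; a minimal sketch of the two
variants being compared (not the exact server code, just the pattern) :

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch only: splice() from a pipe into the socket, hinting that
 * more data follows, versus a plain send() with MSG_MORE. */
static ssize_t send_spliced(int pipe_rd, int sock_fd, size_t len)
{
        return splice(pipe_rd, NULL, sock_fd, NULL, len,
                      SPLICE_F_MOVE | SPLICE_F_MORE);
}

static ssize_t send_copied(int sock_fd, const void *buf, size_t len)
{
        return send(sock_fd, buf, len, MSG_MORE);
}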
3) I wondered if this had something to do with the 64k MTU on the loopback
so I lowered it to 16kB. The performance was even worse (about 5MB/s).
Starting tcpdump managed to make my transfer stall, just like with the
myri10ge. In this last test, I noticed that there were some real drops,
because there were some SACKs :
18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 956153186:956169530(16344) ack 131668746 win 16
18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64
18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957035762:957052106(16344) ack 131668746 win 16
18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703
18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957052106:957099566(47460) ack 131668746 win 16
18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957402550:957418894(16344) ack 131668746 win 16
18:45:17.108119 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957418894 win 1846
18:45:17.312115 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957672806:957689150(16344) ack 131668746 win 16
18:45:17.312117 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957689150 win 2902
18:45:17.516114 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 958962966:958979310(16344) ack 131668746 win 16
18:45:17.516116 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 958979310 win 7941
18:45:17.516150 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 <nop,nop,sack 1 {959405614:959421958}>
18:45:17.516151 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 <nop,nop,sack 1 {959421958:959438302}>
Please note that the Myri card is running with the normal MTU of 1500;
jumbo frames were not used.
From what I could test, only the Tx path seems affected, because after
rebooting the server on 3.5.7, I can transfer at 10 Gbps via the myri10ge
again to the 3.7.1 client. I tried to disable GSO, TSO, etc... but nothing
worked. Also, on 3.5.7, there are almost no drops between the kernel and
tcpdump.
I really suspect that all these issues are related. Maybe something about
send buffer recycling, I don't know. The fact that tcpdump gets very few
packets also makes me think of something related to memory allocation. And
that could perhaps explain why starting it causes the traffic to stall.
Sadly, it will be hard to bisect this bug because I had a hard time running
3.7-rc up to rc5 due to several bugs that affected this workload (and were
fortunately fixed since).
I think it's better to focus only on the loopback issue since it's
easier to reproduce and involves fewer moving parts.
I've just tried to remove the SPLICE_F_MORE flag on the sender but it does
not change anything.
If anyone has any idea of things to test to narrow the problem down, I'm
interested. I have not tried 3.8-rc2 yet, and from reading the archives
there do not seem to be any similar reports. Note that I've been using
3.5.7 as a fallback, but the same lab ran 3.6.6 without any issues 2
months ago, so I'm really confident that this is a 3.7 regression.
Regards,
Willy
* Re: Major network performance regression in 3.7
2013-01-05 21:49 Major network performance regression in 3.7 Willy Tarreau
@ 2013-01-05 23:18 ` Eric Dumazet
2013-01-05 23:29 ` Willy Tarreau
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-05 23:18 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sat, 2013-01-05 at 22:49 +0100, Willy Tarreau wrote:
> Hi,
>
> I'm observing multiple apparently unrelated network performance
> issues in 3.7, to the point that I'm doubting it comes from the
> network stack.
>
> My setup involves 3 machines connected point-to-point with myri
> 10GE NICs (the middle machine has 2 NICs). The middle machine
> normally runs haproxy, the other two run either an HTTP load
> generator or a dummy web server :
>
>
> [ client ] <--------> [ haproxy ] <--------> [ server ]
>
> Usually transferring HTTP objects from the server to the client
> via haproxy causes no problem at 10 Gbps for moderately large
> objects.
>
> This time I observed that it was not possible to go beyond 6.8 Gbps,
> with all the chain idling a lot. I tried to change the IRQ rate, CPU
> affinity, tcp_rmem/tcp_wmem, disabling flow control, etc... the usual
> knobs, nothing managed to go beyond.
>
> So I removed haproxy from the equation, and simply started the client
> on the middle machine. Same issue. I thought about concurrency issues,
> so I reduced to a single connection, and nothing changed (usually I
> achieve 10G even with a single connection with large enough TCP windows).
> I tried to start tcpdump and the transfer immediately stalled and did not
> come back after I stopped tcpdump. This was reproducible several times
> but not always.
>
> So I first thought about an issue in the myri10ge driver and wanted to
> confirm that everything was OK on the middle machine.
>
> I started the server on it and aimed the client at it via the loopback.
> The transfer rate was even worse : randomly oscillating between 10 and
> 100 MB/s ! Normally on the loop back, I get several GB/s here.
>
> Running tcpdump on the loopback showed be several very concerning issues :
>
> 1) lots of packets are lost before reaching tcpdump. The trace shows that
> these segments are ACKed so they're correctly received, but tcpdump
> does not get them. Tcpdump stats at the end report impressive numbers,
> around 90% packet dropped from the capture!
>
> 2) ACKs seem to be immediately delivered but do not trigger sending, the
> system seems to be running with delayed ACKs, as it waits 40 or 200ms
> before restarting, and this is visible even in the first round trips :
>
> - connection setup :
>
> 18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 2036886615:2036886615(0) win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
> 18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 126397113:126397113(0) ack 2036886616 win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
> 18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16
>
> - GET /?s=1g HTTP/1.0
>
> 18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 2036886616:2036886738(122) ack 126397114 win 16
>
> - HTTP/1.1 200 OK with the beginning of the response :
>
> 18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126397114:126401210(4096) ack 2036886738 win 16
> 18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 250
> ==> 200ms pause here
> 18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 126401210:126463006(61796) ack 2036886738 win 16
> ==> 40ms pause here
> 18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 256
> 18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126463006:126527006(64000) ack 2036886738 win 16
>
> ... and so on
>
> My server is using splice() with the SPLICE_F_MORE flag to send data.
> I noticed that not using splice and relying on send(MSG_MORE) instead
> I don't get the issue.
>
> 3) I wondered if this had something to do with the 64k MTU on the loopback
> so I lowered it to 16kB. The performance was even worse (about 5MB/s).
> Starting tcpdump managed to make my transfer stall, just like with the
> myri10ge. In this last test, I noticed that there were some real drops,
> because there were some SACKs :
>
> 18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 956153186:956169530(16344) ack 131668746 win 16
> 18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64
> 18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957035762:957052106(16344) ack 131668746 win 16
> 18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703
> 18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957052106:957099566(47460) ack 131668746 win 16
> 18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957402550:957418894(16344) ack 131668746 win 16
> 18:45:17.108119 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957418894 win 1846
> 18:45:17.312115 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957672806:957689150(16344) ack 131668746 win 16
> 18:45:17.312117 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957689150 win 2902
> 18:45:17.516114 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 958962966:958979310(16344) ack 131668746 win 16
> 18:45:17.516116 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 958979310 win 7941
> 18:45:17.516150 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 <nop,nop,sack 1 {959405614:959421958}>
> 18:45:17.516151 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 <nop,nop,sack 1 {959421958:959438302}>
>
> Please note that the Myri card is running with the normal MTU of 1500,
> jumbo frames were not used.
>
> From what I could test, only the Tx path seems affected, because after
> rebooting the server on 3.5.7, I can transfer at 10 Gbps via the myri10ge
> again to the 3.7.1 client. I tried to disable GSO, TSO, etc... but nothing
> worked. Also, on 3.5.7, there are almost no drops between the kernel and
> tcpdump.
>
> I really suspect that all these issues are related. Maybe something about
> send buffers recycling, I don't know. The fact that tcpdump gets very few
> packets also makes me think about something about memory allocation. And
> that could perhaps explain why starting it causes the traffic to stall.
>
> Sadly, it will be hard to bisect this bug because I had a hard time running
> 3.7-rc up to rc5 due to several bugs that affected this workload (and were
> fortunately fixed since).
>
> I think it's better to only focus on the issue with the loopback since it's
> easier to reproduce and involves less moving blocks.
>
> I've just tried to remove the SPLICE_F_MORE flag on the sender but it does
> not change anything.
>
> If anyone has any idea of things to test to narrow the problem down, I'm
> interested. I have not tried 3.8-rc2 yet, and reading the archives there
> does not seem to be any similar reports. Note that I've been using 3.5.7
> as a fallback, but the same lab has run 3.6.6 without any issues 2 months
> ago, so I'm really confident that this is a 3.7 regression.
>
> Regards,
> Willy
>
> --
Hi Willy, another good finding during the weekend ! ;)
1) This looks like interrupts are spread across multiple CPUs, and this
gives Out Of Order problems with the TCP stack.
2) Another possibility would be that the Myri card/driver doesn't like
high order pages very well.
* Re: Major network performance regression in 3.7
2013-01-05 23:18 ` Eric Dumazet
@ 2013-01-05 23:29 ` Willy Tarreau
2013-01-06 0:02 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-05 23:29 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
Hi Eric,
On Sat, Jan 05, 2013 at 03:18:46PM -0800, Eric Dumazet wrote:
> Hi Willy, another good finding during the week end ! ;)
Yes, I wanted to experiment with TFO and stopped on this :-)
> 1) This looks like interrupts are spreaded on multiple cpus, and this
> give Out Of Order problems with TCP stack.
No, I forgot to mention this, I have tried to bind IRQs to a single
core, with the server either on the same or another one, but the
problem remained.
Also, the loopback is much more affected and doesn't use IRQs. And
BTW tcpdump on the loopback shouldn't drop that many packets (up to
90% even at low rate). I just noticed something: transferring data
using netcat on the loopback doesn't affect tcpdump. So it's likely
only the spliced data that are affected.
> 2) Another possibility would be that Myri card/driver doesnt like very
> well high order pages.
It looks like it has not changed much since 3.6 :-/ I really suspect
something is wrong with memory allocation. I have tried reverting many
patches affecting the mm/ directory just in case but I did not come to
anything useful yet.
I'm continuing to dig.
Thanks,
Willy
* Re: Major network performance regression in 3.7
2013-01-05 23:29 ` Willy Tarreau
@ 2013-01-06 0:02 ` Eric Dumazet
2013-01-06 0:50 ` Willy Tarreau
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 0:02 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 00:29 +0100, Willy Tarreau wrote:
> > 2) Another possibility would be that Myri card/driver doesnt like very
> > well high order pages.
>
> It looks like it has not changed much since 3.6 :-/ I really suspect
> something is wrong with memory allocation. I have tried reverting many
> patches affecting the mm/ directory just in case but I did not come to
> anything useful yet.
>
Hmm, I was referring to the TCP stack now using order-3 pages instead of
order-0 ones.
See commit 5640f7685831e088fe6c2e1f863a6805962f8e81
(net: use a per task frag allocator)
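The rough idea is that sender allocations are now carved out of a larger
compound page instead of one order-0 page per chunk, something along these
lines (a very simplified sketch, not the actual kernel code) :

#include <linux/gfp.h>
#include <linux/mm.h>

/* Very simplified illustration of a per-task frag allocator: refill
 * with an order-3 compound page and hand out fragments from it. The
 * real code also refcounts the page so it can be freed once all the
 * fragments are consumed; that part is omitted here. */
struct frag_cache {
        struct page *page;
        unsigned int offset;
        unsigned int size;
};

static void *frag_alloc(struct frag_cache *fc, unsigned int len)
{
        if (!fc->page || fc->offset + len > fc->size) {
                fc->page = alloc_pages(GFP_KERNEL | __GFP_COMP, 3);
                if (!fc->page)
                        return NULL;
                fc->size = PAGE_SIZE << 3;      /* 32KB with 4KB pages */
                fc->offset = 0;
        }
        fc->offset += len;
        return page_address(fc->page) + fc->offset - len;
}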
Could you please post :
ethtool -S eth0
* Re: Major network performance regression in 3.7
2013-01-06 0:02 ` Eric Dumazet
@ 2013-01-06 0:50 ` Willy Tarreau
2013-01-06 1:21 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 0:50 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sat, Jan 05, 2013 at 04:02:03PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 00:29 +0100, Willy Tarreau wrote:
>
> > > 2) Another possibility would be that Myri card/driver doesnt like very
> > > well high order pages.
> >
> > It looks like it has not changed much since 3.6 :-/ I really suspect
> > something is wrong with memory allocation. I have tried reverting many
> > patches affecting the mm/ directory just in case but I did not come to
> > anything useful yet.
> >
>
> Hmm, I was referring to TCP stack now using order-3 pages instead of
> order-0 ones
>
> See commit 5640f7685831e088fe6c2e1f863a6805962f8e81
> (net: use a per task frag allocator)
OK, so you think there are two distinct problems ?
I have tried to revert this one but it did not change the performance, I'm
still saturating at ~6.9 Gbps.
> Could you please post :
>
> ethtool -S eth0
Yes, I've removed all zero counters in this short view for easier
reading (complete version appended at the end of this email). This
was after around 140 GB were transferred :
# ethtool -S eth1|grep -vw 0
NIC statistics:
rx_packets: 8001500
tx_packets: 10015409
rx_bytes: 480115998
tx_bytes: 148825674976
tx_boundary: 2048
WC: 1
irq: 45
MSI: 1
read_dma_bw_MBs: 1200
write_dma_bw_MBs: 1614
read_write_dma_bw_MBs: 2101
serial_number: 320061
link_changes: 2
link_up: 1
tx_pkt_start: 10015409
tx_pkt_done: 10015409
tx_req: 93407411
tx_done: 93407411
rx_small_cnt: 8001500
wake_queue: 187727
stop_queue: 187727
LRO aggregated: 146
LRO flushed: 146
LRO avg aggr: 1
LRO no_desc: 80
Quite honestly, this is typically the pattern I'm used to observing
here. I'm now trying to bisect; hopefully we'll get something usable.
Cheers,
Willy
----- full ethtool -S ----
NIC statistics:
rx_packets: 8001500
tx_packets: 10015409
rx_bytes: 480115998
tx_bytes: 148825674976
rx_errors: 0
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
multicast: 0
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_boundary: 2048
WC: 1
irq: 45
MSI: 1
MSIX: 0
read_dma_bw_MBs: 1200
write_dma_bw_MBs: 1614
read_write_dma_bw_MBs: 2101
serial_number: 320061
watchdog_resets: 0
link_changes: 2
link_up: 1
dropped_link_overflow: 0
dropped_link_error_or_filtered: 0
dropped_pause: 0
dropped_bad_phy: 0
dropped_bad_crc32: 0
dropped_unicast_filtered: 0
dropped_multicast_filtered: 0
dropped_runt: 0
dropped_overrun: 0
dropped_no_small_buffer: 0
dropped_no_big_buffer: 0
----------- slice ---------: 0
tx_pkt_start: 10015409
tx_pkt_done: 10015409
tx_req: 93407411
tx_done: 93407411
rx_small_cnt: 8001500
rx_big_cnt: 0
wake_queue: 187727
stop_queue: 187727
tx_linearized: 0
LRO aggregated: 146
LRO flushed: 146
LRO avg aggr: 1
LRO no_desc: 80
* Re: Major network performance regression in 3.7
2013-01-06 0:50 ` Willy Tarreau
@ 2013-01-06 1:21 ` Eric Dumazet
2013-01-06 1:30 ` Willy Tarreau
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 1:21 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 01:50 +0100, Willy Tarreau wrote:
> Yes, I've removed all zero counters in this short view for easier
> reading (complete version appended at the end of this email). This
> was after around 140 GB were transferred :
OK, I only wanted to make sure skbs were not linearized in xmit.
Could you try to disable CONFIG_COMPACTION ?
( This is the other thread mentioning this : "ppoll() stuck on POLLIN
while TCP peer is sending" )
* Re: Major network performance regression in 3.7
2013-01-06 1:21 ` Eric Dumazet
@ 2013-01-06 1:30 ` Willy Tarreau
2013-01-06 1:40 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 1:30 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sat, Jan 05, 2013 at 05:21:16PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 01:50 +0100, Willy Tarreau wrote:
>
> > Yes, I've removed all zero counters in this short view for easier
> > reading (complete version appended at the end of this email). This
> > was after around 140 GB were transferred :
>
> OK I only wanted to make sure skb were not linearized in xmit.
>
> Could you try to disable CONFIG_COMPACTION ?
It's already disabled.
> ( This is the other thread mentioning this : "ppoll() stuck on POLLIN
> while TCP peer is sending" )
Ah interesting because these were some of the mm patches that I had
tried to revert.
Willy
* Re: Major network performance regression in 3.7
2013-01-06 1:30 ` Willy Tarreau
@ 2013-01-06 1:40 ` Eric Dumazet
2013-01-06 1:51 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 1:40 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> Ah interesting because these were some of the mm patches that I had
> tried to revert.
Hmm, or we should fix __skb_splice_bits()
I'll send a patch.
* Re: Major network performance regression in 3.7
2013-01-06 1:40 ` Eric Dumazet
@ 2013-01-06 1:51 ` Eric Dumazet
2013-01-06 2:16 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 1:51 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
>
> > Ah interesting because these were some of the mm patches that I had
> > tried to revert.
>
> Hmm, or we should fix __skb_splice_bits()
>
> I'll send a patch.
>
Could you try the following ?
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ab989b..c5246be 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1768,14 +1768,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
struct splice_pipe_desc *spd, struct sock *sk)
{
int seg;
+ struct page *page = virt_to_page(skb->data);
+ unsigned int poff = skb->data - (unsigned char *)page_address(page);
/* map the linear part :
* If skb->head_frag is set, this 'linear' part is backed by a
* fragment, and if the head is not shared with any clones then
* we can avoid a copy since we own the head portion of this page.
*/
- if (__splice_segment(virt_to_page(skb->data),
- (unsigned long) skb->data & (PAGE_SIZE - 1),
+ if (__splice_segment(page, poff,
skb_headlen(skb),
offset, len, skb, spd,
skb_head_is_locked(skb),
* Re: Major network performance regression in 3.7
2013-01-06 1:51 ` Eric Dumazet
@ 2013-01-06 2:16 ` Eric Dumazet
2013-01-06 2:18 ` Willy Tarreau
2013-01-06 2:52 ` Willy Tarreau
0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 2:16 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> >
> > > Ah interesting because these were some of the mm patches that I had
> > > tried to revert.
> >
> > Hmm, or we should fix __skb_splice_bits()
> >
> > I'll send a patch.
> >
>
> Could you try the following ?
Or more exactly...
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ab989b..01f222c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, unsigned int poff,
return false;
}
- /* ignore any bits we already processed */
- if (*off) {
- __segment_seek(&page, &poff, &plen, *off);
- *off = 0;
- }
+ __segment_seek(&page, &poff, &plen, *off);
+ *off = 0;
do {
unsigned int flen = min(*len, plen);
@@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
struct splice_pipe_desc *spd, struct sock *sk)
{
int seg;
+ struct page *page = virt_to_page(skb->data);
+ unsigned int poff = skb->data - (unsigned char *)page_address(page);
/* map the linear part :
* If skb->head_frag is set, this 'linear' part is backed by a
* fragment, and if the head is not shared with any clones then
* we can avoid a copy since we own the head portion of this page.
*/
- if (__splice_segment(virt_to_page(skb->data),
- (unsigned long) skb->data & (PAGE_SIZE - 1),
+ if (__splice_segment(page, poff,
skb_headlen(skb),
offset, len, skb, spd,
skb_head_is_locked(skb),
* Re: Major network performance regression in 3.7
2013-01-06 2:16 ` Eric Dumazet
@ 2013-01-06 2:18 ` Willy Tarreau
2013-01-06 2:22 ` Eric Dumazet
2013-01-06 2:52 ` Willy Tarreau
1 sibling, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 2:18 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > >
> > > > Ah interesting because these were some of the mm patches that I had
> > > > tried to revert.
> > >
> > > Hmm, or we should fix __skb_splice_bits()
> > >
> > > I'll send a patch.
> > >
> >
> > Could you try the following ?
>
> Or more exactly...
The first one did not change an iota, unfortunately. I'm about to
pin down the commit causing the loopback regression. It's a few patches
before the first one you pointed to. It's almost finished and I'll test
your patch below immediately after.
Thanks,
Willy
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3ab989b..01f222c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, unsigned int poff,
> return false;
> }
>
> - /* ignore any bits we already processed */
> - if (*off) {
> - __segment_seek(&page, &poff, &plen, *off);
> - *off = 0;
> - }
> + __segment_seek(&page, &poff, &plen, *off);
> + *off = 0;
>
> do {
> unsigned int flen = min(*len, plen);
> @@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
> struct splice_pipe_desc *spd, struct sock *sk)
> {
> int seg;
> + struct page *page = virt_to_page(skb->data);
> + unsigned int poff = skb->data - (unsigned char *)page_address(page);
>
> /* map the linear part :
> * If skb->head_frag is set, this 'linear' part is backed by a
> * fragment, and if the head is not shared with any clones then
> * we can avoid a copy since we own the head portion of this page.
> */
> - if (__splice_segment(virt_to_page(skb->data),
> - (unsigned long) skb->data & (PAGE_SIZE - 1),
> + if (__splice_segment(page, poff,
> skb_headlen(skb),
> offset, len, skb, spd,
> skb_head_is_locked(skb),
>
* Re: Major network performance regression in 3.7
2013-01-06 2:18 ` Willy Tarreau
@ 2013-01-06 2:22 ` Eric Dumazet
2013-01-06 2:32 ` Willy Tarreau
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 2:22 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 03:18 +0100, Willy Tarreau wrote:
> On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > > >
> > > > > Ah interesting because these were some of the mm patches that I had
> > > > > tried to revert.
> > > >
> > > > Hmm, or we should fix __skb_splice_bits()
> > > >
> > > > I'll send a patch.
> > > >
> > >
> > > Could you try the following ?
> >
> > Or more exactly...
>
> The first one did not change a iota unfortunately. I'm about to
> spot the commit causing the loopback regression. It's a few patches
> before the first one you pointed. It's almost finished and I test
> your patch below immediately after.
I bet you are going to find commit
69b08f62e17439ee3d436faf0b9a7ca6fffb78db
(net: use bigger pages in __netdev_alloc_frag )
Am I wrong ?
* Re: Major network performance regression in 3.7
2013-01-06 2:22 ` Eric Dumazet
@ 2013-01-06 2:32 ` Willy Tarreau
2013-01-06 2:44 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 2:32 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sat, Jan 05, 2013 at 06:22:13PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 03:18 +0100, Willy Tarreau wrote:
> > On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> > > On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > > > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > > > >
> > > > > > Ah interesting because these were some of the mm patches that I had
> > > > > > tried to revert.
> > > > >
> > > > > Hmm, or we should fix __skb_splice_bits()
> > > > >
> > > > > I'll send a patch.
> > > > >
> > > >
> > > > Could you try the following ?
> > >
> > > Or more exactly...
> >
> > The first one did not change a iota unfortunately. I'm about to
> > spot the commit causing the loopback regression. It's a few patches
> > before the first one you pointed. It's almost finished and I test
> > your patch below immediately after.
>
> I bet you are going to find commit
> 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
> (net: use bigger pages in __netdev_alloc_frag )
>
> Am I wrong ?
Yes, this time you guessed wrong :-) Well, maybe it's contributing
to the issue.
It's 0cf833ae (net: loopback: set default mtu to 64K). And I could
reproduce it with 3.6 by setting loopback's MTU to 65536 by hand.
The trick is that once the MTU has been set to this large a value,
even when I set it back to 16kB the problem persists.
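(For the record, setting the MTU by hand is just the usual SIOCSIFMTU
ioctl, i.e. what ifconfig does under the hood; a minimal illustration :)

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <unistd.h>

/* Illustration only: raise the loopback MTU to 65536, the equivalent
 * of "ifconfig lo mtu 65536". */
int main(void)
{
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "lo", IFNAMSIZ - 1);
        ifr.ifr_mtu = 65536;
        if (ioctl(fd, SIOCSIFMTU, &ifr) < 0)
                perror("SIOCSIFMTU");
        close(fd);
        return 0;
}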
Now I'm retrying your other patch to see if it brings the 10GE back
to full speed.
Willy
* Re: Major network performance regression in 3.7
2013-01-06 2:32 ` Willy Tarreau
@ 2013-01-06 2:44 ` Eric Dumazet
0 siblings, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 2:44 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 03:32 +0100, Willy Tarreau wrote:
> It's 0cf833ae (net: loopback: set default mtu to 64K). And I could
> reproduce it with 3.6 by setting loopback's MTU to 65536 by hand.
> The trick is that once the MTU has been set to this large a value,
> even when I set it back to 16kB the problem persists.
>
Well, this MTU change can uncover a prior bug, or make it happen faster,
for sure.
* Re: Major network performance regression in 3.7
2013-01-06 2:16 ` Eric Dumazet
2013-01-06 2:18 ` Willy Tarreau
@ 2013-01-06 2:52 ` Willy Tarreau
2013-01-06 7:31 ` [PATCH net-next] net: splice: avoid high order page splitting Eric Dumazet
2013-01-06 7:35 ` Major network performance regression in 3.7 Eric Dumazet
1 sibling, 2 replies; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 2:52 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > >
> > > > Ah interesting because these were some of the mm patches that I had
> > > > tried to revert.
> > >
> > > Hmm, or we should fix __skb_splice_bits()
> > >
> > > I'll send a patch.
> > >
> >
> > Could you try the following ?
>
> Or more exactly...
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3ab989b..01f222c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, unsigned int poff,
> return false;
> }
>
> - /* ignore any bits we already processed */
> - if (*off) {
> - __segment_seek(&page, &poff, &plen, *off);
> - *off = 0;
> - }
> + __segment_seek(&page, &poff, &plen, *off);
> + *off = 0;
>
> do {
> unsigned int flen = min(*len, plen);
> @@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
> struct splice_pipe_desc *spd, struct sock *sk)
> {
> int seg;
> + struct page *page = virt_to_page(skb->data);
> + unsigned int poff = skb->data - (unsigned char *)page_address(page);
>
> /* map the linear part :
> * If skb->head_frag is set, this 'linear' part is backed by a
> * fragment, and if the head is not shared with any clones then
> * we can avoid a copy since we own the head portion of this page.
> */
> - if (__splice_segment(virt_to_page(skb->data),
> - (unsigned long) skb->data & (PAGE_SIZE - 1),
> + if (__splice_segment(page, poff,
> skb_headlen(skb),
> offset, len, skb, spd,
> skb_head_is_locked(skb),
>
OK so I observed no change with this patch, either on the loopback
data rate at >16kB MTU, or on the myri. I'm keeping it at hand for
experimentation anyway.
Concerning the loopback MTU, I find it strange that the MTU changes
the splice() behaviour and not send/recv. I thought that there could
be a relation between the MTU and the pipe size, but it does not
appear to be the case either, as I tried various sizes between 16kB
and 256kB without recovering the original performance.
I've started to bisect the 10GE issue again (since both issues are
unrelated), but I'll finish tomorrow, it's time to get some sleep
now.
Best regards,
Willy
* [PATCH net-next] net: splice: avoid high order page splitting
2013-01-06 2:52 ` Willy Tarreau
@ 2013-01-06 7:31 ` Eric Dumazet
2013-01-07 5:07 ` David Miller
2013-01-06 7:35 ` Major network performance regression in 3.7 Eric Dumazet
1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 7:31 UTC (permalink / raw)
To: Willy Tarreau, David Miller; +Cc: netdev
From: Eric Dumazet <edumazet@google.com>
splice() can handle pages of any order, but network code tries hard to
split them in PAGE_SIZE units. Not quite successfully anyway, as
__splice_segment() assumed poff < PAGE_SIZE. This is true for
the skb->data part, not necessarily for the fragments.
This patch removes this logic to give the pages as they are in the skb.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
---
net/core/skbuff.c | 38 +++++++++-----------------------------
1 file changed, 9 insertions(+), 29 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bc96100..b03fc0c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1707,20 +1707,6 @@ static bool spd_fill_page(struct splice_pipe_desc *spd,
return false;
}
-static inline void __segment_seek(struct page **page, unsigned int *poff,
- unsigned int *plen, unsigned int off)
-{
- unsigned long n;
-
- *poff += off;
- n = *poff / PAGE_SIZE;
- if (n)
- *page = nth_page(*page, n);
-
- *poff = *poff % PAGE_SIZE;
- *plen -= off;
-}
-
static bool __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
@@ -1728,6 +1714,8 @@ static bool __splice_segment(struct page *page, unsigned int poff,
struct sock *sk,
struct pipe_inode_info *pipe)
{
+ unsigned int flen;
+
if (!*len)
return true;
@@ -1738,24 +1726,16 @@ static bool __splice_segment(struct page *page, unsigned int poff,
}
/* ignore any bits we already processed */
- if (*off) {
- __segment_seek(&page, &poff, &plen, *off);
- *off = 0;
- }
-
- do {
- unsigned int flen = min(*len, plen);
+ poff += *off;
+ plen -= *off;
+ *off = 0;
- /* the linear region may spread across several pages */
- flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
+ flen = min(*len, plen);
- if (spd_fill_page(spd, pipe, page, &flen, poff, skb, linear, sk))
- return true;
-
- __segment_seek(&page, &poff, &plen, flen);
- *len -= flen;
+ if (spd_fill_page(spd, pipe, page, &flen, poff, skb, linear, sk))
+ return true;
- } while (*len && plen);
+ *len -= flen;
return false;
}
* Re: Major network performance regression in 3.7
2013-01-06 2:52 ` Willy Tarreau
2013-01-06 7:31 ` [PATCH net-next] net: splice: avoid high order page splitting Eric Dumazet
@ 2013-01-06 7:35 ` Eric Dumazet
2013-01-06 9:24 ` Willy Tarreau
1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 7:35 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 03:52 +0100, Willy Tarreau wrote:
> OK so I observed no change with this patch, either on the loopback
> data rate at >16kB MTU, or on the myri. I'm keeping it at hand for
> experimentation anyway.
>
Yeah, there was no bug. I rewrote it for net-next as a cleanup/optim
only.
> Concerning the loopback MTU, I find it strange that the MTU changes
> the splice() behaviour and not send/recv. I thought that there could
> be a relation between the MTU and the pipe size, but it does not
> appear to be the case either, as I tried various sizes between 16kB
> and 256kB without achieving original performance.
It is probably related to a too-small receive window; given that the MTU
was multiplied by 4, I guess we need to make some adjustments.
You also could try :
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ca2536..b68cdfb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1482,6 +1482,9 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
break;
}
used = recv_actor(desc, skb, offset, len);
+ /* Clean up data we have read: This will do ACK frames. */
+ if (used > 0)
+ tcp_cleanup_rbuf(sk, used);
if (used < 0) {
if (!copied)
copied = used;
* Re: Major network performance regression in 3.7
2013-01-06 7:35 ` Major network performance regression in 3.7 Eric Dumazet
@ 2013-01-06 9:24 ` Willy Tarreau
2013-01-06 10:25 ` Willy Tarreau
2013-01-06 14:59 ` Eric Dumazet
0 siblings, 2 replies; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 9:24 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sat, Jan 05, 2013 at 11:35:24PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 03:52 +0100, Willy Tarreau wrote:
>
> > OK so I observed no change with this patch, either on the loopback
> > data rate at >16kB MTU, or on the myri. I'm keeping it at hand for
> > experimentation anyway.
> >
>
> Yeah, there was no bug. I rewrote it for net-next as a cleanup/optim
> only.
I have re-applied your last rewrite and noticed a small but nice
performance improvement on a single stream over the loopback :
                       1 session    10 sessions
- without the patch :  55.8 Gbps    68.4 Gbps
- with the patch    :  56.4 Gbps    70.4 Gbps
This was with the loopback reverted to 16kB MTU of course.
> > Concerning the loopback MTU, I find it strange that the MTU changes
> > the splice() behaviour and not send/recv. I thought that there could
> > be a relation between the MTU and the pipe size, but it does not
> > appear to be the case either, as I tried various sizes between 16kB
> > and 256kB without achieving original performance.
>
>
> It probably is related to a too small receive window, given the MTU was
> multiplied by 4, I guess we need to make some adjustments
In fact even if I set it to 32kB it breaks.
I have tried to progressively increase the loopback's MTU from the default
16436, by steps of 4096 :
          tcp_rmem = 256 kB            tcp_rmem = 256 kB
          pipe size = 64 kB            pipe size = 256 kB
16436 :   55.8 Gbps                    65.2 Gbps
20532 :   32..48 Gbps unstable         24..45 Gbps unstable
24628 :   56.0 Gbps                    66.4 Gbps
28724 :   58.6 Gbps                    67.8 Gbps
32820 :   54.5 Gbps                    61.7 Gbps
36916 :   56.8 Gbps                    65.5 Gbps
41012 :   57.8..58.2 Gbps ~stable      67.5..68.8 Gbps ~stable
45108 :   59.4 Gbps                    70.0 Gbps
49204 :   61.2 Gbps                    71.1 Gbps
53300 :   58.8 Gbps                    70.6 Gbps
57396 :   60.2 Gbps                    70.8 Gbps
61492 :   61.4 Gbps                    71.1 Gbps

          tcp_rmem = 1 MB              tcp_rmem = 1 MB
          pipe size = 64 kB            pipe size = 256 kB
16436 :   16..34 Gbps unstable         49.5 or 65.2 Gbps (unstable)
20532 :   7..15 Gbps unstable          15..32 Gbps unstable
24628 :   36..48 Gbps unstable         34..61 Gbps unstable
28724 :   40..51 Gbps unstable         40..69 Gbps unstable
32820 :   40..55 Gbps unstable         59.9..62.3 Gbps ~stable
36916 :   38..51 Gbps unstable         66.0 Gbps
41012 :   30..42 Gbps unstable         47..66 Gbps unstable
45108 :   59.5 Gbps                    71.2 Gbps
49204 :   61.3 Gbps                    74.0 Gbps
53300 :   63.1 Gbps                    74.5 Gbps
57396 :   64.6 Gbps                    74.7 Gbps
61492 :   61..66 Gbps unstable         76.5 Gbps
So as long as we keep the MTU at n*4096 + 52 (the default 16436 being
4*4096 + 52), performance is still almost OK. It is interesting to see
that the transfer rate is unstable at many values and that it depends
both on the rmem and the pipe size, just as if some segments sometimes
remained stuck for too long.
And if I pick a value which does not match n*4096+52, such as
61492+2048 = 63540, then the transfer falls to about 50-100 Mbps again.
So there's clearly something related to the copy of segments from
incomplete pages instead of passing them via the pipe.
It is possible that this bug has been there for a long time and that
we never detected it because nobody plays with the loopback MTU.
I have tried with 2.6.35 :
16436 : 31..33 Gbps
61492 : 48..50 Gbps
63540 : 50..53 Gbps => so at least it's not affected
Even forcing the MTU to 16384 maintains 30..33 Gbps almost stable.
On 3.5.7.2 :
16436 : 23..27 Gbps
61492 : 61..64 Gbps
63540 : 40..100 Mbps => the problem was already there.
Since there were many splice changes in 3.5, I'd suspect that the issue
appeared there though I could be wrong.
> You also could try :
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1ca2536..b68cdfb 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1482,6 +1482,9 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> break;
> }
> used = recv_actor(desc, skb, offset, len);
> + /* Clean up data we have read: This will do ACK frames. */
> + if (used > 0)
> + tcp_cleanup_rbuf(sk, used);
> if (used < 0) {
> if (!copied)
> copied = used;
It does not change anything in the tests above, unfortunately. It did
not even stabilize the unstable runs.
I'll check if I can spot the original commit which caused the regression
for MTUs that are not n*4096+52.
But before that I'll try to find the recent one causing the myri10ge to
slow down, it should take less time to bisect.
Regards,
Willy
* Re: Major network performance regression in 3.7
2013-01-06 9:24 ` Willy Tarreau
@ 2013-01-06 10:25 ` Willy Tarreau
2013-01-06 11:46 ` Romain Francoise
2013-01-06 12:01 ` Willy Tarreau
2013-01-06 14:59 ` Eric Dumazet
1 sibling, 2 replies; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 10:25 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sun, Jan 06, 2013 at 10:24:35AM +0100, Willy Tarreau wrote:
> But before that I'll try to find the recent one causing the myri10ge to
> slow down, it should take less time to bisect.
OK, good news here: the performance drop on the myri was caused by a
problem between the keyboard and the chair. After the reboot series,
I forgot to reload the firmware so the driver used the less efficient
firmware from the NIC (it performs just as if LRO is disabled).
That makes me think that I should try 3.8-rc2 since LRO was removed
there :-/
The only remaining issue really is the loopback then.
Cheers,
Willy
* Re: Major network performance regression in 3.7
2013-01-06 10:25 ` Willy Tarreau
@ 2013-01-06 11:46 ` Romain Francoise
2013-01-06 11:53 ` Willy Tarreau
2013-01-06 12:01 ` Willy Tarreau
1 sibling, 1 reply; 41+ messages in thread
From: Romain Francoise @ 2013-01-06 11:46 UTC (permalink / raw)
To: Willy Tarreau; +Cc: Eric Dumazet, netdev, linux-kernel
Willy Tarreau <w@1wt.eu> writes:
> That makes me think that I should try 3.8-rc2 since LRO was removed
> there :-/
Better yet, find a way to automate these tests so they can run continually
against net-next and find problems early...
* Re: Major network performance regression in 3.7
2013-01-06 11:46 ` Romain Francoise
@ 2013-01-06 11:53 ` Willy Tarreau
0 siblings, 0 replies; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 11:53 UTC (permalink / raw)
To: Romain Francoise; +Cc: Eric Dumazet, netdev, linux-kernel
On Sun, Jan 06, 2013 at 12:46:58PM +0100, Romain Francoise wrote:
> Willy Tarreau <w@1wt.eu> writes:
>
> > That makes me think that I should try 3.8-rc2 since LRO was removed
> > there :-/
>
> Better yet, find a way to automate these tests so they can run continually
> against net-next and find problems early...
There is no way scripts will plug cables and turn on sleeping hardware
unfortunately. I'm already following network updates closely enough to
spot occasional regressions that are naturally expected due to the number
of changes.
Also, automated tests won't easily report a behaviour analysis, and
behaviour is important in networking. You don't want to accept 100ms
pauses all the time for example (and that's just an example).
Right now my lab is simplified enough so that I can test something like
100 patches in a week-end, I think that's already fine.
Willy
* Re: Major network performance regression in 3.7
2013-01-06 10:25 ` Willy Tarreau
2013-01-06 11:46 ` Romain Francoise
@ 2013-01-06 12:01 ` Willy Tarreau
1 sibling, 0 replies; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 12:01 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel, Andrew Gallatin
On Sun, Jan 06, 2013 at 11:25:25AM +0100, Willy Tarreau wrote:
> OK good news here, the performance drop on the myri was caused by a
> problem between the keyboard and the chair. After the reboot series,
> I forgot to reload the firmware so the driver used the less efficient
> firmware from the NIC (it performs just as if LRO is disabled).
>
> That makes me think that I should try 3.8-rc2 since LRO was removed
> there :-/
Just for the record, I tested 3.8-rc2, and the myri works as fast with
GRO there as it used to work with LRO in previous kernels. The softirq
work has increased from 26% to 48% but there is no performance drop when
using GRO anymore. Andrew has done a good job !
Willy
* Re: Major network performance regression in 3.7
2013-01-06 9:24 ` Willy Tarreau
2013-01-06 10:25 ` Willy Tarreau
@ 2013-01-06 14:59 ` Eric Dumazet
2013-01-06 15:51 ` Willy Tarreau
1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 14:59 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 10:24 +0100, Willy Tarreau wrote:
> It does not change anything to the tests above unfortunately. It did not
> even stabilize the unstable runs.
>
> I'll check if I can spot the original commit which caused the regression
> for MTUs that are not n*4096+52.
Since you didn't post your program, I won't be able to help just by
guessing what it does...
TCP has very low defaults concerning initial window, and it appears you
set RCVBUF to even smaller values.
Here we can see "win 8030"; this is not a sane value...
18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 2036886615:2036886615(0) win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 126397113:126397113(0) ack 2036886616 win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
So you apparently changed /proc/sys/net/ipv4/tcp_rmem or SO_RCVBUF ?
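(i.e. something along these lines on the receiving socket, just to
illustrate the kind of setting that shrinks the advertised window :)

#include <sys/socket.h>

/* Illustration only: a small SO_RCVBUF caps the socket receive
 * buffer, and therefore the window TCP can advertise to the peer. */
static int shrink_rcvbuf(int sock_fd, int bytes)
{
        return setsockopt(sock_fd, SOL_SOCKET, SO_RCVBUF,
                          &bytes, sizeof(bytes));
}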
* Re: Major network performance regression in 3.7
2013-01-06 14:59 ` Eric Dumazet
@ 2013-01-06 15:51 ` Willy Tarreau
2013-01-06 16:39 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 15:51 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
Hi Eric,
On Sun, Jan 06, 2013 at 06:59:02AM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 10:24 +0100, Willy Tarreau wrote:
>
> > It does not change anything to the tests above unfortunately. It did not
> > even stabilize the unstable runs.
> >
> > I'll check if I can spot the original commit which caused the regression
> > for MTUs that are not n*4096+52.
>
> Since you don't post your program, I wont be able to help, just by
> guessing what it does...
Oh sorry, I didn't really want to pollute the list with links and configs,
especially during the initial report with various combined issues :-(
The client is my old "inject" tool, available here :
http://git.1wt.eu/web?p=inject.git
The server is my "httpterm" tool, available here :
http://git.1wt.eu/web?p=httpterm.git
Use "-O3 -DENABLE_POLL -DENABLE_EPOLL -DENABLE_SPLICE" for CFLAGS.
I'm starting httpterm this way :
httpterm -D -L :8000 -P 256
=> it starts a server on port 8000, and sets pipe size to 256 kB. It
uses SPLICE_F_MORE on output data but removing it did not fix the
issue in one of the early tests.
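(Presumably -P just enlarges the pipe used for splicing via
F_SETPIPE_SZ; a minimal sketch of that call, under the assumption that
this is what the option does :)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch only (assumption about what -P 256 does): grow a pipe to
 * 256 kB with F_SETPIPE_SZ, available since Linux 2.6.35. */
int main(void)
{
        int pfd[2];

        if (pipe(pfd) < 0)
                return 1;
        if (fcntl(pfd[0], F_SETPIPE_SZ, 256 * 1024) < 0)
                perror("F_SETPIPE_SZ");
        printf("pipe size: %d\n", fcntl(pfd[0], F_GETPIPE_SZ));
        return 0;
}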
Then I'm starting inject this way :
inject -o 1 -u 1 -G 0:8000/?s=1g
=> 1 user, 1 object at a time, and fetch /?s=1g from the loopback.
The server will then emit 1 GB of data using splice().
It's possible to disable splicing on the server using -dS. The client
"eats" data using recv(MSG_TRUNC) to avoid a useless copy.
> TCP has very low defaults concerning initial window, and it appears you
> set RCVBUF to even smaller values.
Yes, you're right, my bootup scripts still change the default value, though
I increase them to larger values during the tests (except the one where you
saw win 8030 due to the default rmem set to 16060). I've been using this
value in the past with older kernels because it allowed an integer number
of segments to fit into the default window, and offered optimal performance
with large numbers of concurrent connections. Since 2.6, tcp_moderate_rcvbuf
works very well and this is not needed anymore.
Anyway, it does not affect the test here. Good kernels are OK whatever the
default value, and bad kernels are bad whatever the default value too.
Hmmm finally it's this commit again :
2f53384 tcp: allow splice() to build full TSO packets
I'm saying "again" because we already diagnosed a similar effect several
months ago that was revealed by this patch and we fixed it with the
following one, though I remember that we weren't completely sure it
would fix everything :
bad115c tcp: do_tcp_sendpages() must try to push data out on oom conditions
Just out of curiosity, I tried to re-apply the patch above just after the
first one but it did not change anything (after all it changed a symptom
which appeared in different conditions).
Interestingly, this commit (2f53384) significantly improved performance
on spliced data over the loopback (more than 50% in this test). In 3.7,
it seems to have no positive effect anymore. I reverted it using the
following patch and now the problem is fixed (mtu=64k works fine now) :
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e457c7a..61e4517 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -935,7 +935,7 @@ wait_for_memory:
}
out:
- if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
+ if (copied)
tcp_push(sk, flags, mss_now, tp->nonagle);
return copied;
Regards,
Willy
* Re: Major network performance regression in 3.7
2013-01-06 15:51 ` Willy Tarreau
@ 2013-01-06 16:39 ` Eric Dumazet
2013-01-06 16:44 ` Willy Tarreau
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 16:39 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 16:51 +0100, Willy Tarreau wrote:
> Hi Eric,
>
> Oh sorry, I didn't really want to pollute the list with links and configs,
> especially during the initial report with various combined issues :-(
>
> The client is my old "inject" tool, available here :
>
> http://git.1wt.eu/web?p=inject.git
>
> The server is my "httpterm" tool, available here :
>
> http://git.1wt.eu/web?p=httpterm.git
> Use "-O3 -DENABLE_POLL -DENABLE_EPOLL -DENABLE_SPLICE" for CFLAGS.
>
> I'm starting httpterm this way :
> httpterm -D -L :8000 -P 256
> => it starts a server on port 8000, and sets pipe size to 256 kB. It
> uses SPLICE_F_MORE on output data but removing it did not fix the
> issue one of the early tests.
>
> Then I'm starting inject this way :
> inject -o 1 -u 1 -G 0:8000/?s=1g
> => 1 user, 1 object at a time, and fetch /?s=1g from the loopback.
> The server will then emit 1 GB of data using splice().
>
> It's possible to disable splicing on the server using -dS. The client
> "eats" data using recv(MSG_TRUNC) to avoid a useless copy.
>
> > TCP has very low defaults concerning initial window, and it appears you
> > set RCVBUF to even smaller values.
>
> Yes, you're right, my bootup scripts still change the default value, though
> I increase them to larger values during the tests (except the one where you
> saw win 8030 due to the default rmem set to 16060). I've been using this
> value in the past with older kernels because it allowed an integer number
> of segments to fit into the default window, and offered optimal performance
> with large numbers of concurrent connections. Since 2.6, tcp_moderate_rcvbuf
> works very well and this is not needed anymore.
>
> Anyway, it does not affect the test here. Good kernels are OK whatever the
> default value, and bad kernels are bad whatever the default value too.
>
> Hmmm finally it's this commit again :
>
> 2f53384 tcp: allow splice() to build full TSO packets
>
> I'm saying "again" because we already diagnosed a similar effect several
> months ago that was revealed by this patch and we fixed it with the
> following one, though I remember that we weren't completely sure it
> would fix everything :
>
> bad115c tcp: do_tcp_sendpages() must try to push data out on oom conditions
>
> Just out of curiosity, I tried to re-apply the patch above just after the
> first one but it did not change anything (after all it changed a symptom
> which appeared in different conditions).
>
> Interestingly, this commit (2f53384) significantly improved performance
> on spliced data over the loopback (more than 50% in this test). In 3.7,
> it seems to have no positive effect anymore. I reverted it using the
> following patch and now the problem is fixed (mtu=64k works fine now) :
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e457c7a..61e4517 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -935,7 +935,7 @@ wait_for_memory:
> }
>
> out:
> - if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> + if (copied)
> tcp_push(sk, flags, mss_now, tp->nonagle);
> return copied;
>
> Regards,
> Willy
>
Hmm, I'll have to check if this really can be reverted without hurting
vmsplice() again.
* Re: Major network performance regression in 3.7
2013-01-06 16:39 ` Eric Dumazet
@ 2013-01-06 16:44 ` Willy Tarreau
2013-01-06 17:10 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 16:44 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote:
> Hmm, I'll have to check if this really can be reverted without hurting
> vmsplice() again.
Looking at the code I've been wondering whether we shouldn't transform
the condition to perform the push if we can't push more segments, but
I don't know what to rely on. It would be something like this :
        if (copied &&
            (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more))
                tcp_push(sk, flags, mss_now, tp->nonagle);
Willy
* Re: Major network performance regression in 3.7
2013-01-06 16:44 ` Willy Tarreau
@ 2013-01-06 17:10 ` Eric Dumazet
2013-01-06 17:35 ` Willy Tarreau
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 17:10 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 17:44 +0100, Willy Tarreau wrote:
> On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote:
> > Hmm, I'll have to check if this really can be reverted without hurting
> > vmsplice() again.
>
> Looking at the code I've been wondering whether we shouldn't transform
> the condition to perform the push if we can't push more segments, but
> I don't know what to rely on. It would be something like this :
>
> if (copied &&
> (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more))
> tcp_push(sk, flags, mss_now, tp->nonagle);
Good point !
Maybe the following fix then ?
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ca2536..7ba0717 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -941,8 +941,10 @@ out:
return copied;
do_error:
- if (copied)
+ if (copied) {
+ flags &= ~MSG_SENDPAGE_NOTLAST;
goto out;
+ }
out_err:
return sk_stream_error(sk, flags, err);
}
* Re: Major network performance regression in 3.7
2013-01-06 17:10 ` Eric Dumazet
@ 2013-01-06 17:35 ` Willy Tarreau
2013-01-06 18:39 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 17:35 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sun, Jan 06, 2013 at 09:10:55AM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 17:44 +0100, Willy Tarreau wrote:
> > On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote:
> > > Hmm, I'll have to check if this really can be reverted without hurting
> > > vmsplice() again.
> >
> > Looking at the code I've been wondering whether we shouldn't transform
> > the condition to perform the push if we can't push more segments, but
> > I don't know what to rely on. It would be something like this :
> >
> > if (copied &&
> > (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more))
> > tcp_push(sk, flags, mss_now, tp->nonagle);
>
> Good point !
>
> Maybe the following fix then ?
>
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1ca2536..7ba0717 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -941,8 +941,10 @@ out:
> return copied;
>
> do_error:
> - if (copied)
> + if (copied) {
> + flags &= ~MSG_SENDPAGE_NOTLAST;
> goto out;
> + }
> out_err:
> return sk_stream_error(sk, flags, err);
> }
Unfortunately it does not work any better, which means to me
that we don't leave via this code path. I tried other tricks
which failed too. I need to understand this part better before
randomly fiddling with it.
Willy
* Re: Major network performance regression in 3.7
2013-01-06 17:35 ` Willy Tarreau
@ 2013-01-06 18:39 ` Eric Dumazet
2013-01-06 18:43 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 18:39 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 18:35 +0100, Willy Tarreau wrote:
> Unfortunately it does not work any better, which means to me
> that we don't leave via this code path. I tried other tricks
> which failed too. I need to understand this part better before
> randomly fiddling with it.
>
OK, now I have your test program, I can work on a fix, don't worry ;)
The MSG_SENDPAGE_NOTLAST logic needs to be tweaked.
* Re: Major network performance regression in 3.7
2013-01-06 18:39 ` Eric Dumazet
@ 2013-01-06 18:43 ` Eric Dumazet
2013-01-06 18:51 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 18:43 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 10:39 -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 18:35 +0100, Willy Tarreau wrote:
>
> > Unfortunately it does not work any better, which means to me
> > that we don't leave via this code path. I tried other tricks
> > which failed too. I need to understand this part better before
> > randomly fiddling with it.
> >
>
> OK, now I have your test program, I can work on a fix, don't worry ;)
>
> The MSG_SENDPAGE_NOTLAST logic needs to be tweaked.
>
(sd->len is usually 4096, which is expected, but sd->total_len value is
huge in your case, so we always set the flag in fs/splice.c)
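To make the failure mode concrete: a splice() caller may legally ask for far
more bytes than the pipe currently holds. sd->total_len then reflects that
huge request while sd->len stays at a single buffer, so the old test set
MSG_SENDPAGE_NOTLAST even on the last fragment and the final push never
happened. A hedged user-space sketch of such a forwarding call (not the
actual test program; the length and flags are assumptions):

/* Illustration only: ask splice() to move "as much as possible" from a
 * pipe to a TCP socket. The kernel caps the transfer at what the pipe
 * actually holds, but sd->total_len keeps the huge requested length
 * while sd->len stays at one buffer (typically 4096), so the old
 * (sd->len < sd->total_len) test in pipe_to_sendpage() was always true. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <sys/types.h>

static ssize_t forward_chunk(int pipe_rd, int sock)
{
	return splice(pipe_rd, NULL, sock, NULL, INT_MAX,
		      SPLICE_F_MORE | SPLICE_F_MOVE);
}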
* Re: Major network performance regression in 3.7
2013-01-06 18:43 ` Eric Dumazet
@ 2013-01-06 18:51 ` Eric Dumazet
2013-01-06 19:00 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 18:51 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
>
> (sd->len is usually 4096, which is expected, but sd->total_len value is
> huge in your case, so we always set the flag in fs/splice.c)
I am testing :
if (sd->len < sd->total_len && pipe->nrbufs > 1)
more |= MSG_SENDPAGE_NOTLAST;
* Re: Major network performance regression in 3.7
2013-01-06 18:51 ` Eric Dumazet
@ 2013-01-06 19:00 ` Eric Dumazet
2013-01-06 19:34 ` Willy Tarreau
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 19:00 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
> >
> > (sd->len is usually 4096, which is expected, but sd->total_len value is
> > huge in your case, so we always set the flag in fs/splice.c)
>
> I am testing :
>
> if (sd->len < sd->total_len && pipe->nrbufs > 1)
> more |= MSG_SENDPAGE_NOTLAST;
>
Yes, this should fix the problem :
If there is no following buffer in the pipe, we should not set NOTLAST.
diff --git a/fs/splice.c b/fs/splice.c
index 8890604..6909d89 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
return -EINVAL;
more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
- if (sd->len < sd->total_len)
+
+ if (sd->len < sd->total_len && pipe->nrbufs > 1)
more |= MSG_SENDPAGE_NOTLAST;
+
return file->f_op->sendpage(file, buf->page, buf->offset,
sd->len, &pos, more);
}
* Re: Major network performance regression in 3.7
2013-01-06 19:00 ` Eric Dumazet
@ 2013-01-06 19:34 ` Willy Tarreau
2013-01-06 19:39 ` Eric Dumazet
2013-01-06 21:49 ` Major network performance regression in 3.7 John Stoffel
0 siblings, 2 replies; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 19:34 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
> > >
> > > (sd->len is usually 4096, which is expected, but sd->total_len value is
> > > huge in your case, so we always set the flag in fs/splice.c)
> >
> > I am testing :
> >
> > if (sd->len < sd->total_len && pipe->nrbufs > 1)
> > more |= MSG_SENDPAGE_NOTLAST;
> >
>
> Yes, this should fix the problem :
>
> If there is no following buffer in the pipe, we should not set NOTLAST.
>
> diff --git a/fs/splice.c b/fs/splice.c
> index 8890604..6909d89 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
> return -EINVAL;
>
> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
> - if (sd->len < sd->total_len)
> +
> + if (sd->len < sd->total_len && pipe->nrbufs > 1)
> more |= MSG_SENDPAGE_NOTLAST;
> +
> return file->f_op->sendpage(file, buf->page, buf->offset,
> sd->len, &pos, more);
> }
OK it works like a charm here now ! I can't break it anymore, so it
looks like you finally got it !
I noticed that the data rate was higher when the loopback's MTU
is exactly a multiple of 4096 (making the 64k choice optimal)
while I would have assumed that in order to efficiently splice
TCP segments, we'd need to have some space for IP/TCP headers
and n*4k for the payload.
I also got the transfer freezes again a few times when starting
tcpdump on the server, but this is not 100% reproducible I'm afraid.
So I'll bring this back when I manage to get some analysable pattern.
The spliced transfer through the whole haproxy chain works fine again
at 10 Gbps with your fix. The issue is closed for me. Feel free to add
my Tested-By if you want.
Thank you Eric :-)
Willy
* Re: Major network performance regression in 3.7
2013-01-06 19:34 ` Willy Tarreau
@ 2013-01-06 19:39 ` Eric Dumazet
2013-01-06 19:53 ` Willy Tarreau
2013-01-06 21:49 ` Major network performance regression in 3.7 John Stoffel
1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-06 19:39 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev, linux-kernel
On Sun, 2013-01-06 at 20:34 +0100, Willy Tarreau wrote:
> OK it works like a charm here now ! I can't break it anymore, so it
> looks like you finally got it !
>
> I noticed that the data rate was higher when the loopback's MTU
> is exactly a multiple of 4096 (making the 64k choice optimal)
> while I would have assumed that in order to efficiently splice
> TCP segments, we'd need to have some space for IP/TCP headers
> and n*4k for the payload.
>
> I also got the transfer freezes again a few times when starting
> tcpdump on the server, but this is not 100% reproducible I'm afraid.
> So I'll bring this back when I manage to get some analysable pattern.
>
> The spliced transfer through the whole haproxy chain works fine again
> at 10 Gbps with your fix. The issue is closed for me. Feel free to add
> my Tested-By if you want.
>
Good to know !
What is the max speed you get now ?
* Re: Major network performance regression in 3.7
2013-01-06 19:39 ` Eric Dumazet
@ 2013-01-06 19:53 ` Willy Tarreau
2013-01-07 4:21 ` [PATCH] tcp: fix MSG_SENDPAGE_NOTLAST logic Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 19:53 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-kernel
On Sun, Jan 06, 2013 at 11:39:31AM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 20:34 +0100, Willy Tarreau wrote:
>
> > OK it works like a charm here now ! I can't break it anymore, so it
> > looks like you finally got it !
> >
> > I noticed that the data rate was higher when the loopback's MTU
> > is exactly a multiple of 4096 (making the 64k choice optimal)
> > while I would have assumed that in order to efficiently splice
> > TCP segments, we'd need to have some space for IP/TCP headers
> > and n*4k for the payload.
> >
> > I also got the transfer freezes again a few times when starting
> > tcpdump on the server, but this is not 100% reproducible I'm afraid.
> > So I'll bring this back when I manage to get some analysable pattern.
> >
> > The spliced transfer through all the chain haproxy works fine again
> > at 10gig with your fix. The issue is closed for me. Feel free to add
> > my Tested-By if you want.
> >
>
> Good to know !
>
> What is the max speed you get now ?
Line rate with 1500 MTU and LRO enabled :
# time eth1(ikb ipk okb opk) eth2(ikb ipk okb opk)
1357060023 19933.3 41527.7 9355538.2 62167.7 9757888.1 808701.1 19400.3 40417.7
1357060024 26124.1 54425.5 9290064.9 48804.4 9778294.0 810210.0 18068.8 37643.3
1357060025 27015.2 56281.1 9296115.3 46868.8 9797125.9 811271.1 8790.1 18312.2
1357060026 27556.0 57408.8 9291701.4 46805.5 9805371.6 811410.0 3494.8 7280.0
1357060027 27577.0 57452.2 9293606.8 46804.4 9806122.3 811314.4 2558.7 5330.0
1357060028 27476.1 57242.2 9296885.4 46830.0 9794537.3 810527.7 2516.1 5242.2
(the ~9.3M kbps column is the output rate on eth1, the ~9.8M kbps column is the input rate on eth2)
eth1=facing the client
eth2=facing the server
Top reports the following usage :
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 31.7%id, 0.0%wa, 0.0%hi, 68.3%si, 0.0%st
Cpu1 : 1.0%us, 37.3%sy, 0.0%ni, 61.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
(IRQ bound to cpu 0, haproxy to cpu 1)
This is a core2duo 2.66 GHz and the myris are 1st generation.
BTW I was very happy to see that the LRO->GRO conversion patches in 3.8-rc2
don't affect the byte rate anymore (just a minor CPU usage increase, but this
is not critical here). Now I won't complain about it being slower anymore, you
won :-)
With the GRO patches backported, still at 1500 MTU but with GRO now :
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 28.7%id, 0.0%wa, 0.0%hi, 71.3%si, 0.0%st
Cpu1 : 0.0%us, 37.6%sy, 0.0%ni, 62.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
# time eth1(ikb ipk okb opk) eth2(ikb ipk okb opk)
1357058637 18319.3 38165.5 9401736.3 65159.9 9761613.4 808963.3 19403.6 40424.4
1357058638 20009.8 41687.7 9400903.7 62706.6 9770555.8 809522.2 18696.5 38951.1
1357058639 25439.5 52999.9 9301635.3 50267.7 9773666.7 809721.1 19174.1 39946.6
1357058640 26808.2 55850.0 9298301.4 46876.6 9790470.1 810843.3 12408.7 25851.1
1357058641 27110.9 56481.1 9297009.2 46832.2 9803308.4 811339.9 5692.5 11859.9
1357058642 27411.1 57106.6 9291419.2 46796.6 9806846.5 811378.8 2804.4 5842.2
This kernel is getting really good :-)
Cheers,
Willy
* Re: Major network performance regression in 3.7
2013-01-06 19:34 ` Willy Tarreau
2013-01-06 19:39 ` Eric Dumazet
@ 2013-01-06 21:49 ` John Stoffel
2013-01-06 21:52 ` Willy Tarreau
1 sibling, 1 reply; 41+ messages in thread
From: John Stoffel @ 2013-01-06 21:49 UTC (permalink / raw)
To: Willy Tarreau; +Cc: Eric Dumazet, netdev, linux-kernel
>>>>> "Willy" == Willy Tarreau <w@1wt.eu> writes:
Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
>> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
>> > >
>> > > (sd->len is usually 4096, which is expected, but sd->total_len value is
>> > > huge in your case, so we always set the flag in fs/splice.c)
>> >
>> > I am testing :
>> >
>> > if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> > more |= MSG_SENDPAGE_NOTLAST;
>> >
>>
>> Yes, this should fix the problem :
>>
>> If there is no following buffer in the pipe, we should not set NOTLAST.
>>
>> diff --git a/fs/splice.c b/fs/splice.c
>> index 8890604..6909d89 100644
>> --- a/fs/splice.c
>> +++ b/fs/splice.c
>> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
>> return -EINVAL;
>>
>> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
>> - if (sd->len < sd->total_len)
>> +
>> + if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> more |= MSG_SENDPAGE_NOTLAST;
>> +
>> return file->f_op->sendpage(file, buf->page, buf->offset,
>> sd->len, &pos, more);
>> }
Willy> OK it works like a charm here now ! I can't break it anymore, so it
Willy> looks like you finally got it !
It's still broken, there's no comments in the code to explain all this
magic to mere mortals! *grin*
John
* Re: Major network performance regression in 3.7
2013-01-06 21:49 ` Major network performance regression in 3.7 John Stoffel
@ 2013-01-06 21:52 ` Willy Tarreau
2013-01-06 21:55 ` John Stoffel
0 siblings, 1 reply; 41+ messages in thread
From: Willy Tarreau @ 2013-01-06 21:52 UTC (permalink / raw)
To: John Stoffel; +Cc: Eric Dumazet, netdev, linux-kernel
On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote:
> >>>>> "Willy" == Willy Tarreau <w@1wt.eu> writes:
>
> Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
> >> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
> >> > >
> >> > > (sd->len is usually 4096, which is expected, but sd->total_len value is
> >> > > huge in your case, so we always set the flag in fs/splice.c)
> >> >
> >> > I am testing :
> >> >
> >> > if (sd->len < sd->total_len && pipe->nrbufs > 1)
> >> > more |= MSG_SENDPAGE_NOTLAST;
> >> >
> >>
> >> Yes, this should fix the problem :
> >>
> >> If there is no following buffer in the pipe, we should not set NOTLAST.
> >>
> >> diff --git a/fs/splice.c b/fs/splice.c
> >> index 8890604..6909d89 100644
> >> --- a/fs/splice.c
> >> +++ b/fs/splice.c
> >> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
> >> return -EINVAL;
> >>
> >> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
> >> - if (sd->len < sd->total_len)
> >> +
> >> + if (sd->len < sd->total_len && pipe->nrbufs > 1)
> >> more |= MSG_SENDPAGE_NOTLAST;
> >> +
> >> return file->f_op->sendpage(file, buf->page, buf->offset,
> >> sd->len, &pos, more);
> >> }
>
> Willy> OK it works like a charm here now ! I can't break it anymore, so it
> Willy> looks like you finally got it !
>
> It's still broken, there's no comments in the code to explain all this
> magic to mere mortals! *grin*
I would generally agree, but when Eric fixes such a thing, he
generally goes with lengthy details in the commit message.
Willy
* Re: Major network performance regression in 3.7
2013-01-06 21:52 ` Willy Tarreau
@ 2013-01-06 21:55 ` John Stoffel
0 siblings, 0 replies; 41+ messages in thread
From: John Stoffel @ 2013-01-06 21:55 UTC (permalink / raw)
To: Willy Tarreau; +Cc: John Stoffel, Eric Dumazet, netdev, linux-kernel
>>>>> "Willy" == Willy Tarreau <w@1wt.eu> writes:
Willy> On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote:
>> >>>>> "Willy" == Willy Tarreau <w@1wt.eu> writes:
>>
Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
>> >> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
>> >> > >
>> >> > > (sd->len is usually 4096, which is expected, but sd->total_len value is
>> >> > > huge in your case, so we always set the flag in fs/splice.c)
>> >> >
>> >> > I am testing :
>> >> >
>> >> > if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> >> > more |= MSG_SENDPAGE_NOTLAST;
>> >> >
>> >>
>> >> Yes, this should fix the problem :
>> >>
>> >> If there is no following buffer in the pipe, we should not set NOTLAST.
>> >>
>> >> diff --git a/fs/splice.c b/fs/splice.c
>> >> index 8890604..6909d89 100644
>> >> --- a/fs/splice.c
>> >> +++ b/fs/splice.c
>> >> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
>> >> return -EINVAL;
>> >>
>> >> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
>> >> - if (sd->len < sd->total_len)
>> >> +
>> >> + if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> >> more |= MSG_SENDPAGE_NOTLAST;
>> >> +
>> >> return file->f_op->sendpage(file, buf->page, buf->offset,
>> >> sd->len, &pos, more);
>> >> }
>>
Willy> OK it works like a charm here now ! I can't break it anymore, so it
Willy> looks like you finally got it !
>>
>> It's still broken, there's no comments in the code to explain all this
>> magic to mere mortals! *grin*
Willy> I would generally agree, but when Eric fixes such a thing, he
Willy> generally goes with lengthy details in the commit message.
I'm sure he will too, I just wanted to nudge him because while I sorta
followed this discussion, I see lots of pain down the road if the code
isn't updated with some nice big fat comments.
Great job finding this code and testing, testing, testing.
John
* [PATCH] tcp: fix MSG_SENDPAGE_NOTLAST logic
2013-01-06 19:53 ` Willy Tarreau
@ 2013-01-07 4:21 ` Eric Dumazet
2013-01-07 4:59 ` David Miller
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2013-01-07 4:21 UTC (permalink / raw)
To: Willy Tarreau, David Miller; +Cc: netdev, linux-kernel
From: Eric Dumazet <edumazet@google.com>
commit 35f9c09fe9c72e (tcp: tcp_sendpages() should call tcp_push() once)
added an internal flag : MSG_SENDPAGE_NOTLAST meant to be set on all
frags but the last one for a splice() call.
The condition used to set the flag in pipe_to_sendpage() relied on
splice() user passing the exact number of bytes present in the pipe,
or a smaller one.
But some programs pass an arbitrarily high value, and the test fails.
The effect of this bug is a lack of tcp_push() at the end of a
splice(pipe -> socket) call, and possibly very slow or erratic TCP
sessions.
We should test both sd->total_len and the fact that another fragment
is in the pipe (pipe->nrbufs > 1).
Many thanks to Willy for providing a very clear bug report, bisection
and test programs.
Reported-by: Willy Tarreau <w@1wt.eu>
Bisected-by: Willy Tarreau <w@1wt.eu>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
fs/splice.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/splice.c b/fs/splice.c
index 8890604..6909d89 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
return -EINVAL;
more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
- if (sd->len < sd->total_len)
+
+ if (sd->len < sd->total_len && pipe->nrbufs > 1)
more |= MSG_SENDPAGE_NOTLAST;
+
return file->f_op->sendpage(file, buf->page, buf->offset,
sd->len, &pos, more);
}
* Re: [PATCH] tcp: fix MSG_SENDPAGE_NOTLAST logic
2013-01-07 4:21 ` [PATCH] tcp: fix MSG_SENDPAGE_NOTLAST logic Eric Dumazet
@ 2013-01-07 4:59 ` David Miller
0 siblings, 0 replies; 41+ messages in thread
From: David Miller @ 2013-01-07 4:59 UTC (permalink / raw)
To: eric.dumazet; +Cc: w, netdev, linux-kernel
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 06 Jan 2013 20:21:49 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> commit 35f9c09fe9c72e (tcp: tcp_sendpages() should call tcp_push() once)
> added an internal flag : MSG_SENDPAGE_NOTLAST meant to be set on all
> frags but the last one for a splice() call.
>
> The condition used to set the flag in pipe_to_sendpage() relied on
> splice() user passing the exact number of bytes present in the pipe,
> or a smaller one.
>
> But some programs pass an arbitrarily high value, and the test fails.
>
> The effect of this bug is a lack of tcp_push() at the end of a
> splice(pipe -> socket) call, and possibly very slow or erratic TCP
> sessions.
>
> We should test both sd->total_len and the fact that another fragment
> is in the pipe (pipe->nrbufs > 1).
>
> Many thanks to Willy for providing a very clear bug report, bisection
> and test programs.
>
> Reported-by: Willy Tarreau <w@1wt.eu>
> Bisected-by: Willy Tarreau <w@1wt.eu>
> Tested-by: Willy Tarreau <w@1wt.eu>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied and queued up for -stable, thanks everyone.
* Re: [PATCH net-next] net: splice: avoid high order page splitting
2013-01-06 7:31 ` [PATCH net-next] net: splice: avoid high order page splitting Eric Dumazet
@ 2013-01-07 5:07 ` David Miller
0 siblings, 0 replies; 41+ messages in thread
From: David Miller @ 2013-01-07 5:07 UTC (permalink / raw)
To: eric.dumazet; +Cc: w, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 05 Jan 2013 23:31:18 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> splice() can handle pages of any order, but network code tries hard to
> split them in PAGE_SIZE units. Not quite successfully anyway, as
> __splice_segment() assumed poff < PAGE_SIZE. This is true for
> the skb->data part, not necessarily for the fragments.
>
> This patch removes this logic to give the pages as they are in the skb.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied.
Thread overview: 41+ messages
2013-01-05 21:49 Major network performance regression in 3.7 Willy Tarreau
2013-01-05 23:18 ` Eric Dumazet
2013-01-05 23:29 ` Willy Tarreau
2013-01-06 0:02 ` Eric Dumazet
2013-01-06 0:50 ` Willy Tarreau
2013-01-06 1:21 ` Eric Dumazet
2013-01-06 1:30 ` Willy Tarreau
2013-01-06 1:40 ` Eric Dumazet
2013-01-06 1:51 ` Eric Dumazet
2013-01-06 2:16 ` Eric Dumazet
2013-01-06 2:18 ` Willy Tarreau
2013-01-06 2:22 ` Eric Dumazet
2013-01-06 2:32 ` Willy Tarreau
2013-01-06 2:44 ` Eric Dumazet
2013-01-06 2:52 ` Willy Tarreau
2013-01-06 7:31 ` [PATCH net-next] net: splice: avoid high order page splitting Eric Dumazet
2013-01-07 5:07 ` David Miller
2013-01-06 7:35 ` Major network performance regression in 3.7 Eric Dumazet
2013-01-06 9:24 ` Willy Tarreau
2013-01-06 10:25 ` Willy Tarreau
2013-01-06 11:46 ` Romain Francoise
2013-01-06 11:53 ` Willy Tarreau
2013-01-06 12:01 ` Willy Tarreau
2013-01-06 14:59 ` Eric Dumazet
2013-01-06 15:51 ` Willy Tarreau
2013-01-06 16:39 ` Eric Dumazet
2013-01-06 16:44 ` Willy Tarreau
2013-01-06 17:10 ` Eric Dumazet
2013-01-06 17:35 ` Willy Tarreau
2013-01-06 18:39 ` Eric Dumazet
2013-01-06 18:43 ` Eric Dumazet
2013-01-06 18:51 ` Eric Dumazet
2013-01-06 19:00 ` Eric Dumazet
2013-01-06 19:34 ` Willy Tarreau
2013-01-06 19:39 ` Eric Dumazet
2013-01-06 19:53 ` Willy Tarreau
2013-01-07 4:21 ` [PATCH] tcp: fix MSG_SENDPAGE_NOTLAST logic Eric Dumazet
2013-01-07 4:59 ` David Miller
2013-01-06 21:49 ` Major network performance regression in 3.7 John Stoffel
2013-01-06 21:52 ` Willy Tarreau
2013-01-06 21:55 ` John Stoffel