* bad TSO performance in 2.6.9-rc2-BK
@ 2004-09-20 6:30 Anton Blanchard
2004-09-20 15:54 ` Nivedita Singhvi
2004-09-20 20:30 ` Andi Kleen
0 siblings, 2 replies; 97+ messages in thread
From: Anton Blanchard @ 2004-09-20 6:30 UTC (permalink / raw)
To: netdev
Hi,
I just tried latest 2.6.9-rc2-BK on a machine with an e1000 on it. With
TSO off it does about 100MB/sec. With TSO on it does between 1MB/sec and
10MB/sec.
Here are some tcpdumps of socklib (just a TCP bw test). The client makes the
connection and the server streams bytes down the connection.
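Socklib itself isn't reproduced in this thread; purely for reference, here is a minimal sketch of an equivalent server half under the assumptions above (the client connects, the server streams 64kB writes until the client goes away). Error handling is elided; port 7001 matches the traces below.

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
	static char buf[65536];			/* 64kB sends, as in the test */
	struct sockaddr_in sin;
	int ls, s;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(7001);		/* port from the traces */

	ls = socket(AF_INET, SOCK_STREAM, 0);
	bind(ls, (struct sockaddr *)&sin, sizeof(sin));
	listen(ls, 1);
	s = accept(ls, NULL, NULL);
	while (write(s, buf, sizeof(buf)) > 0)	/* stream bytes down the connection */
		;
	return 0;
}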
client:
14:59:21.745368 IP client.32818 > server.7001: S 2966429246:2966429246(0) win 5840 <mss 1460,sackOK,timestamp 102828576 0,nop,wscale 0>
14:59:21.745497 IP server.7001 > client.32818: S 3127678265:3127678265(0) ack 2966429247 win 5792 <mss 1460,sackOK,timestamp 1452516 102828576,nop,wscale 2>
14:59:21.745511 IP client.32818 > server.7001: . ack 1 win 5840 <nop,nop,timestamp 102828576 1452516>
14:59:21.746245 IP server.7001 > client.32818: . 1:1449(1448) ack 1 win 1448 <nop,nop,timestamp 1452516 102828576>
14:59:21.746253 IP server.7001 > client.32818: . 1449:2897(1448) ack 1 win 1448 <nop,nop,timestamp 1452516 102828576>
14:59:21.746273 IP client.32818 > server.7001: . ack 1449 win 8688 <nop,nop,timestamp 102828576 1452516>
14:59:21.746284 IP client.32818 > server.7001: . ack 2897 win 11584 <nop,nop,timestamp 102828576 1452516>
14:59:21.746492 IP server.7001 > client.32818: . 2897:4345(1448) ack 1 win 1448 <nop,nop,timestamp 1452516 102828576>
14:59:21.746500 IP server.7001 > client.32818: P 4345:5793(1448) ack 1 win 1448 <nop,nop,timestamp 1452516 102828576>
14:59:21.746515 IP client.32818 > server.7001: . ack 4345 win 14480 <nop,nop,timestamp 102828576 1452516>
14:59:21.746525 IP client.32818 > server.7001: . ack 5793 win 17376 <nop,nop,timestamp 102828576 1452516>
14:59:21.746742 IP server.7001 > client.32818: . 5793:7241(1448) ack 1 win 1448 <nop,nop,timestamp 1452517 102828576>
14:59:21.746749 IP server.7001 > client.32818: . 7241:8689(1448) ack 1 win 1448 <nop,nop,timestamp 1452517 1028
...
It finally settles down and we see 3 packets per ack:
14:59:24.343034 IP client.32818 > server.7001: . ack 13078337 win 34752 <nop,nop,timestamp 102828836 1455113>
14:59:24.343367 IP server.7001 > client.32818: . 13078337:13079785(1448) ack 1 win 1448 <nop,nop,timestamp 1455113 102828836>
14:59:24.343375 IP server.7001 > client.32818: . 13079785:13081233(1448) ack 1 win 1448 <nop,nop,timestamp 1455113 102828836>
14:59:24.343380 IP server.7001 > client.32818: . 13081233:13082681(1448) ack 1 win 1448 <nop,nop,timestamp 1455113 102828836>
14:59:24.343419 IP client.32818 > server.7001: . ack 13082681 win 34752 <nop,nop,timestamp 102828836 1455113>
14:59:24.343750 IP server.7001 > client.32818: . 13082681:13084129(1448) ack 1 win 1448 <nop,nop,timestamp 1455114 102828836>
14:59:24.343759 IP server.7001 > client.32818: . 13084129:13085577(1448) ack 1 win 1448 <nop,nop,timestamp 1455114 102828836>
14:59:24.343765 IP server.7001 > client.32818: P 13085577:13087025(1448) ack 1 win 1448 <nop,nop,timestamp 1455114 102828836>
server:
15:44:04.939695 IP client.32823 > server.7001: S 4245828116:4245828116(0) win 5840 <mss 1460,sackOK,timestamp 102953269 0,nop,wscale 0>
15:44:04.939703 IP server.7001 > client.32823: S 130434711:130434711(0) ack 4245828117 win 5792 <mss 1460,sackOK,timestamp 2699455 102953269,nop,wscale 2>
15:44:04.939899 IP client.32823 > server.7001: . ack 1 win 5840 <nop,nop,timestamp 102953269 2699455>
15:44:04.940439 IP bad-len 0
15:44:04.940649 IP client.32823 > server.7001: . ack 1449 win 8688 <nop,nop,timestamp 102953269 2699456>
15:44:04.940650 IP client.32823 > server.7001: . ack 2897 win 11584 <nop,nop,timestamp 102953269 2699456>
15:44:04.940675 IP bad-len 0
...
This is what it looks like after things settle down. Nasty how tcpdump doesn't
understand TSO bundles. Notice how we send out one TSO bundle and then wait
for the ack:
15:44:05.068048 IP client.32823 > server.7001: . ack 2213993 win 34752 <nop,nop,timestamp 102953282 2699583>
15:44:05.068059 IP bad-len 0
15:44:05.068298 IP client.32823 > server.7001: . ack 2218337 win 34752 <nop,nop,timestamp 102953282 2699584>
15:44:05.068310 IP bad-len 0
15:44:05.068549 IP client.32823 > server.7001: . ack 2222681 win 34752 <nop,nop,timestamp 102953282 2699584>
15:44:05.068565 IP bad-len 0
From the first trace we see that each TSO bundle consists of 3 packets. The
application is doing 64kB sends, so it's surprising that we only pack 3 packets
into a TSO bundle. It looks like we only think there is a 5k window on
this connection when TSO is enabled.
Anton
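A back-of-envelope check of those numbers, sketched from the factor logic in tcp_sync_mss() (the relevant code is quoted in patches later in this thread): the TSO factor is clamped to snd_cwnd, and with a 1448-byte MSS the initial congestion window is 3, so the bundle freezes at 3 segments unless the factor is ever recomputed.

#include <stdio.h>

int main(void)
{
	unsigned int mss = 1448;		/* from the traces above */
	unsigned int large_mss = 65535 - 20 - 32; /* IPv4 + TCP w/ timestamps */
	unsigned int snd_cwnd = 3;		/* initial cwnd for MSS > 1095 */
	unsigned int factor = large_mss / mss;	/* ~45 */

	if (factor > snd_cwnd)			/* the clamp in tcp_sync_mss() */
		factor = snd_cwnd;
	/* prints: bundle = 3 * 1448 = 4344 bytes -- the "5k window" */
	printf("bundle = %u * %u = %u bytes\n", factor, mss, factor * mss);
	return 0;
}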
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-20 6:30 bad TSO performance in 2.6.9-rc2-BK Anton Blanchard @ 2004-09-20 15:54 ` Nivedita Singhvi 2004-09-21 15:55 ` Anton Blanchard 2004-09-20 20:30 ` Andi Kleen 1 sibling, 1 reply; 97+ messages in thread From: Nivedita Singhvi @ 2004-09-20 15:54 UTC (permalink / raw) To: Anton Blanchard; +Cc: netdev Anton Blanchard wrote: > Hi, > > I just tried latest 2.6.9-rc2-BK on a machine with an e1000 on it. With > TSO off it does about 100MB/sec. With TSO on it does between 1MB/sec and > 10MB/sec. Hey Anton, yep, I was just debugging this. The reworked TSO implementation in mainline is still not cooked. Don't worry, the final state won't be this bad :). Could you echo 0 > /proc/sys/net/ipv4/tcp_bic and redo the test, please? ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-20 15:54 ` Nivedita Singhvi @ 2004-09-21 15:55 ` Anton Blanchard 0 siblings, 0 replies; 97+ messages in thread From: Anton Blanchard @ 2004-09-21 15:55 UTC (permalink / raw) To: Nivedita Singhvi; +Cc: netdev Hi Niv, > Hey Anton, yep, I was just debugging this. The reworked TSO > implementation in mainline is still not cooked. Don't worry, the final state won't > be this bad :). Cool :) > Could you echo 0 > /proc/sys/net/ipv4/tcp_bic and redo the test, > please? It didn't seem to help. Anton ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-20 6:30 bad TSO performance in 2.6.9-rc2-BK Anton Blanchard 2004-09-20 15:54 ` Nivedita Singhvi @ 2004-09-20 20:30 ` Andi Kleen 2004-09-21 22:58 ` David S. Miller 1 sibling, 1 reply; 97+ messages in thread From: Andi Kleen @ 2004-09-20 20:30 UTC (permalink / raw) To: Anton Blanchard; +Cc: netdev On Mon, Sep 20, 2004 at 04:30:13PM +1000, Anton Blanchard wrote: > > Hi, > > I just tried latest 2.6.9-rc2-BK on a machine with an e1000 on it. With > TSO off it does about 100MB/sec. With TSO on it does between 1MB/sec and > 10MB/sec. I see the same problem here, but it's even worse. I only get 150-200KB/s sending data with scp from a fast machine with e1000 with a gigabit link. netperf also gives only 250KB/s. -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-20 20:30 ` Andi Kleen @ 2004-09-21 22:58 ` David S. Miller 2004-09-22 14:00 ` Andi Kleen 0 siblings, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-21 22:58 UTC (permalink / raw) To: Andi Kleen; +Cc: anton, netdev On Mon, 20 Sep 2004 22:30:21 +0200 Andi Kleen <ak@suse.de> wrote: > I see the same problem here, but it's even worse. I only get 150-200KB/s > sending data with scp from a fast machine with e1000 with a gigabit link. > netperf also gives only 250KB/s. So I re-enabled TSO support in the loopback driver to try and reproduce this, but I can't. There has been a lot of churn in this area so please make sure you are using the latest sources, all of my current TSO fixes are in Linus's BK tree. And, take the patch below and do a loopback bandwidth test before and after the patch is applied. Do things slow down when loopback has TSO enabled just as it does for your gigabit interfaces? (If you want to use the ethtool bits included here, you'll have to first recompile the ethtool utility with the non-sense "eth" and "usb" device name checks removed...) ===== drivers/net/loopback.c 1.17 vs edited ===== --- 1.17/drivers/net/loopback.c 2004-06-22 14:07:33 -07:00 +++ edited/drivers/net/loopback.c 2004-09-21 15:33:04 -07:00 @@ -49,6 +49,7 @@ #include <linux/netdevice.h> #include <linux/etherdevice.h> #include <linux/skbuff.h> +#include <linux/ethtool.h> #include <net/sock.h> #include <net/checksum.h> #include <linux/if_ether.h> /* For the statistics structure. */ @@ -183,6 +184,17 @@ return stats; } +u32 loopback_get_link(struct net_device *dev) +{ + return 1; +} + +static struct ethtool_ops loopback_ethtool_ops = { + .get_link = loopback_get_link, + .get_tso = ethtool_op_get_tso, + .set_tso = ethtool_op_set_tso, +}; + struct net_device loopback_dev = { .name = "lo", .mtu = (16 * 1024) + 20 + 20 + 12, @@ -198,7 +210,9 @@ .flags = IFF_LOOPBACK, .features = NETIF_F_SG|NETIF_F_FRAGLIST |NETIF_F_NO_CSUM|NETIF_F_HIGHDMA + |NETIF_F_TSO |NETIF_F_LLTX, + .ethtool_ops = &loopback_ethtool_ops, }; /* Setup and register the of the LOOPBACK device. */ ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-21 22:58 ` David S. Miller @ 2004-09-22 14:00 ` Andi Kleen 2004-09-22 18:12 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Andi Kleen @ 2004-09-22 14:00 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, anton, netdev On Tue, Sep 21, 2004 at 03:58:35PM -0700, David S. Miller wrote: > On Mon, 20 Sep 2004 22:30:21 +0200 > Andi Kleen <ak@suse.de> wrote: > > > I see the same problem here, but it's even worse. I only get 150-200KB/s > > sending data with scp from a fast machine with e1000 with a gigabit link. > > netperf also gives only 250KB/s. > > So I re-enabled TSO support in the loopback driver to try and > reproduce this, but I can't. > > There has been a lot of churn in this area so please make sure you > are using the latest sources, all of my current TSO fixes are in > Linus's BK tree. I tried it with rc2bk8 again and the performance is much better (22MB/s) than what I got earlier with 2.6.9-rc2, but still far below what 2.6.5 gets on the same hardware (68MB/s). Both tests with netperf over e1000. > And, take the patch below and do a loopback bandwidth test before > and after the patch is applied. Do things slow down when loopback > has TSO enabled just as it does for your gigabit interfaces? Without TSO ~755MB/s, with TSO ~600-620MB/s (results seem to be a bit variable). -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 14:00 ` Andi Kleen @ 2004-09-22 18:12 ` David S. Miller 2004-09-22 19:55 ` Andi Kleen 0 siblings, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-22 18:12 UTC (permalink / raw) To: Andi Kleen; +Cc: ak, anton, netdev On Wed, 22 Sep 2004 16:00:00 +0200 Andi Kleen <ak@suse.de> wrote: > I tried it with rc2bk8 again and the performance is much better (22MB/s) > than what I got earlier with 2.6.9-rc2, but still far below what 2.6.5 gets > on the same hardware (68MB/s). > > Both tests with netperf over e1000. Great, please try one more thing to help me narrow this down. Rerun your e1000 tests after going: ethtool -K eth? tso off and see if that gets you back to 2.6.5-era performance. Thanks Andi. ^ permalink raw reply [flat|nested] 97+ messages in thread
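For reference, "ethtool -K ethX tso off" boils down to the ETHTOOL_STSO ioctl; a minimal standalone sketch (the interface name is an example and error checking is elided):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_value ev;
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);	/* any socket will do */

	memset(&ev, 0, sizeof(ev));
	ev.cmd = ETHTOOL_STSO;			/* set TSO... */
	ev.data = 0;				/* ...to off */

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);	/* example name */
	ifr.ifr_data = (char *)&ev;

	return ioctl(fd, SIOCETHTOOL, &ifr);	/* 0 on success */
}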
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 18:12 ` David S. Miller @ 2004-09-22 19:55 ` Andi Kleen 2004-09-22 20:07 ` Nivedita Singhvi ` (2 more replies) 0 siblings, 3 replies; 97+ messages in thread From: Andi Kleen @ 2004-09-22 19:55 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, anton, netdev On Wed, Sep 22, 2004 at 11:12:09AM -0700, David S. Miller wrote: > On Wed, 22 Sep 2004 16:00:00 +0200 > Andi Kleen <ak@suse.de> wrote: > > > I tried it with rc2bk8 again and the performance is much better (22MB/s) > > than what I got earlier with 2.6.9-rc2, but still far below what 2.6.5 gets > > on the same hardware (68MB/s). > > > > Both tests with netperf over e1000. > > Great, please try one more thing to help me narrow this down. > Rerun your e1000 tests after going: > > ethtool -K eth? tso off > > and see if that gets you back to 2.6.5-era performance. With TSO off I get the same performance as on 2.6.5. I must add that this is a CSA e1000 (directly integrated into the chipset and doesn't use PCI) and TSO doesn't seem to bring any advantage. On 2.6.5 the performance is the same with TSO on or off. -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 19:55 ` Andi Kleen @ 2004-09-22 20:07 ` Nivedita Singhvi 2004-09-22 20:30 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Nivedita Singhvi @ 2004-09-22 20:07 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, anton, netdev Andi Kleen wrote: > With TSO off I get the same performance as on 2.6.5. > > I must add that this is a CSA e1000 (directly integrated into > the chipset and doesn't use PCI) and TSO doesn't seem to > bring any advantage. On 2.6.5 the performance is the same > with TSO on or off. Andi, was that with a netperf TCP stream test? I would not have thought there would be no difference prior to the changes DaveM made recently (now we obey congestion window). We certainly got quite a bit of a difference running SPECWeb etc, but that was on the e1000s. Nivedita ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 20:07 ` Nivedita Singhvi @ 2004-09-22 20:30 ` David S. Miller 2004-09-22 20:56 ` Nivedita Singhvi 2004-09-22 21:56 ` Andi Kleen 0 siblings, 2 replies; 97+ messages in thread From: David S. Miller @ 2004-09-22 20:30 UTC (permalink / raw) To: Nivedita Singhvi; +Cc: ak, anton, netdev On Wed, 22 Sep 2004 13:07:08 -0700 Nivedita Singhvi <niv@us.ibm.com> wrote: > Andi, was that with a netperf TCP stream test? I > would not have thought there would be no difference > prior to the changes DaveM made recently (now we > obey congestion window). We certainly got quite a bit > of a difference running SPECWeb etc, but that was on > the e1000s. A netperf single stream TCP test and something like SpecWEB are two different animals. The former rarely goes faster with TSO enabled simply because there is sufficient cpu and bus bandwidth to keep the card full. Whereas with something like SpecWEB the extra cpu and bus cycles are needed by other resources of the benchmark and thus performance goes up. I have no idea why people think TSO will make some single stream TCP test go faster, it doesn't buy you more bytes on the wire :-) ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 20:30 ` David S. Miller @ 2004-09-22 20:56 ` Nivedita Singhvi 0 siblings, 0 replies; 97+ messages in thread From: Nivedita Singhvi @ 2004-09-22 20:56 UTC (permalink / raw) To: David S. Miller; +Cc: ak, anton, netdev David S. Miller wrote: > The former rarely goes faster with TSO enabled simply > because there is sufficient cpu and bus bandwidth to > keep the card full. At about 700MHz (ok, old, I know) or thereabouts, it usually took > 1 CPU to drive a gigabit card, iirc, about 1.5 CPUs or so. > Whereas with something like SpecWEB the extra cpu and > bus cycles are needed by other resources of the benchmark > and thus performance goes up. Yep.. > I have no idea why people think TSO will make some single > stream TCP test go faster, it doesn't buy you more bytes > on the wire :-) True, if the card was already doing line speed, no, as you say, it won't help make the stack go faster :). If not, though, the gain in doing only one pass down the stack, one route lookup, etc. in place of multiple handoffs should help, correct? thanks, Nivedita ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 20:30 ` David S. Miller 2004-09-22 20:56 ` Nivedita Singhvi @ 2004-09-22 21:56 ` Andi Kleen 2004-09-22 22:04 ` David S. Miller 1 sibling, 1 reply; 97+ messages in thread From: Andi Kleen @ 2004-09-22 21:56 UTC (permalink / raw) To: David S. Miller; +Cc: Nivedita Singhvi, ak, anton, netdev > I have no idea why people think TSO will make some single > stream TCP test go faster, it doesn't buy you more bytes > on the wire :-) It definitely helps on 10Gb/s - without it the maximum single-stream rate is lower. I suspect with a slow bus you can also see advantages on 1Gb/s. -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 21:56 ` Andi Kleen @ 2004-09-22 22:04 ` David S. Miller 0 siblings, 0 replies; 97+ messages in thread From: David S. Miller @ 2004-09-22 22:04 UTC (permalink / raw) To: Andi Kleen; +Cc: niv, ak, anton, netdev On Wed, 22 Sep 2004 23:56:57 +0200 Andi Kleen <ak@suse.de> wrote: > I suspect with a slow bus you can also see advantages > on 1Gb/s. Yes, that's true. I can keep a gigabit line full on my slow 350MHZ sparc64 boxes except when the card is in a 33Mhz/32-bit PCI slot. ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 19:55 ` Andi Kleen 2004-09-22 20:07 ` Nivedita Singhvi @ 2004-09-22 20:12 ` Andrew Grover 2004-09-22 20:39 ` David S. Miller 2004-09-22 20:28 ` David S. Miller 2 siblings, 1 reply; 97+ messages in thread From: Andrew Grover @ 2004-09-22 20:12 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, anton, netdev On Wed, 22 Sep 2004 21:55:15 +0200, Andi Kleen <ak@suse.de> wrote: > With tso off i get the same performance as on 2.6.5. > > I must add that this is a CSA e1000 (directly integrated into > the chipset and doesn't use PCI) and TSO doesn't seem to be > bring any advantage. On 2.6.5 the performance is the same > with both TSO on or off. I think this is caused by the congestion changes added between 2.6.9-rc1 and rc2. I am seeing good performance with rc1 and tso on (or off), but bad performance with rc2 and bk-latest only if tso is on. Regards -- Andy ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 20:12 ` Andrew Grover @ 2004-09-22 20:39 ` David S. Miller 2004-09-22 22:06 ` Andi Kleen 0 siblings, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-22 20:39 UTC (permalink / raw) To: Andrew Grover; +Cc: ak, anton, netdev On Wed, 22 Sep 2004 13:12:47 -0700 Andrew Grover <andy.grover@gmail.com> wrote: > I think this is caused by the congestion changes added between > 2.6.9-rc1 and rc2. I am seeing good performance with rc1 and tso on > (or off), but bad performance with rc2 and bk-latest only if tso is > on. Yes, we know it is the packet counting change but those are pretty much necessary else TSO violates congestion window rules. There is a slight bug somewhere limiting performance, but we'll find it. The thing to watch if you're debugging this is what tp->packets_out is set to, and what happens in each tcp_snd_test() call. Also, watch tcp_cong_avoid() to make sure that tp->snd_cwnd is incrementing on every ACK and that tp->snd_cwnd_clamp is not some silly small value. Tracking tcp_snd_test() should tell you everything, watch what happens to the values of: tcp_packets_in_flight(tp); 'pkts' aka. TCP_SKB_CB(skb)->tso_factor after the possible tcp_set_skb_tso_factor() call tp->snd_cwnd One thing that could be biting us is the nagle check which probably needs to be adjusted to use the standard MSS not the TSO one... perhaps play with this patch: ===== include/net/tcp.h 1.88 vs edited ===== --- 1.88/include/net/tcp.h 2004-09-14 13:57:07 -07:00 +++ edited/include/net/tcp.h 2004-09-22 13:18:43 -07:00 @@ -1505,7 +1505,7 @@ * final FIN frame. -DaveM */ return (((nonagle&TCP_NAGLE_PUSH) || tp->urg_mode - || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) && + || !tcp_nagle_check(tp, skb, tp->mss_cache_std, nonagle)) && (((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) || (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) && !after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd)); ^ permalink raw reply [flat|nested] 97+ messages in thread
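One way to do the watching described above (a sketch, not from the thread): a rate-limited printk dropped into tcp_snd_test() just before its return, at the point where tp, skb and the computed pkts are all in scope:

/* Instrumentation sketch only; goes inside tcp_snd_test() in
 * include/net/tcp.h, after pkts has been computed. The
 * printk_ratelimit() check keeps the log readable under load.
 */
if (printk_ratelimit())
	printk(KERN_DEBUG "tcp_snd_test: in_flight=%u pkts=%u cwnd=%u packets_out=%u\n",
	       tcp_packets_in_flight(tp), pkts, tp->snd_cwnd,
	       tcp_get_pcount(&tp->packets_out));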
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 20:39 ` David S. Miller @ 2004-09-22 22:06 ` Andi Kleen 2004-09-22 22:25 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Andi Kleen @ 2004-09-22 22:06 UTC (permalink / raw) To: David S. Miller; +Cc: Andrew Grover, ak, anton, netdev > One thing that could be biting us is the nagle check which > probably needs to be adjusted to use the standard MSS not > the TSO one... perhaps play with this patch: I tried it and it breaks the performance completely (21MB/s) -Andi > > ===== include/net/tcp.h 1.88 vs edited ===== > --- 1.88/include/net/tcp.h 2004-09-14 13:57:07 -07:00 > +++ edited/include/net/tcp.h 2004-09-22 13:18:43 -07:00 > @@ -1505,7 +1505,7 @@ > * final FIN frame. -DaveM > */ > return (((nonagle&TCP_NAGLE_PUSH) || tp->urg_mode > - || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) && > + || !tcp_nagle_check(tp, skb, tp->mss_cache_std, nonagle)) && > (((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) || > (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) && > !after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd)); > > ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 22:06 ` Andi Kleen @ 2004-09-22 22:25 ` David S. Miller 2004-09-22 22:47 ` Andi Kleen 0 siblings, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-22 22:25 UTC (permalink / raw) To: Andi Kleen; +Cc: andy.grover, ak, anton, netdev On Thu, 23 Sep 2004 00:06:28 +0200 Andi Kleen <ak@suse.de> wrote: > > One thing that could be biting us is the nagle check which > > probably needs to be adjusted to use the standard MSS not > > the TSO one... perhaps play with this patch: > > I tried it and it breaks the performance completely > (21MB/s) You said you were getting 22MB/sec before the change, right? ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 22:25 ` David S. Miller @ 2004-09-22 22:47 ` Andi Kleen 2004-09-22 22:50 ` David S. Miller 2004-09-23 23:11 ` David S. Miller 0 siblings, 2 replies; 97+ messages in thread From: Andi Kleen @ 2004-09-22 22:47 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, andy.grover, anton, netdev On Wed, Sep 22, 2004 at 03:25:35PM -0700, David S. Miller wrote: > On Thu, 23 Sep 2004 00:06:28 +0200 > Andi Kleen <ak@suse.de> wrote: > > > > One thing that could be biting us is the nagle check which > > > probably needs to be adjusted to use the standard MSS not > > > the TSO one... perhaps play with this patch: > > > > I tried it and it breaks the performance completely > > (21MB/s) > > You said you were getting 22MB/sec before the change, right? 22-23MB/s, but not 21MB/s. Ok "breaks completely" was a bit of overstatement, admitted. -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 22:47 ` Andi Kleen @ 2004-09-22 22:50 ` David S. Miller 2004-09-23 23:11 ` David S. Miller 1 sibling, 0 replies; 97+ messages in thread From: David S. Miller @ 2004-09-22 22:50 UTC (permalink / raw) To: Andi Kleen; +Cc: ak, andy.grover, anton, netdev On Thu, 23 Sep 2004 00:47:32 +0200 Andi Kleen <ak@suse.de> wrote: > 22-23MB/s, but not 21MB/s. Ok "breaks completely" was a bit of overstatement, > admitted. Right. Can you do some of the tcp_snd_test() debugging I suggested above the patch you just tested? I really need help debugging this as I cannot reproduce it locally. ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-22 22:47 ` Andi Kleen 2004-09-22 22:50 ` David S. Miller @ 2004-09-23 23:11 ` David S. Miller 2004-09-23 23:41 ` Herbert Xu ` (2 more replies) 1 sibling, 3 replies; 97+ messages in thread From: David S. Miller @ 2004-09-23 23:11 UTC (permalink / raw) To: Andi Kleen, niv; +Cc: ak, andy.grover, anton, netdev I think I know what may be going on here. Let's say that we even get the congestion window opened up so that we can build 64K TSO frames; that's around 43 or 44 1500-MTU frames. That means as the window fills up, we have to see 44 ACKs before we are able to send the next TSO frame. Needless to say that breaks ACK clocking completely. And given that, getting 22MB/sec with TSO enabled is actually an impressive feat. I can't think of a fix I'm completely happy with. We could limit TSO to something like 2 or 4 normal MSS frames, but that negates much of the gain from TSO. But something like this is necessary to keep the pipe full. Anyways, for testing, something like the patch below. If things still stink a bit, try using a limit of "2" in this patch instead of "4". ===== net/ipv4/tcp_output.c 1.58 vs edited ===== --- 1.58/net/ipv4/tcp_output.c 2004-09-13 21:39:17 -07:00 +++ edited/net/ipv4/tcp_output.c 2004-09-23 15:51:51 -07:00 @@ -645,6 +645,12 @@ if (factor > tp->snd_cwnd) factor = tp->snd_cwnd; + /* Also, do not let it grow more than 4 frames + * so that ACK clocking continues to work. + */ + if (factor > 4) + factor = 4; + tp->mss_cache = mss_now * factor; } ^ permalink raw reply [flat|nested] 97+ messages in thread
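To make the numbers concrete, a toy calculation (assumptions: 1500-byte MTU, delayed ACKs acknowledging every second segment):

#include <stdio.h>

int main(void)
{
	unsigned int cwnd = 44;		/* packets; roughly one 64K TSO frame */

	/* Uncapped: a single 64K frame fills the whole window, so the
	 * sender stalls until ~44 ACKs (or ~22 delayed ACKs) return
	 * before the next frame can go out -- no ACK clocking. */
	printf("uncapped: %u frame(s) in flight\n", cwnd / 44);

	/* Capped at 4 segments per frame, as in the patch above: the
	 * same window holds many independent frames, so every returning
	 * ACK can release another one. */
	printf("capped: %u frames in flight\n", cwnd / 4);
	return 0;
}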
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-23 23:11 ` David S. Miller @ 2004-09-23 23:41 ` Herbert Xu 2004-09-23 23:41 ` David S. Miller 2004-09-24 8:30 ` Andi Kleen 2004-09-27 22:38 ` John Heffner 2 siblings, 1 reply; 97+ messages in thread From: Herbert Xu @ 2004-09-23 23:41 UTC (permalink / raw) To: David S. Miller; +Cc: ak, niv, andy.grover, anton, netdev David S. Miller <davem@davemloft.net> wrote: > > That means as the window fills up, we have to see 44 ACKs > before we are able to send the next TSO frame. Needless to > say that breaks ACK clocking completely. Hang on a second, the same problem would occur before the congestion changes were made, right? I thought Anton was saying that with the old kernels he was getting 100MB/s with TSO enabled... Anton, can you please get tcpdump to somehow show the length of the TSO packets so that we know what the factor is being set to? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-23 23:41 ` Herbert Xu @ 2004-09-23 23:41 ` David S. Miller 2004-09-24 0:12 ` Herbert Xu 2004-09-27 1:27 ` Herbert Xu 0 siblings, 2 replies; 97+ messages in thread From: David S. Miller @ 2004-09-23 23:41 UTC (permalink / raw) To: Herbert Xu; +Cc: ak, niv, andy.grover, anton, netdev On Fri, 24 Sep 2004 09:41:04 +1000 Herbert Xu <herbert@gondor.apana.org.au> wrote: > David S. Miller <davem@davemloft.net> wrote: > > > > That means as the window fills up, we have to see 44 ACKs > > before we are able to send the next TSO frame. Needless to > > say that breaks ACK clocking completely. > > Hang on a second, the same problem would occur before the congestion > changes were made, right? Previously we counted TSO frames as single "packets"; as long as we could fit one more frame into the congestion window, we'd spit out the whole TSO frame, and this kept the pipe full. ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-23 23:41 ` David S. Miller @ 2004-09-24 0:12 ` Herbert Xu 2004-09-24 0:40 ` Herbert Xu 0 siblings, 1 reply; 97+ messages in thread From: Herbert Xu @ 2004-09-24 0:12 UTC (permalink / raw) To: David S. Miller; +Cc: ak, niv, andy.grover, anton, netdev On Thu, Sep 23, 2004 at 04:41:49PM -0700, David S. Miller wrote: > > Previously we counted TSO frames as single "packets"; as long as > we could fit one more frame into the congestion window, we'd spit > out the whole TSO frame, and this kept the pipe full. I see. In that case what we've got now can't possibly work. How about this? We always treat TSO frames as one packet. We continue to obey the congestion window by starting with a TSO MSS that is small enough. We increase the TSO MSS as the congestion window goes up. That should work, no? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 97+ messages in thread
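A sketch of what Herbert is proposing (with hypothetical names -- not code from the thread): derive the TSO super-MSS from the window room available at send time, so it grows as snd_cwnd opens while each frame still counts as one packet.

/* Hypothetical helper, illustrative only. */
static unsigned int tso_mss_for_send(struct tcp_opt *tp,
				     unsigned int mss_std,
				     unsigned int large_mss)
{
	int room = (int)tp->snd_cwnd - (int)tcp_packets_in_flight(tp);
	unsigned int factor = room < 1 ? 1 : room;

	if (factor > large_mss / mss_std)
		factor = large_mss / mss_std;	/* never build past 64K */
	return factor * mss_std;		/* one "packet" for cwnd accounting */
}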
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-24 0:12 ` Herbert Xu @ 2004-09-24 0:40 ` Herbert Xu 2004-09-24 1:07 ` Herbert Xu 0 siblings, 1 reply; 97+ messages in thread From: Herbert Xu @ 2004-09-24 0:40 UTC (permalink / raw) To: David S. Miller; +Cc: ak, niv, andy.grover, anton, netdev On Fri, Sep 24, 2004 at 10:12:25AM +1000, herbert wrote: > > How about this? We always treat TSO frames as one packet. We > continue to obey the congestion window by starting with a TSO MSS > that is small enough. We increase the TSO MSS as the congestion > window goes up. > > That should work, no? Probably not. There are many things (such as snd_cwnd) in the stack that don't work properly when the MSS changes drastically like this. -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-24 0:40 ` Herbert Xu @ 2004-09-24 1:07 ` Herbert Xu 2004-09-24 1:17 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Herbert Xu @ 2004-09-24 1:07 UTC (permalink / raw) To: David S. Miller; +Cc: ak, niv, andy.grover, anton, netdev On Fri, Sep 24, 2004 at 10:40:38AM +1000, herbert wrote: > > Probably not. There are many things (such as snd_cwnd) in the stack > that don't work properly when the MSS changes drastically like this. Perhaps we should start counting in bytes instead of packets? -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-24 1:07 ` Herbert Xu @ 2004-09-24 1:17 ` David S. Miller 0 siblings, 0 replies; 97+ messages in thread From: David S. Miller @ 2004-09-24 1:17 UTC (permalink / raw) To: Herbert Xu; +Cc: ak, niv, andy.grover, anton, netdev On Fri, 24 Sep 2004 11:07:41 +1000 Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Fri, Sep 24, 2004 at 10:40:38AM +1000, herbert wrote: > > > > Probably not. There are many things (such as snd_cwnd) in the stack > > that don't work properly when the MSS changes drastically like this. > > Perhaps we should start counting in bytes instead of packets? No, because routers drop packets, not bytes :-) ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-23 23:41 ` David S. Miller 2004-09-24 0:12 ` Herbert Xu @ 2004-09-27 1:27 ` Herbert Xu 2004-09-27 2:50 ` Herbert Xu 1 sibling, 1 reply; 97+ messages in thread From: Herbert Xu @ 2004-09-27 1:27 UTC (permalink / raw) To: David S. Miller; +Cc: herbert, ak, niv, andy.grover, anton, netdev David S. Miller <davem@davemloft.net> wrote: > > Previously we counted TSO frames as single "packets"; as long as > we could fit one more frame into the congestion window, we'd spit > out the whole TSO frame, and this kept the pipe full. Indeed, I think the new code means that Minshall's check will disable Nagle, which is what was keeping TSO working properly. Anton, could you please try this patch, which disables Minshall's check, and see what it does? Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- ===== include/net/tcp.h 1.88 vs edited ===== --- 1.88/include/net/tcp.h 2004-09-15 06:57:07 +10:00 +++ edited/include/net/tcp.h 2004-09-27 11:25:56 +10:00 @@ -1461,8 +1461,7 @@ !(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) && ((nonagle&TCP_NAGLE_CORK) || (!nonagle && - tcp_get_pcount(&tp->packets_out) && - tcp_minshall_check(tp)))); + tcp_get_pcount(&tp->packets_out)))); } extern void tcp_set_skb_tso_factor(struct sk_buff *, unsigned int); ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-27 1:27 ` Herbert Xu @ 2004-09-27 2:50 ` Herbert Xu 2004-09-27 4:00 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Herbert Xu @ 2004-09-27 2:50 UTC (permalink / raw) To: David S. Miller; +Cc: ak, niv, andy.grover, anton, netdev On Mon, Sep 27, 2004 at 11:27:24AM +1000, Herbert Xu wrote: > > Indeed, I think the new code means that Minshall's check will disable > Nagle which is what was keeping TSO working properly. > > Anton, could you please try this patch which disables Minshall's check > and see what it does? Never mind. Minshall is innocent :) We set the maximum TSO factor bounded by the congestion window. But when the congestion window is raised, we don't call tcp_sync_mss which is the only place that can raise the TSO factor. So the TSO factor never grows above what we start with. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-27 2:50 ` Herbert Xu @ 2004-09-27 4:00 ` David S. Miller 2004-09-27 5:45 ` Herbert Xu 0 siblings, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-27 4:00 UTC (permalink / raw) To: Herbert Xu; +Cc: ak, niv, andy.grover, anton, netdev On Mon, 27 Sep 2004 12:50:48 +1000 Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Mon, Sep 27, 2004 at 11:27:24AM +1000, Herbert Xu wrote: > > > > Indeed, I think the new code means that Minshall's check will disable > > Nagle which is what was keeping TSO working properly. > > > > Anton, could you please try this patch which disables Minshall's check > > and see what it does? > > Never mind. Minshall is innocent :) > > We set the maximum TSO factor bounded by the congestion window. But > when the congestion window is raised, we don't call tcp_sync_mss > which is the only place that can raise the TSO factor. > > So the TSO factor never grows above what we start with. The very next time we check to see if we can make forward progress on the send queue we'll call tcp_current_mss() which causes the right things to happen. Something else is wrong. I think part of it is that we need to make tcp_clean_rtx_queue() return FLAG_DATA_ACKED even when a partial TSO packet is ACK'd by the other end. This will make RTO etc. calculations actually occur, among other things. But I have no idea if that will clear up the performance problems TSO is having with the new code. ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-27 4:00 ` David S. Miller @ 2004-09-27 5:45 ` Herbert Xu 2004-09-27 19:01 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Herbert Xu @ 2004-09-27 5:45 UTC (permalink / raw) To: David S. Miller; +Cc: ak, niv, andy.grover, anton, netdev On Sun, Sep 26, 2004 at 09:00:29PM -0700, David S. Miller wrote: > > The very next time we check to see if we can > make forward progress on the send queue we'll call tcp_current_mss() > which causes the right things to happen. tcp_current_mss() doesn't call tcp_sync_mss() unless the PMTU changes. > Something else is wrong. I think part of it is that we need to > make tcp_clean_rtx_queue() return FLAG_DATA_ACKED even when a > partial TSO packet is ACK'd by the other end. This will make > RTO etc. calculations actually occur, among other things. But > I have no idea if that will clear up the performance problems > TSO is having with the new code. There probably is something else wrong, since this should only limit the TSO size. -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 97+ messages in thread
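Concretely, the guard Herbert means looks like this (reproduced from the tcp_current_mss() body shown in the patches further down this thread); tcp_sync_mss() -- and with it the TSO factor -- is only recomputed on a PMTU or header-length change, never when snd_cwnd grows:

	if (dst) {
		u32 mtu = dst_pmtu(dst);
		if (mtu != tp->pmtu_cookie ||
		    tp->ext2_header_len != dst->header_len)
			mss_now = tcp_sync_mss(sk, mtu);
	}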
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-27 5:45 ` Herbert Xu @ 2004-09-27 19:01 ` David S. Miller 2004-09-27 21:32 ` Herbert Xu 0 siblings, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-27 19:01 UTC (permalink / raw) To: Herbert Xu; +Cc: ak, niv, andy.grover, anton, netdev On Mon, 27 Sep 2004 15:45:41 +1000 Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Sun, Sep 26, 2004 at 09:00:29PM -0700, David S. Miller wrote: > > > > The very next time we check to see if we can > > make forward progress on the send queue we'll call tcp_current_mss() > > which causes the right things to happen. > > tcp_current_mss() doesn't call tcp_sync_mss() unless the PMTU changes. Good catch, probably we should make it do so when sk_route_caps indicates we are doing TSO. ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-27 19:01 ` David S. Miller @ 2004-09-27 21:32 ` Herbert Xu 2004-09-28 21:10 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Herbert Xu @ 2004-09-27 21:32 UTC (permalink / raw) To: David S. Miller; +Cc: ak, niv, andy.grover, anton, netdev On Mon, Sep 27, 2004 at 12:01:54PM -0700, David S. Miller wrote: > > > tcp_current_mss() doesn't call tcp_sync_mss() unless the PMTU changes. > > Good catch, probably we should make it do so when sk_route_caps > indicates we are doing TSO. Alternatively we could move the TSO code out of tcp_sync_mss() and put it in tcp_current_mss() instead. It seems to be the only one using the factor anyway. -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-27 21:32 ` Herbert Xu @ 2004-09-28 21:10 ` David S. Miller 2004-09-28 21:34 ` Andi Kleen 0 siblings, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-28 21:10 UTC (permalink / raw) To: Herbert Xu; +Cc: ak, niv, andy.grover, anton, netdev [-- Attachment #1: Type: text/plain, Size: 1069 bytes --] On Tue, 28 Sep 2004 07:32:33 +1000 Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Mon, Sep 27, 2004 at 12:01:54PM -0700, David S. Miller wrote: > > > > > tcp_current_mss() doesn't call tcp_sync_mss() unless the PMTU changes. > > > > Good catch, probably we should make it do so when sk_route_caps > > indicates we are doing TSO. > > Alternatively we could move the TSO code out of tcp_sync_mss() and > put it in tcp_current_mss() instead. It seems to be the only one > using the factor anyway. Ok, here are 2 patches incorporating all of the things we discussed in this area: 1) Uninline tcp_current_mss(), fix tcp_sync_mss() return value to match tcp_current_mss()'s 2) Fix the do_large calculation bug in tcp_current_mss() as per Herbert's original patch. 3) Move TSO mss calculation work to tcp_current_mss(). We have to do something like this since tcp_sync_mss() is only invoked when the PMTU changes whereas the TSO MTU is dependant upon both the path and the current congestion window. So, this patch should wrap up these issues. [-- Attachment #2: diff1 --] [-- Type: application/octet-stream, Size: 3591 bytes --] # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2004/09/28 13:26:54-07:00 davem@nuts.davemloft.net # [TCP]: Uninline tcp_current_mss(). # # Also fix the return value of tcp_sync_mss() to # be unsigned. # # Signed-off-by: David S. Miller <davem@davemloft.net> # # net/ipv4/tcp_output.c # 2004/09/28 13:26:01-07:00 davem@nuts.davemloft.net +31 -1 # [TCP]: Uninline tcp_current_mss(). # # include/net/tcp.h # 2004/09/28 13:26:00-07:00 davem@nuts.davemloft.net +2 -32 # [TCP]: Uninline tcp_current_mss(). # diff -Nru a/include/net/tcp.h b/include/net/tcp.h --- a/include/net/tcp.h 2004-09-28 13:49:22 -07:00 +++ b/include/net/tcp.h 2004-09-28 13:49:22 -07:00 @@ -961,7 +961,8 @@ extern void tcp_delete_keepalive_timer (struct sock *); extern void tcp_reset_keepalive_timer (struct sock *, unsigned long); -extern int tcp_sync_mss(struct sock *sk, u32 pmtu); +extern unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu); +extern unsigned int tcp_current_mss(struct sock *sk, int large); extern const char timer_bug_msg[]; @@ -1033,37 +1034,6 @@ default: printk(timer_bug_msg); }; -} - -/* Compute the current effective MSS, taking SACKs and IP options, - * and even PMTU discovery events into account. - * - * LARGESEND note: !urg_mode is overkill, only frames up to snd_up - * cannot be large. However, taking into account rare use of URG, this - * is not a big flaw. - */ - -static inline unsigned int tcp_current_mss(struct sock *sk, int large) -{ - struct tcp_opt *tp = tcp_sk(sk); - struct dst_entry *dst = __sk_dst_get(sk); - int do_large, mss_now; - - do_large = (large && - (sk->sk_route_caps & NETIF_F_TSO) && - !tp->urg_mode); - mss_now = do_large ? tp->mss_cache : tp->mss_cache_std; - - if (dst) { - u32 mtu = dst_pmtu(dst); - if (mtu != tp->pmtu_cookie || - tp->ext2_header_len != dst->header_len) - mss_now = tcp_sync_mss(sk, mtu); - } - if (tp->eff_sacks) - mss_now -= (TCPOLEN_SACK_BASE_ALIGNED + - (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK)); - return mss_now; } /* Initialize RCV_MSS value. 
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c --- a/net/ipv4/tcp_output.c 2004-09-28 13:49:22 -07:00 +++ b/net/ipv4/tcp_output.c 2004-09-28 13:49:22 -07:00 @@ -603,7 +603,7 @@ this function. --ANK (980731) */ -int tcp_sync_mss(struct sock *sk, u32 pmtu) +unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu) { struct tcp_opt *tp = tcp_sk(sk); struct dst_entry *dst = __sk_dst_get(sk); @@ -661,6 +661,36 @@ return mss_now; } +/* Compute the current effective MSS, taking SACKs and IP options, + * and even PMTU discovery events into account. + * + * LARGESEND note: !urg_mode is overkill, only frames up to snd_up + * cannot be large. However, taking into account rare use of URG, this + * is not a big flaw. + */ + +unsigned int tcp_current_mss(struct sock *sk, int large) +{ + struct tcp_opt *tp = tcp_sk(sk); + struct dst_entry *dst = __sk_dst_get(sk); + int do_large, mss_now; + + do_large = (large && + (sk->sk_route_caps & NETIF_F_TSO) && + !tp->urg_mode); + mss_now = do_large ? tp->mss_cache : tp->mss_cache_std; + + if (dst) { + u32 mtu = dst_pmtu(dst); + if (mtu != tp->pmtu_cookie || + tp->ext2_header_len != dst->header_len) + mss_now = tcp_sync_mss(sk, mtu); + } + if (tp->eff_sacks) + mss_now -= (TCPOLEN_SACK_BASE_ALIGNED + + (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK)); + return mss_now; +} /* This routine writes packets to the network. It advances the * send_head. This happens as incoming acks open up the remote [-- Attachment #3: diff2 --] [-- Type: application/octet-stream, Size: 2712 bytes --] # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2004/09/28 13:46:58-07:00 davem@nuts.davemloft.net # [TCP]: Move TSO mss calcs to tcp_current_mss() # # Based upon a bug fix patch and suggestions from # Herbert Xu <herbert@gondor.apana.org.au> # # Signed-off-by: David S. Miller <davem@davemloft.net> # # net/ipv4/tcp_output.c # 2004/09/28 13:46:28-07:00 davem@nuts.davemloft.net +29 -24 # [TCP]: Move TSO mss calcs to tcp_current_mss() # # Based upon a bug fix patch and suggestions from # Herbert Xu <herbert@gondor.apana.org.au> # # Signed-off-by: David S. Miller <davem@davemloft.net> # diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c --- a/net/ipv4/tcp_output.c 2004-09-28 13:49:37 -07:00 +++ b/net/ipv4/tcp_output.c 2004-09-28 13:49:37 -07:00 @@ -639,25 +639,6 @@ tp->pmtu_cookie = pmtu; tp->mss_cache = tp->mss_cache_std = mss_now; - if (sk->sk_route_caps & NETIF_F_TSO) { - int large_mss, factor; - - large_mss = 65535 - tp->af_specific->net_header_len - - tp->ext_header_len - tp->ext2_header_len - tp->tcp_header_len; - - if (tp->max_window && large_mss > (tp->max_window>>1)) - large_mss = max((tp->max_window>>1), 68U - tp->tcp_header_len); - - /* Always keep large mss multiple of real mss, but - * do not exceed congestion window. - */ - factor = large_mss / mss_now; - if (factor > tp->snd_cwnd) - factor = tp->snd_cwnd; - - tp->mss_cache = mss_now * factor; - } - return mss_now; } @@ -675,17 +656,41 @@ struct dst_entry *dst = __sk_dst_get(sk); int do_large, mss_now; - do_large = (large && - (sk->sk_route_caps & NETIF_F_TSO) && - !tp->urg_mode); - mss_now = do_large ? 
tp->mss_cache : tp->mss_cache_std; - + mss_now = tp->mss_cache_std; if (dst) { u32 mtu = dst_pmtu(dst); if (mtu != tp->pmtu_cookie || tp->ext2_header_len != dst->header_len) mss_now = tcp_sync_mss(sk, mtu); } + + do_large = (large && + (sk->sk_route_caps & NETIF_F_TSO) && + !tp->urg_mode); + + if (do_large) { + int large_mss, factor; + + large_mss = 65535 - tp->af_specific->net_header_len - + tp->ext_header_len - tp->ext2_header_len - + tp->tcp_header_len; + + if (tp->max_window && large_mss > (tp->max_window>>1)) + large_mss = max((tp->max_window>>1), + 68U - tp->tcp_header_len); + + /* Always keep large mss multiple of real mss, but + * do not exceed congestion window. + */ + factor = large_mss / mss_now; + if (factor > tp->snd_cwnd) + factor = tp->snd_cwnd; + + tp->mss_cache = mss_now * factor; + + mss_now = tp->mss_cache; + } + if (tp->eff_sacks) mss_now -= (TCPOLEN_SACK_BASE_ALIGNED + (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK)); ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-28 21:10 ` David S. Miller @ 2004-09-28 21:34 ` Andi Kleen 2004-09-28 21:53 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Andi Kleen @ 2004-09-28 21:34 UTC (permalink / raw) To: David S. Miller; +Cc: Herbert Xu, ak, niv, andy.grover, anton, netdev I admit I lost track of all your patches now - can you give me a big diff against the latest BK so that I can check that the problem is gone for me too? Thanks, -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-28 21:34 ` Andi Kleen @ 2004-09-28 21:53 ` David S. Miller 2004-09-28 22:33 ` Andi Kleen 0 siblings, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-28 21:53 UTC (permalink / raw) To: Andi Kleen; +Cc: herbert, ak, niv, andy.grover, anton, netdev [-- Attachment #1: Type: text/plain, Size: 338 bytes --] On Tue, 28 Sep 2004 23:34:15 +0200 Andi Kleen <ak@suse.de> wrote: > I admit I lost track of all your patches now - can you give me a big > diff against the latest BK so that I can check that the problem > is gone for me too? Here are all of the pending TCP bug fixes, attached in order. I'll be pushing these to Linus some time today. [-- Attachment #2: diff1 --] [-- Type: application/octet-stream, Size: 8293 bytes --] # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2004/09/27 21:50:11-07:00 davem@nuts.davemloft.net # [TCP]: Fix congestion window expansion when using TSO. # # We only do congestion window expansion on full packet # ACKs. We should do it for ACKs of sub-packets of a # TSO frame as well. # # Signed-off-by: David S. Miller <davem@davemloft.net> # # net/ipv4/tcp_output.c # 2004/09/27 21:48:59-07:00 davem@nuts.davemloft.net +35 -2 # [TCP]: Fix congestion window expansion when using TSO. # # net/ipv4/tcp_input.c # 2004/09/27 21:48:59-07:00 davem@nuts.davemloft.net +85 -1 # [TCP]: Fix congestion window expansion when using TSO. # # include/net/tcp.h # 2004/09/27 21:48:59-07:00 davem@nuts.davemloft.net +2 -1 # [TCP]: Fix congestion window expansion when using TSO. # diff -Nru a/include/net/tcp.h b/include/net/tcp.h --- a/include/net/tcp.h 2004-09-28 14:30:28 -07:00 +++ b/include/net/tcp.h 2004-09-28 14:30:28 -07:00 @@ -1180,7 +1180,8 @@ __u16 urg_ptr; /* Valid w/URG flags is set. */ __u32 ack_seq; /* Sequence number ACK'd */ - __u32 tso_factor; + __u16 tso_factor; /* If > 1, TSO frame */ + __u16 tso_mss; /* MSS that FACTOR's in terms of*/ }; #define TCP_SKB_CB(__skb) ((struct tcp_skb_cb *)&((__skb)->cb[0])) diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c --- a/net/ipv4/tcp_input.c 2004-09-28 14:30:28 -07:00 +++ b/net/ipv4/tcp_input.c 2004-09-28 14:30:28 -07:00 @@ -2355,6 +2355,86 @@ } } +/* There is one downside to this scheme. Although we keep the + * ACK clock ticking, adjusting packet counters and advancing + * congestion window, we do not liberate socket send buffer + * space. + * + * Mucking with skb->truesize and sk->sk_wmem_alloc et al. + * then making a write space wakeup callback is a possible + * future enhancement. WARNING: it is not trivial to make. + */ +static int tcp_tso_acked(struct tcp_opt *tp, struct sk_buff *skb, + __u32 now, __s32 *seq_rtt) +{ + struct tcp_skb_cb *scb = TCP_SKB_CB(skb); + __u32 mss = scb->tso_mss; + __u32 snd_una = tp->snd_una; + __u32 seq = scb->seq; + __u32 packets_acked = 0; + int acked = 0; + + /* If we get here, the whole TSO packet has not been + * acked. + */ + BUG_ON(!after(scb->end_seq, snd_una)); + + while (!after(seq + mss, snd_una)) { + packets_acked++; + seq += mss; + } + + if (packets_acked) { + __u8 sacked = scb->sacked; + + /* We adjust scb->seq but we do not pskb_pull() the + * SKB. We let tcp_retransmit_skb() handle this case + * by checking skb->len against the data sequence span. + * This way, we avoid the pskb_pull() work unless we + * actually need to retransmit the SKB. 
+ */ + scb->seq = seq; + + acked |= FLAG_DATA_ACKED; + if (sacked) { + if (sacked & TCPCB_RETRANS) { + if (sacked & TCPCB_SACKED_RETRANS) + tcp_dec_pcount_explicit(&tp->retrans_out, + packets_acked); + acked |= FLAG_RETRANS_DATA_ACKED; + *seq_rtt = -1; + } else if (*seq_rtt < 0) + *seq_rtt = now - scb->when; + if (sacked & TCPCB_SACKED_ACKED) + tcp_dec_pcount_explicit(&tp->sacked_out, + packets_acked); + if (sacked & TCPCB_LOST) + tcp_dec_pcount_explicit(&tp->lost_out, + packets_acked); + if (sacked & TCPCB_URG) { + if (tp->urg_mode && + !before(scb->seq, tp->snd_up)) + tp->urg_mode = 0; + } + } else if (*seq_rtt < 0) + *seq_rtt = now - scb->when; + + if (tcp_get_pcount(&tp->fackets_out)) { + __u32 dval = min(tcp_get_pcount(&tp->fackets_out), + packets_acked); + tcp_dec_pcount_explicit(&tp->fackets_out, dval); + } + tcp_dec_pcount_explicit(&tp->packets_out, packets_acked); + scb->tso_factor -= packets_acked; + + BUG_ON(scb->tso_factor == 0); + BUG_ON(!before(scb->seq, scb->end_seq)); + } + + return acked; +} + + /* Remove acknowledged frames from the retransmission queue. */ static int tcp_clean_rtx_queue(struct sock *sk, __s32 *seq_rtt_p) { @@ -2373,8 +2453,12 @@ * discard it as it's confirmed to have arrived at * the other end. */ - if (after(scb->end_seq, tp->snd_una)) + if (after(scb->end_seq, tp->snd_una)) { + if (scb->tso_factor > 1) + acked |= tcp_tso_acked(tp, skb, + now, &seq_rtt); break; + } /* Initial outgoing SYN's get put onto the write_queue * just like anything else we transmit. It is not diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c --- a/net/ipv4/tcp_output.c 2004-09-28 14:30:28 -07:00 +++ b/net/ipv4/tcp_output.c 2004-09-28 14:30:28 -07:00 @@ -436,6 +436,7 @@ factor /= mss_std; TCP_SKB_CB(skb)->tso_factor = factor; } + TCP_SKB_CB(skb)->tso_mss = mss_std; } /* Function to create two new TCP segments. Shrinks the given segment @@ -552,7 +553,7 @@ return skb->tail; } -static int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len) +static int __tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len) { if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) @@ -565,11 +566,20 @@ return -ENOMEM; } - TCP_SKB_CB(skb)->seq += len; skb->ip_summed = CHECKSUM_HW; return 0; } +static inline int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len) +{ + int err = __tcp_trim_head(sk, skb, len); + + if (!err) + TCP_SKB_CB(skb)->seq += len; + + return err; +} + /* This function synchronize snd mss to current pmtu/exthdr set. tp->user_mss is mss set by user by TCP_MAXSEG. It does NOT counts @@ -949,6 +959,7 @@ { struct tcp_opt *tp = tcp_sk(sk); unsigned int cur_mss = tcp_current_mss(sk, 0); + __u32 data_seq, data_end_seq; int err; /* Do not sent more than we queued. 1/4 is reserved for possible @@ -958,6 +969,22 @@ min(sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf)) return -EAGAIN; + /* What is going on here? When TSO packets are partially ACK'd, + * we adjust the TCP_SKB_CB(skb)->seq value forward but we do + * not adjust the data area of the SKB. We defer that to here + * so that we can avoid the work unless we really retransmit + * the packet. 
+ */ + data_seq = TCP_SKB_CB(skb)->seq; + data_end_seq = TCP_SKB_CB(skb)->end_seq; + if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) + data_end_seq--; + + if (skb->len != (data_end_seq - data_seq)) { + if (__tcp_trim_head(sk, skb, data_end_seq - data_seq)) + return -ENOMEM; + } + if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) { if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) BUG(); @@ -1191,6 +1218,7 @@ TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_FIN); TCP_SKB_CB(skb)->sacked = 0; TCP_SKB_CB(skb)->tso_factor = 1; + TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std; /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */ TCP_SKB_CB(skb)->seq = tp->write_seq; @@ -1223,6 +1251,7 @@ TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_RST); TCP_SKB_CB(skb)->sacked = 0; TCP_SKB_CB(skb)->tso_factor = 1; + TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std; /* Send it off. */ TCP_SKB_CB(skb)->seq = tcp_acceptable_seq(sk, tp); @@ -1304,6 +1333,7 @@ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + 1; TCP_SKB_CB(skb)->sacked = 0; TCP_SKB_CB(skb)->tso_factor = 1; + TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std; th->seq = htonl(TCP_SKB_CB(skb)->seq); th->ack_seq = htonl(req->rcv_isn + 1); if (req->rcv_wnd == 0) { /* ignored for retransmitted syns */ @@ -1406,6 +1436,7 @@ TCP_ECN_send_syn(sk, tp, buff); TCP_SKB_CB(buff)->sacked = 0; TCP_SKB_CB(buff)->tso_factor = 1; + TCP_SKB_CB(buff)->tso_mss = tp->mss_cache_std; buff->csum = 0; TCP_SKB_CB(buff)->seq = tp->write_seq++; TCP_SKB_CB(buff)->end_seq = tp->write_seq; @@ -1506,6 +1537,7 @@ TCP_SKB_CB(buff)->flags = TCPCB_FLAG_ACK; TCP_SKB_CB(buff)->sacked = 0; TCP_SKB_CB(buff)->tso_factor = 1; + TCP_SKB_CB(buff)->tso_mss = tp->mss_cache_std; /* Send it off, this clears delayed acks for us. */ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(buff)->end_seq = tcp_acceptable_seq(sk, tp); @@ -1541,6 +1573,7 @@ TCP_SKB_CB(skb)->flags = TCPCB_FLAG_ACK; TCP_SKB_CB(skb)->sacked = urgent; TCP_SKB_CB(skb)->tso_factor = 1; + TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std; /* Use a previous sequence. This should cause the other * end to send an ack. Don't queue or clone SKB, just [-- Attachment #3: diff2 --] [-- Type: application/octet-stream, Size: 939 bytes --] # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2004/09/27 22:00:18-07:00 herbert@gondor.apana.org.au # [TCP]: Use mss_cache_std in tcp_init_metrics(). # # Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> # Signed-off-by: David S. Miller <davem@davemloft.net> # # net/ipv4/tcp_input.c # 2004/09/27 21:59:38-07:00 herbert@gondor.apana.org.au +2 -2 # [TCP]: Use mss_cache_std in tcp_init_metrics(). # diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c --- a/net/ipv4/tcp_input.c 2004-09-28 14:31:12 -07:00 +++ b/net/ipv4/tcp_input.c 2004-09-28 14:31:12 -07:00 @@ -802,10 +802,10 @@ __u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) : 0); if (!cwnd) { - if (tp->mss_cache > 1460) + if (tp->mss_cache_std > 1460) cwnd = 2; else - cwnd = (tp->mss_cache > 1095) ? 3 : 4; + cwnd = (tp->mss_cache_std > 1095) ? 3 : 4; } return min_t(__u32, cwnd, tp->snd_cwnd_clamp); } [-- Attachment #4: diff3 --] [-- Type: application/octet-stream, Size: 966 bytes --] # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2004/09/27 22:37:27-07:00 davem@nuts.davemloft.net # [TCP]: Fix third arg to __tcp_trim_head(). # # Noted by Herbert Xu <herbert@gondor.apana.org.au> # # Signed-off-by: David S. 
Miller <davem@davemloft.net> # # net/ipv4/tcp_output.c # 2004/09/27 22:36:41-07:00 davem@nuts.davemloft.net +4 -2 # [TCP]: Fix third arg to __tcp_trim_head(). # diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c --- a/net/ipv4/tcp_output.c 2004-09-28 14:31:40 -07:00 +++ b/net/ipv4/tcp_output.c 2004-09-28 14:31:40 -07:00 @@ -980,8 +980,10 @@ if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) data_end_seq--; - if (skb->len != (data_end_seq - data_seq)) { - if (__tcp_trim_head(sk, skb, data_end_seq - data_seq)) + if (skb->len > (data_end_seq - data_seq)) { + u32 to_trim = skb->len - (data_end_seq - data_seq); + + if (__tcp_trim_head(sk, skb, to_trim)) return -ENOMEM; } [-- Attachment #5: diff4 --] [-- Type: application/octet-stream, Size: 3591 bytes --] # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2004/09/28 13:26:54-07:00 davem@nuts.davemloft.net # [TCP]: Uninline tcp_current_mss(). # # Also fix the return value of tcp_sync_mss() to # be unsigned. # # Signed-off-by: David S. Miller <davem@davemloft.net> # # net/ipv4/tcp_output.c # 2004/09/28 13:26:01-07:00 davem@nuts.davemloft.net +31 -1 # [TCP]: Uninline tcp_current_mss(). # # include/net/tcp.h # 2004/09/28 13:26:00-07:00 davem@nuts.davemloft.net +2 -32 # [TCP]: Uninline tcp_current_mss(). # diff -Nru a/include/net/tcp.h b/include/net/tcp.h --- a/include/net/tcp.h 2004-09-28 14:32:07 -07:00 +++ b/include/net/tcp.h 2004-09-28 14:32:07 -07:00 @@ -961,7 +961,8 @@ extern void tcp_delete_keepalive_timer (struct sock *); extern void tcp_reset_keepalive_timer (struct sock *, unsigned long); -extern int tcp_sync_mss(struct sock *sk, u32 pmtu); +extern unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu); +extern unsigned int tcp_current_mss(struct sock *sk, int large); extern const char timer_bug_msg[]; @@ -1033,37 +1034,6 @@ default: printk(timer_bug_msg); }; -} - -/* Compute the current effective MSS, taking SACKs and IP options, - * and even PMTU discovery events into account. - * - * LARGESEND note: !urg_mode is overkill, only frames up to snd_up - * cannot be large. However, taking into account rare use of URG, this - * is not a big flaw. - */ - -static inline unsigned int tcp_current_mss(struct sock *sk, int large) -{ - struct tcp_opt *tp = tcp_sk(sk); - struct dst_entry *dst = __sk_dst_get(sk); - int do_large, mss_now; - - do_large = (large && - (sk->sk_route_caps & NETIF_F_TSO) && - !tp->urg_mode); - mss_now = do_large ? tp->mss_cache : tp->mss_cache_std; - - if (dst) { - u32 mtu = dst_pmtu(dst); - if (mtu != tp->pmtu_cookie || - tp->ext2_header_len != dst->header_len) - mss_now = tcp_sync_mss(sk, mtu); - } - if (tp->eff_sacks) - mss_now -= (TCPOLEN_SACK_BASE_ALIGNED + - (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK)); - return mss_now; } /* Initialize RCV_MSS value. diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c --- a/net/ipv4/tcp_output.c 2004-09-28 14:32:07 -07:00 +++ b/net/ipv4/tcp_output.c 2004-09-28 14:32:07 -07:00 @@ -603,7 +603,7 @@ this function. --ANK (980731) */ -int tcp_sync_mss(struct sock *sk, u32 pmtu) +unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu) { struct tcp_opt *tp = tcp_sk(sk); struct dst_entry *dst = __sk_dst_get(sk); @@ -661,6 +661,36 @@ return mss_now; } +/* Compute the current effective MSS, taking SACKs and IP options, + * and even PMTU discovery events into account. + * + * LARGESEND note: !urg_mode is overkill, only frames up to snd_up + * cannot be large. However, taking into account rare use of URG, this + * is not a big flaw. 
+ */ + +unsigned int tcp_current_mss(struct sock *sk, int large) +{ + struct tcp_opt *tp = tcp_sk(sk); + struct dst_entry *dst = __sk_dst_get(sk); + int do_large, mss_now; + + do_large = (large && + (sk->sk_route_caps & NETIF_F_TSO) && + !tp->urg_mode); + mss_now = do_large ? tp->mss_cache : tp->mss_cache_std; + + if (dst) { + u32 mtu = dst_pmtu(dst); + if (mtu != tp->pmtu_cookie || + tp->ext2_header_len != dst->header_len) + mss_now = tcp_sync_mss(sk, mtu); + } + if (tp->eff_sacks) + mss_now -= (TCPOLEN_SACK_BASE_ALIGNED + + (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK)); + return mss_now; +} /* This routine writes packets to the network. It advances the * send_head. This happens as incoming acks open up the remote [-- Attachment #6: diff5 --] [-- Type: application/octet-stream, Size: 2712 bytes --] # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2004/09/28 13:46:58-07:00 davem@nuts.davemloft.net # [TCP]: Move TSO mss calcs to tcp_current_mss() # # Based upon a bug fix patch and suggestions from # Herbert Xu <herbert@gondor.apana.org.au> # # Signed-off-by: David S. Miller <davem@davemloft.net> # # net/ipv4/tcp_output.c # 2004/09/28 13:46:28-07:00 davem@nuts.davemloft.net +29 -24 # [TCP]: Move TSO mss calcs to tcp_current_mss() # # Based upon a bug fix patch and suggestions from # Herbert Xu <herbert@gondor.apana.org.au> # # Signed-off-by: David S. Miller <davem@davemloft.net> # diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c --- a/net/ipv4/tcp_output.c 2004-09-28 14:32:34 -07:00 +++ b/net/ipv4/tcp_output.c 2004-09-28 14:32:34 -07:00 @@ -639,25 +639,6 @@ tp->pmtu_cookie = pmtu; tp->mss_cache = tp->mss_cache_std = mss_now; - if (sk->sk_route_caps & NETIF_F_TSO) { - int large_mss, factor; - - large_mss = 65535 - tp->af_specific->net_header_len - - tp->ext_header_len - tp->ext2_header_len - tp->tcp_header_len; - - if (tp->max_window && large_mss > (tp->max_window>>1)) - large_mss = max((tp->max_window>>1), 68U - tp->tcp_header_len); - - /* Always keep large mss multiple of real mss, but - * do not exceed congestion window. - */ - factor = large_mss / mss_now; - if (factor > tp->snd_cwnd) - factor = tp->snd_cwnd; - - tp->mss_cache = mss_now * factor; - } - return mss_now; } @@ -675,17 +656,41 @@ struct dst_entry *dst = __sk_dst_get(sk); int do_large, mss_now; - do_large = (large && - (sk->sk_route_caps & NETIF_F_TSO) && - !tp->urg_mode); - mss_now = do_large ? tp->mss_cache : tp->mss_cache_std; - + mss_now = tp->mss_cache_std; if (dst) { u32 mtu = dst_pmtu(dst); if (mtu != tp->pmtu_cookie || tp->ext2_header_len != dst->header_len) mss_now = tcp_sync_mss(sk, mtu); } + + do_large = (large && + (sk->sk_route_caps & NETIF_F_TSO) && + !tp->urg_mode); + + if (do_large) { + int large_mss, factor; + + large_mss = 65535 - tp->af_specific->net_header_len - + tp->ext_header_len - tp->ext2_header_len - + tp->tcp_header_len; + + if (tp->max_window && large_mss > (tp->max_window>>1)) + large_mss = max((tp->max_window>>1), + 68U - tp->tcp_header_len); + + /* Always keep large mss multiple of real mss, but + * do not exceed congestion window. + */ + factor = large_mss / mss_now; + if (factor > tp->snd_cwnd) + factor = tp->snd_cwnd; + + tp->mss_cache = mss_now * factor; + + mss_now = tp->mss_cache; + } + if (tp->eff_sacks) mss_now -= (TCPOLEN_SACK_BASE_ALIGNED + (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK)); ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-28 21:53 ` David S. Miller @ 2004-09-28 22:33 ` Andi Kleen 2004-09-28 22:57 ` David S. Miller 2004-09-29 3:27 ` John Heffner 0 siblings, 2 replies; 97+ messages in thread From: Andi Kleen @ 2004-09-28 22:33 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, herbert, niv, andy.grover, anton, netdev On Tue, Sep 28, 2004 at 02:53:45PM -0700, David S. Miller wrote: > On Tue, 28 Sep 2004 23:34:15 +0200 > Andi Kleen <ak@suse.de> wrote: > > > I admit I lost track of all your patches now - can you give me a big > > diff against the latest BK so that I can check that the problem > > is gone for me too? > > Here are all of the pending TCP bug fixes, attached in order. > I'll be pushing these to Linus some time today. I'm afraid I must report it's still not completely solved for me yet. 10s netperf with TSO on with your patches gives now ~10MB/s less than with TSO off (57 vs 67). It's better than before, but not really fixed yet. Looking at my tcpdumps and comparing TSO on/off I see a quite strange effect. It only acks on every ~25th packet with TSO off but every ~16th packet with TSO on. Receiver is a 2.6.5 kernel, it's weird that it violates the ack every two MSS rule. -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
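(Background on the "ack every two MSS" rule Andi refers to: RFC 1122 says a receiver should send an ACK for at least every second full-sized segment it receives. The following is only a simplified stand-alone sketch of what that receiver-side test amounts to, not the actual 2.6 __tcp_ack_snd_check() logic; all names are illustrative.)

#include <stdint.h>

/* Simplified sketch of the delayed-ACK bypass test: once more than
 * one estimated sender-MSS of data is pending beyond the last window
 * update we sent, an ACK should go out immediately instead of
 * waiting for the delayed-ACK timer.
 */
static int should_ack_now(uint32_t rcv_nxt,  /* next seq expected       */
                          uint32_t rcv_wup,  /* seq as of last ACK sent */
                          uint32_t rcv_mss)  /* estimated sender MSS    */
{
        return (rcv_nxt - rcv_wup) > rcv_mss;
}

(A receiver that instead ACKs only every ~16 or ~25 packets, as in the dumps above, is stretch-ACKing well beyond what this rule allows.)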
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28 22:33 ` Andi Kleen
@ 2004-09-28 22:57 ` David S. Miller
2004-09-28 23:27 ` Andi Kleen
2004-09-29 3:27 ` John Heffner
1 sibling, 1 reply; 97+ messages in thread
From: David S. Miller @ 2004-09-28 22:57 UTC (permalink / raw)
To: Andi Kleen; +Cc: ak, herbert, niv, andy.grover, anton, netdev

On Wed, 29 Sep 2004 00:33:44 +0200
Andi Kleen <ak@suse.de> wrote:

> Looking at my tcpdumps and comparing TSO on/off I see a quite
> strange effect. It only acks on every ~25th packet with TSO off
> but every ~16th packet with TSO on.
>
> Receiver is a 2.6.5 kernel, it's weird that it violates the
> ack every two MSS rule.

Your system is SMP and packet reordering is occurring? If so, you can
lock the interrupts of the card to a specific cpu to see if that makes
the problem go away.

Another possibility is tcpdump dropping some of the ACKs, or some bug
in the TCP code of your "interesting experiment based upon 2.6.5"
kernel :-)

On my sparc64 box here, TSO makes performance go up quite clearly, from
55MB/sec-->63MB/sec, and the sender is a tg3 card on a 32-bit/33Mhz bus.

^ permalink raw reply	[flat|nested] 97+ messages in thread
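(For readers trying this at home: "locking the interrupts of the card to a specific cpu" is done through the IRQ affinity mask. A minimal sketch, equivalent to echoing a CPU mask into /proc/irq/N/smp_affinity; the function name is made up, and you would look up the NIC's real IRQ number in /proc/interrupts first.)

#include <stdio.h>

/* Pin IRQ `irq` to CPU 0 by writing the one-bit CPU mask "1" to
 * /proc/irq/<irq>/smp_affinity.  Returns 0 on success. */
static int pin_irq_to_cpu0(int irq)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "1\n");      /* hex bitmask: CPU 0 only */
        return fclose(f);
}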
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-28 22:57 ` David S. Miller @ 2004-09-28 23:27 ` Andi Kleen 2004-09-28 23:35 ` David S. Miller 2004-09-29 20:58 ` John Heffner 0 siblings, 2 replies; 97+ messages in thread From: Andi Kleen @ 2004-09-28 23:27 UTC (permalink / raw) To: David S. Miller; +Cc: herbert, niv, andy.grover, anton, netdev On Tue, 28 Sep 2004 15:57:06 -0700 "David S. Miller" <davem@davemloft.net> wrote: > On Wed, 29 Sep 2004 00:33:44 +0200 > Andi Kleen <ak@suse.de> wrote: > > > Looking at my tcpdumps and comparing TSO on/off I see a quite > > strange effect. It only acks on every ~25th packet with TSO off > > but every ~16th packet with TSO on. > > > > Receiver is a 2.6.5 kernel, it's weird that it violates the > > ack every two MSS rule. > > Your system is SMP and packet reordering is occuring? The sender is SMP, but the receiver is UP. I tcpdumped at the receiver. > If so, you can lock the interrupts of the card > to a specific cpu to see if that makes the problem go > away. > > Another possibility is tcpdump dropping some of the ACKs Possible, unfortunately there are no counters for this (anyone on netdev motivated to add them to each packet drop in PF_PACKET and dev.c?) > or some bug in the TCP code of your "interesting experiment > based upon 2.6.5" kernel :-) Possible. I tried to re-test it but I didn't get very far because the kernel with your patch crashes regularly during netperf. No serial console, but the backtrace is {tcp_ack+877} {tcp_rcv_established+350} ... Before that there are a few "retrans out leaked" messages. -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-28 23:27 ` Andi Kleen @ 2004-09-28 23:35 ` David S. Miller 2004-09-28 23:55 ` Andi Kleen 2004-09-29 20:58 ` John Heffner 1 sibling, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-28 23:35 UTC (permalink / raw) To: Andi Kleen; +Cc: herbert, niv, andy.grover, anton, netdev On Wed, 29 Sep 2004 01:27:57 +0200 Andi Kleen <ak@suse.de> wrote: > I tried to re-test it but I didn't get very far because the kernel > with your patch crashes regularly during netperf. No serial console, > but the backtrace is {tcp_ack+877} {tcp_rcv_established+350} ... > Before that there are a few "retrans out leaked" messages. You might be missing the topmost part of the backtrace which is probably in tcp_clean_rtx_queue() or even more likely tcp_tso_acked() where there are several assertions present. I guess your compiler is auto-inlining those functions for you, can you reproduce the crash with perhaps the noinline attribute added to those two functions I mention above so we can get a precise backtrace? ^ permalink raw reply [flat|nested] 97+ messages in thread
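(The trick Dave describes: gcc's -funit-at-a-time mode may inline static functions on its own, so several logical frames collapse into one symbol, here tcp_ack, in the backtrace. Marking a function with gcc's noinline attribute keeps its own frame visible. A stand-alone illustration with a made-up function; in a kernel tree of that era the spelling would be the noinline macro from linux/compiler*.h.)

#include <stdio.h>

/* The attribute prevents gcc from inlining this into main(), so it
 * keeps its own stack frame and shows up by name in a backtrace. */
static __attribute__((noinline)) int demo_acked(int packets)
{
        return packets * 2;
}

int main(void)
{
        printf("%d\n", demo_acked(21));
        return 0;
}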
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-28 23:35 ` David S. Miller @ 2004-09-28 23:55 ` Andi Kleen 2004-09-29 0:04 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Andi Kleen @ 2004-09-28 23:55 UTC (permalink / raw) To: David S. Miller; +Cc: herbert, niv, andy.grover, anton, netdev On Tue, 28 Sep 2004 16:35:09 -0700 "David S. Miller" <davem@davemloft.net> wrote: > On Wed, 29 Sep 2004 01:27:57 +0200 > Andi Kleen <ak@suse.de> wrote: > > > I tried to re-test it but I didn't get very far because the kernel > > with your patch crashes regularly during netperf. No serial console, > > but the backtrace is {tcp_ack+877} {tcp_rcv_established+350} ... > > Before that there are a few "retrans out leaked" messages. > > You might be missing the topmost part of the backtrace which is probably > in tcp_clean_rtx_queue() or even more likely tcp_tso_acked() where there > are several assertions present. > > I guess your compiler is auto-inlining those functions for you, > can you reproduce the crash with perhaps the noinline attribute > added to those two functions I mention above so we can get a > precise backtrace? Hmpf, it stopped crashing after I recompiled without -funit-at-a-time Maybe it is some compiler issue. -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-28 23:55 ` Andi Kleen @ 2004-09-29 0:04 ` David S. Miller 0 siblings, 0 replies; 97+ messages in thread From: David S. Miller @ 2004-09-29 0:04 UTC (permalink / raw) To: Andi Kleen; +Cc: herbert, niv, andy.grover, anton, netdev On Wed, 29 Sep 2004 01:55:34 +0200 Andi Kleen <ak@suse.de> wrote: > Hmpf, it stopped crashing after I recompiled without -funit-at-a-time > Maybe it is some compiler issue. Could be, I was studying all of the retransmit queue state handling and tried to find some hole in the TSO case and I could not do so. ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-28 23:27 ` Andi Kleen 2004-09-28 23:35 ` David S. Miller @ 2004-09-29 20:58 ` John Heffner 2004-09-29 21:10 ` Nivedita Singhvi 1 sibling, 1 reply; 97+ messages in thread From: John Heffner @ 2004-09-29 20:58 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, herbert, niv, andy.grover, anton, netdev On Wed, 29 Sep 2004, Andi Kleen wrote: > I tried to re-test it but I didn't get very far because the kernel > with your patch crashes regularly during netperf. No serial console, > but the backtrace is {tcp_ack+877} {tcp_rcv_established+350} ... > Before that there are a few "retrans out leaked" messages. I just tried to re-create this situation, and I got the same problem. Latest bk tree with dave's 5 patches, gcc 3.3.4, uni-processor p4/e1000 to a slow-ish p3/sk98lin (dual-cpu but with maxcpus=1). -John ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-29 20:58 ` John Heffner @ 2004-09-29 21:10 ` Nivedita Singhvi 2004-09-29 21:50 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Nivedita Singhvi @ 2004-09-29 21:10 UTC (permalink / raw) To: John Heffner Cc: Andi Kleen, David S. Miller, herbert, andy.grover, anton, netdev John Heffner wrote: > On Wed, 29 Sep 2004, Andi Kleen wrote: > > >>I tried to re-test it but I didn't get very far because the kernel >>with your patch crashes regularly during netperf. No serial console, >>but the backtrace is {tcp_ack+877} {tcp_rcv_established+350} ... >>Before that there are a few "retrans out leaked" messages. > > > I just tried to re-create this situation, and I got the same problem. > Latest bk tree with dave's 5 patches, gcc 3.3.4, uni-processor p4/e1000 to > a slow-ish p3/sk98lin (dual-cpu but with maxcpus=1). I just crashed too, no backtrace. netperf tcp stream test, and was on bk14 + dave's 5 patches, p4/e1000 -> Intel Pentium M proc (1.7GHz). Going to repeat on slower SMPs with serial console, get more info.. thanks, Nivedita ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-29 21:10 ` Nivedita Singhvi @ 2004-09-29 21:50 ` David S. Miller 2004-09-29 21:56 ` Andi Kleen 0 siblings, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-29 21:50 UTC (permalink / raw) To: Nivedita Singhvi; +Cc: jheffner, ak, herbert, andy.grover, anton, netdev On Wed, 29 Sep 2004 14:10:24 -0700 Nivedita Singhvi <niv@us.ibm.com> wrote: > I just crashed too, no backtrace. netperf tcp stream test, > and was on bk14 + dave's 5 patches, p4/e1000 -> Intel Pentium > M proc (1.7GHz). Going to repeat on slower SMPs with serial > console, get more info.. I can reproduce this now, it has to do with some weird combinations of packet loss and SACK'ing. It's one of the BUG_ON() assertions triggering in tcp_tso_acked() as I suspected in Andi's first report. Working on a fix. ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-29 21:50 ` David S. Miller @ 2004-09-29 21:56 ` Andi Kleen 2004-09-29 23:29 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Andi Kleen @ 2004-09-29 21:56 UTC (permalink / raw) To: David S. Miller Cc: Nivedita Singhvi, jheffner, ak, herbert, andy.grover, anton, netdev On Wed, Sep 29, 2004 at 02:50:50PM -0700, David S. Miller wrote: > On Wed, 29 Sep 2004 14:10:24 -0700 > Nivedita Singhvi <niv@us.ibm.com> wrote: > > > I just crashed too, no backtrace. netperf tcp stream test, > > and was on bk14 + dave's 5 patches, p4/e1000 -> Intel Pentium > > M proc (1.7GHz). Going to repeat on slower SMPs with serial > > console, get more info.. > > I can reproduce this now, it has to do with some weird combinations > of packet loss and SACK'ing. It's one of the BUG_ON() assertions > triggering in tcp_tso_acked() as I suspected in Andi's first report. > > Working on a fix. Yes, it's a BUG. Here's a full oops I found from yesterday in some log. 2427 is BUG_ON(scb->tso_factor == 0); ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at tcp_input:2427 invalid operand: 0000 [1] SMP CPU 0 Modules linked in: Pid: 0, comm: swapper Not tainted 2.6.9-rc2-bk11 RIP: 0010:[<ffffffff8039638d>] <ffffffff8039638d>{tcp_ack+877} RSP: 0018:ffffffff8053e128 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 000001007df44a18 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000001007dd884f0 RDI: 000000004d083811 RBP: 000001007df44700 R08: 00000000000005a8 R09: 000000000000000c R10: ffffffff8053e138 R11: 0000000000000004 R12: 0000000000000000 R13: 0000000000000002 R14: 000001007df447b8 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffffffff805bd280(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000522000 CR3: 0000000000101000 CR4: 00000000000006e0 Process swapper (pid: 0, threadinfo ffffffff805c0000, task ffffffff8047c280) Stack: 0000000c0000000c 4d07f4314d083811 0000010000000001 000001007df44a18 000001007da6a034 000001007d64a080 000001007df44700 0000000000000020 0000000000000000 ffffffff8039a2be Call Trace:<IRQ> <ffffffff8039a2be>{tcp_rcv_established+350} <ffffffff80110ab5>{ret_from_intr+0} <ffffffff803a1ebf>{tcp_v4_do_rcv+63} <ffffffff803a27db>{tcp_v4_rcv+1659} <ffffffff802c7d80>{e1000_intr+1936} <ffffffff80117075>{timer_interrupt+1045} <ffffffff803884d1>{ip_local_deliver+193} <ffffffff8038836e>{ip_rcv+910} <ffffffff80375ddc>{netif_receive_skb+428} <ffffffff80375ea6>{process_backlog+150} <ffffffff80374fd4>{net_rx_action+132} <ffffffff8013d231>{__do_softirq+113} <ffffffff8013d2e5>{do_softirq+53} <ffffffff80113bef>{do_IRQ+335} <ffffffff80110ab5>{ret_from_intr+0} <EOI> <ffffffff8010f356>{mwait_idle+86} <ffffffff8010f7ad>{cpu_idle+29} <ffffffff805c3925>{start_kernel+485} <ffffffff805c31e0>{_sinittext+480} Code: 0f 0b 30 b8 45 80 ff ff ff ff 7b 09 8b 56 14 39 56 10 78 0c RIP <ffffffff8039638d>{tcp_ack+877} RSP <ffffffff8053e128> <0>Kernel panic - not syncing: Aiee, killing interrupt handler! ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 21:56 ` Andi Kleen
@ 2004-09-29 23:29 ` David S. Miller
2004-09-29 23:51 ` John Heffner
2004-09-30 0:05 ` Herbert Xu
0 siblings, 2 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-29 23:29 UTC (permalink / raw)
To: Andi Kleen; +Cc: niv, jheffner, ak, herbert, andy.grover, anton, netdev

Ok, found the bug. This bug was in the existing code and the
assertions in the new code were merely discovering it.

What happens is that we set TCP_SKB_CB(skb)->tso_factor, for
tcp_snd_test() or similar, for example. But this packet does not go
out, and tcp_sendmsg() or tcp_sendpages() tacks more data onto the
tail of the SKB without updating TCP_SKB_CB(skb)->tso_factor.

Simply setting the tso_factor to zero in these cases gets it
recalculated when it is actually needed, and reduces the number of
tso_factor recalculations. This works because by definition these
SKBs have not been sent for the first time yet. We never tack data
onto the end of retransmitted SKBs. And because of that invariant
there will be a tcp_set_skb_tso_factor() call before it gets to
tcp_transmit_skb() (which will BUG otherwise).

I've also audited other spots that don't update the tso_factor when
they should; tcp_trim_head() was another such spot.

Finally, I added an assertion to retransmit queue collapsing to make
sure we don't collapse SKBs with non-1 TSO factors. And the Solaris
FIN workaround pskb_trim() needs to reset the TSO factor as well.

Let me know if this cures the issue, and if it does we can move back
to Andi's performance issue and the MSS stuff John Heffner just
discovered.

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/09/29 16:09:18-07:00 davem@nuts.davemloft.net
#   [TCP]: Fix inaccuracies in tso_factor settings.
#
#   1) If tcp_{sendmsg,sendpage} tacks on more data to an
#      existing SKB, this can make tso_factor inaccurate.
#      Invalidate it, which forces it to be recalculated,
#      by simply setting it to zero.
#   2) __tcp_trim_head() changes skb->len, thus we need
#      to recalculate tso_factor
#   3) BUG check that tcp_retrans_try_collapse() does not
#      try to collapse packets with non-1 tso_factor
#   4) The Solaris FIN workaround in tcp_retransmit_skb()
#      changes packet size, need to fixup tso_factor
#
# Signed-off-by: David S. Miller <davem@davemloft.net>
#
# net/ipv4/tcp_output.c
#   2004/09/29 16:06:54-07:00 davem@nuts.davemloft.net +15 -5
#   [TCP]: Fix inaccuracies in tso_factor settings.
#
# net/ipv4/tcp.c
#   2004/09/29 16:06:54-07:00 davem@nuts.davemloft.net +2 -0
#   [TCP]: Fix inaccuracies in tso_factor settings.
# diff -Nru a/net/ipv4/tcp.c b/net/ipv4/tcp.c --- a/net/ipv4/tcp.c 2004-09-29 16:09:42 -07:00 +++ b/net/ipv4/tcp.c 2004-09-29 16:09:42 -07:00 @@ -691,6 +691,7 @@ skb->ip_summed = CHECKSUM_HW; tp->write_seq += copy; TCP_SKB_CB(skb)->end_seq += copy; + TCP_SKB_CB(skb)->tso_factor = 0; if (!copied) TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH; @@ -937,6 +938,7 @@ tp->write_seq += copy; TCP_SKB_CB(skb)->end_seq += copy; + TCP_SKB_CB(skb)->tso_factor = 0; from += copy; copied += copy; diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c --- a/net/ipv4/tcp_output.c 2004-09-29 16:09:42 -07:00 +++ b/net/ipv4/tcp_output.c 2004-09-29 16:09:42 -07:00 @@ -553,7 +553,7 @@ return skb->tail; } -static int __tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len) +static int __tcp_trim_head(struct tcp_opt *tp, struct sk_buff *skb, u32 len) { if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) @@ -567,12 +567,18 @@ } skb->ip_summed = CHECKSUM_HW; + + /* Any change of skb->len requires recalculation of tso + * factor and mss. + */ + tcp_set_skb_tso_factor(skb, tp->mss_cache_std); + return 0; } -static inline int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len) +static inline int tcp_trim_head(struct tcp_opt *tp, struct sk_buff *skb, u32 len) { - int err = __tcp_trim_head(sk, skb, len); + int err = __tcp_trim_head(tp, skb, len); if (!err) TCP_SKB_CB(skb)->seq += len; @@ -897,6 +903,9 @@ ((skb_size + next_skb_size) > mss_now)) return; + BUG_ON(TCP_SKB_CB(skb)->tso_factor != 1 || + TCP_SKB_CB(next_skb)->tso_factor != 1); + /* Ok. We will be able to collapse the packet. */ __skb_unlink(next_skb, next_skb->list); @@ -1018,7 +1027,7 @@ if (skb->len > (data_end_seq - data_seq)) { u32 to_trim = skb->len - (data_end_seq - data_seq); - if (__tcp_trim_head(sk, skb, to_trim)) + if (__tcp_trim_head(tp, skb, to_trim)) return -ENOMEM; } @@ -1032,7 +1041,7 @@ tp->mss_cache = tp->mss_cache_std; } - if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq)) + if (tcp_trim_head(tp, skb, tp->snd_una - TCP_SKB_CB(skb)->seq)) return -ENOMEM; } @@ -1080,6 +1089,7 @@ tp->snd_una == (TCP_SKB_CB(skb)->end_seq - 1)) { if (!pskb_trim(skb, 0)) { TCP_SKB_CB(skb)->seq = TCP_SKB_CB(skb)->end_seq - 1; + TCP_SKB_CB(skb)->tso_factor = 1; skb->ip_summed = CHECKSUM_NONE; skb->csum = 0; } ^ permalink raw reply [flat|nested] 97+ messages in thread
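(The invariant behind this fix: a tso_factor of zero marks the skb as "stale, recompute before transmit", and the recomputation itself is just a ceiling division of the payload length by the real MSS. A stand-alone sketch of that calculation; the tree's actual tcp_set_skb_tso_factor() may differ in detail.)

#include <stdint.h>

/* How many real MSS-sized segments does an skb carrying `len`
 * payload bytes represent?  1 for a normal packet, ceil(len/mss)
 * for a TSO super-packet.  Zero is never a valid result, which is
 * what makes it usable as the "recalculate me" marker above. */
static uint16_t tso_factor(uint32_t len, uint32_t mss_std)
{
        if (len <= mss_std)
                return 1;
        return (len + mss_std - 1) / mss_std;
}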
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-29 23:29 ` David S. Miller @ 2004-09-29 23:51 ` John Heffner 2004-09-30 0:03 ` David S. Miller 2004-09-30 0:10 ` John Heffner 2004-09-30 0:05 ` Herbert Xu 1 sibling, 2 replies; 97+ messages in thread From: John Heffner @ 2004-09-29 23:51 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, niv, herbert, andy.grover, anton, netdev On Wed, 29 Sep 2004, David S. Miller wrote: > Let me know if this cures the issue, and if it does we can > move back to Andi's performance issue and the MSS stuff > John Heffner just discovered. Seems to work for me. Using iperf, I'm getting ~ the same speed to a slow p3 receiver (680 Mbits) with TSO on or off right now. Haven't tried netperf. -John ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 23:51 ` John Heffner
@ 2004-09-30 0:03 ` David S. Miller
2004-09-30 0:10 ` Herbert Xu
0 siblings, 1 reply; 97+ messages in thread
From: David S. Miller @ 2004-09-30 0:03 UTC (permalink / raw)
To: John Heffner; +Cc: ak, niv, herbert, andy.grover, anton, netdev

On Wed, 29 Sep 2004 19:51:21 -0400 (EDT)
John Heffner <jheffner@psc.edu> wrote:

> On Wed, 29 Sep 2004, David S. Miller wrote:
>
> > Let me know if this cures the issue, and if it does we can
> > move back to Andi's performance issue and the MSS stuff
> > John Heffner just discovered.
>
> Seems to work for me.
>
> Using iperf, I'm getting ~ the same speed to a slow p3 receiver (680
> Mbits) with TSO on or off right now. Haven't tried netperf.

Great, thanks for testing. I just pushed these changes into my tree at:

	bk://kernel.bkbits.net/davem/net-2.6

and asked Linus to pull them in.

I think I'm going to make tcp_tso_acked() call tcp_trim_head()
directly so that skb->len and (end_seq - seq) are kept in sync. This
will also correct a bug in tcp_tso_acked() wrt. URG processing: it
uses the wrong sequence number currently. Luckily that code never runs
currently because all URG packets are built non-TSO. Better to fix
this than to let it bite us later.

I think there are some other things we can do to make TSO work even
better. We turn off TSO currently when we get SACKs; that stinks and
is really unnecessary. We can keep track of sacking of sub-TSO frames
by simply using a bitmask of some kind. I will have space for this if
I move the tso_factor/tso_mss out of tcp_skb_cb[] and just use the
tso_{size,segs} in skb_shinfo(skb).

Anyways, I'll work on that stuff while the dust settles on the current
bug fix.

^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-30 0:03 ` David S. Miller @ 2004-09-30 0:10 ` Herbert Xu 2004-10-01 0:34 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: Herbert Xu @ 2004-09-30 0:10 UTC (permalink / raw) To: David S. Miller; +Cc: John Heffner, ak, niv, andy.grover, anton, netdev On Wed, Sep 29, 2004 at 05:03:10PM -0700, David S. Miller wrote: > > I think there are some other things we can do to make TSO work > even better. We turn off TSO currently when we get SACKs, that Great. This way we can also fix the fack_count in tcp_sacktag_write_queue(). Currently it always counts tso_factor packets even though the sack may only cover part of the TSO packet. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-30 0:10 ` Herbert Xu
@ 2004-10-01 0:34 ` David S. Miller
2004-10-01 1:12 ` David S. Miller
0 siblings, 1 reply; 97+ messages in thread
From: David S. Miller @ 2004-10-01 0:34 UTC (permalink / raw)
To: Herbert Xu; +Cc: jheffner, ak, niv, andy.grover, anton, netdev

So I was wondering what in 2.6.5 vs. current 2.6.x TCP could be
tripping up Andi's setup, and I think I just found it.

If I disable /proc/sys/net/ipv4/tcp_moderate_rcvbuf, performance goes
down from ~634Mbit/sec to ~495Mbit/sec.

Andi, I know you said that with TSO disabled things go more smoothly.
But could you try upping the TCP socket receive buffer sizes on the
2.6.5 box to see if that gives you the performance back with TSO
enabled?

^ permalink raw reply	[flat|nested] 97+ messages in thread
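(On a 2.6.5 receiver there is no automatic receive-buffer moderation, so "upping the TCP socket receive buffer sizes" means either raising the net.ipv4.tcp_rmem sysctl or asking per socket. A minimal userspace sketch of the per-socket route; the 256KB figure is an arbitrary example, not a recommendation, and requests above net.core.rmem_max get clamped by the kernel.)

#include <sys/socket.h>
#include <stdio.h>

/* Ask for a larger TCP receive buffer; call before the window scale
 * is negotiated, i.e. before connect()/listen(). */
static int bump_rcvbuf(int fd)
{
        int bytes = 256 * 1024; /* example size only */

        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                       &bytes, sizeof(bytes)) < 0) {
                perror("setsockopt(SO_RCVBUF)");
                return -1;
        }
        return 0;
}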
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-10-01 0:34 ` David S. Miller @ 2004-10-01 1:12 ` David S. Miller 2004-10-01 3:40 ` David S. Miller 2004-10-01 10:23 ` Andi Kleen 0 siblings, 2 replies; 97+ messages in thread From: David S. Miller @ 2004-10-01 1:12 UTC (permalink / raw) To: David S. Miller; +Cc: herbert, jheffner, ak, niv, andy.grover, anton, netdev On Thu, 30 Sep 2004 17:34:39 -0700 "David S. Miller" <davem@davemloft.net> wrote: > If I disable /proc/sys/net/tcp_moderate_rcvbuf performance > goes down from ~634Mbit/sec to ~495Mbit/sec. > > Andi, I know you said that with TSO disabled things go > more smoothly. But could you try upping the TCP socket > receive buffer sizes on the 2.6.5 box to see if that gives > you the performance back with TSO enabled? Ok, here is something to play with. This adds a sysctl to moderate the percentage of the congestion window we'll limit TSO segmenting to. It defaults to 2, but setting of 3 or 4 seem to make Andi's case behave much better. With such small receive buffers, netperf simply can't clear the receive queue fast enough when a burst of TSO created frames come in. This is also where the stretch ACKs come from. We defer the ACK to recvmsg making progress, because we cannot advertise a larger window and thus the connection is application limited. I'm also thinking about whether this sysctl should be a divisor instead of a shift, and also whether it should be in terms of the snd_cwnd or the advertised receiver window whichever is smaller. Basically, receivers with too small socket receive buffers crap out if TSO bursts are too large. This effect is minimized the further the receiver is (rtt wise) from the sender since the path tends to smooth out the bursts. But on local gigabit lans, the effect is quite pronounced. Ironically, this case is a great example of how powerful and incredibly effective John's receive buffer moderation code is. 2.6.5 performance is severely hampered due to lack of this code. ===== include/linux/sysctl.h 1.88 vs edited ===== --- 1.88/include/linux/sysctl.h 2004-09-23 14:34:12 -07:00 +++ edited/include/linux/sysctl.h 2004-09-30 17:17:49 -07:00 @@ -341,6 +341,7 @@ NET_TCP_BIC_LOW_WINDOW=104, NET_TCP_DEFAULT_WIN_SCALE=105, NET_TCP_MODERATE_RCVBUF=106, + NET_TCP_TSO_CWND_SHIFT=107, }; enum { ===== include/net/tcp.h 1.92 vs edited ===== --- 1.92/include/net/tcp.h 2004-09-29 21:11:52 -07:00 +++ edited/include/net/tcp.h 2004-09-30 17:18:02 -07:00 @@ -609,6 +609,7 @@ extern int sysctl_tcp_bic_fast_convergence; extern int sysctl_tcp_bic_low_window; extern int sysctl_tcp_moderate_rcvbuf; +extern int sysctl_tcp_tso_cwnd_shift; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; ===== net/ipv4/sysctl_net_ipv4.c 1.25 vs edited ===== --- 1.25/net/ipv4/sysctl_net_ipv4.c 2004-08-26 13:55:36 -07:00 +++ edited/net/ipv4/sysctl_net_ipv4.c 2004-09-30 17:19:32 -07:00 @@ -674,6 +674,14 @@ .mode = 0644, .proc_handler = &proc_dointvec, }, + { + .ctl_name = NET_TCP_TSO_CWND_SHIFT, + .procname = "tcp_tso_cwnd_shift", + .data = &sysctl_tcp_tso_cwnd_shift, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, { .ctl_name = 0 } }; ===== net/ipv4/tcp_output.c 1.65 vs edited ===== --- 1.65/net/ipv4/tcp_output.c 2004-09-29 21:11:53 -07:00 +++ edited/net/ipv4/tcp_output.c 2004-09-30 17:27:32 -07:00 @@ -44,6 +44,7 @@ /* People can turn this off for buggy TCP's found in printers etc. 
*/ int sysctl_tcp_retrans_collapse = 1; +int sysctl_tcp_tso_cwnd_shift = 2; static __inline__ void update_send_head(struct sock *sk, struct tcp_opt *tp, struct sk_buff *skb) @@ -673,7 +674,7 @@ !tp->urg_mode); if (do_large) { - int large_mss, factor; + int large_mss, factor, limit; large_mss = 65535 - tp->af_specific->net_header_len - tp->ext_header_len - tp->ext2_header_len - @@ -688,8 +689,10 @@ * can keep the ACK clock ticking. */ factor = large_mss / mss_now; - if (factor > (tp->snd_cwnd >> 2)) - factor = max(1, tp->snd_cwnd >> 2); + limit = tp->snd_cwnd >> sysctl_tcp_tso_cwnd_shift; + limit = max(1, limit); + if (factor > limit) + factor = limit; tp->mss_cache = mss_now * factor; ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-10-01 1:12 ` David S. Miller @ 2004-10-01 3:40 ` David S. Miller 2004-10-01 10:35 ` Andi Kleen 2004-10-01 10:23 ` Andi Kleen 1 sibling, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-10-01 3:40 UTC (permalink / raw) To: David S. Miller; +Cc: herbert, jheffner, ak, niv, andy.grover, anton, netdev On Thu, 30 Sep 2004 18:12:48 -0700 "David S. Miller" <davem@davemloft.net> wrote: > Ok, here is something to play with. This adds a sysctl > to moderate the percentage of the congestion window we'll > limit TSO segmenting to. I've done some tweaking and this is the patch I actually checked into my tree. I made it a divisor and the default is 8. I tried to play around with taking the send window and the congestion window both into account, but that did not help at all. My current setup is Ultra-III 750Mhz w/tg3 sending to Ultra-II 360Mhz w/tg3 through a D-Link DGS 1008-T gigabit switch. I'm using 32-bit binaries of netperf 2.3pl1 built with -DUSE_PROC_STAT and -DHAVE_SENDFILE. The MTU being used is 1500. Each run is made via "netperf -fM -H ${IP_OF_ULTRA-II}". I did 3 runs each for 4 different configurations. The parameters are "TSO on/off" (sender side) and "TCP rcvbuf moderation on/off" (receiver side). With this patch I'm seeing these results: TSO off + rbuf off: 63.15 MBytes/sec 64.78 MBytes/sec 64.53 MBytes/sec TSO on + rbuf off: 62.76 MBytes/sec 63.36 MBytes/sec 63.79 MBytes/sec TSO off + rbuf on: 71.98 MBytes/sec 73.52 MBytes/sec 73.57 MBytes/sec TSO on + rbuf on: 75.70 MBytes/sec 76.05 MBytes/sec 75.42 MBytes/sec The "rbuf off" cases are meant to emulate Andi's 2.6.5 case, and "rbuf on" is current 2.6.x. How do things look for you with this change Andi? If things are still out of whack, play around with different values of /proc/sys/net/ipv4/tcp_tso_win_divisor # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2004/09/30 20:09:28-07:00 davem@nuts.davemloft.net # [TCP]: Add tcp_tso_win_divisor sysctl. # # This allows control over what percentage of # the congestion window can be consumed by a # single TSO frame. # # The setting of this parameter is a choice # between burstiness and building larger TSO # frames. # # Signed-off-by: David S. Miller <davem@davemloft.net> # # net/ipv4/tcp_output.c # 2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +19 -7 # [TCP]: Add tcp_tso_win_divisor sysctl. # # net/ipv4/sysctl_net_ipv4.c # 2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +8 -0 # [TCP]: Add tcp_tso_win_divisor sysctl. # # include/net/tcp.h # 2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +1 -0 # [TCP]: Add tcp_tso_win_divisor sysctl. # # include/linux/sysctl.h # 2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +1 -0 # [TCP]: Add tcp_tso_win_divisor sysctl. 
# diff -Nru a/include/linux/sysctl.h b/include/linux/sysctl.h --- a/include/linux/sysctl.h 2004-09-30 20:19:49 -07:00 +++ b/include/linux/sysctl.h 2004-09-30 20:19:49 -07:00 @@ -341,6 +341,7 @@ NET_TCP_BIC_LOW_WINDOW=104, NET_TCP_DEFAULT_WIN_SCALE=105, NET_TCP_MODERATE_RCVBUF=106, + NET_TCP_TSO_WIN_DIVISOR=107, }; enum { diff -Nru a/include/net/tcp.h b/include/net/tcp.h --- a/include/net/tcp.h 2004-09-30 20:19:49 -07:00 +++ b/include/net/tcp.h 2004-09-30 20:19:49 -07:00 @@ -609,6 +609,7 @@ extern int sysctl_tcp_bic_fast_convergence; extern int sysctl_tcp_bic_low_window; extern int sysctl_tcp_moderate_rcvbuf; +extern int sysctl_tcp_tso_win_divisor; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; diff -Nru a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c --- a/net/ipv4/sysctl_net_ipv4.c 2004-09-30 20:19:49 -07:00 +++ b/net/ipv4/sysctl_net_ipv4.c 2004-09-30 20:19:49 -07:00 @@ -674,6 +674,14 @@ .mode = 0644, .proc_handler = &proc_dointvec, }, + { + .ctl_name = NET_TCP_TSO_WIN_DIVISOR, + .procname = "tcp_tso_win_divisor", + .data = &sysctl_tcp_tso_win_divisor, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, { .ctl_name = 0 } }; diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c --- a/net/ipv4/tcp_output.c 2004-09-30 20:19:49 -07:00 +++ b/net/ipv4/tcp_output.c 2004-09-30 20:19:49 -07:00 @@ -45,6 +45,12 @@ /* People can turn this off for buggy TCP's found in printers etc. */ int sysctl_tcp_retrans_collapse = 1; +/* This limits the percentage of the congestion window which we + * will allow a single TSO frame to consume. Building TSO frames + * which are too large can cause TCP streams to be bursty. + */ +int sysctl_tcp_tso_win_divisor = 8; + static __inline__ void update_send_head(struct sock *sk, struct tcp_opt *tp, struct sk_buff *skb) { @@ -658,7 +664,7 @@ { struct tcp_opt *tp = tcp_sk(sk); struct dst_entry *dst = __sk_dst_get(sk); - int do_large, mss_now; + unsigned int do_large, mss_now; mss_now = tp->mss_cache_std; if (dst) { @@ -673,7 +679,7 @@ !tp->urg_mode); if (do_large) { - int large_mss, factor; + unsigned int large_mss, factor, limit; large_mss = 65535 - tp->af_specific->net_header_len - tp->ext_header_len - tp->ext2_header_len - @@ -683,13 +689,19 @@ large_mss = max((tp->max_window>>1), 68U - tp->tcp_header_len); + factor = large_mss / mss_now; + /* Always keep large mss multiple of real mss, but - * do not exceed 1/4 of the congestion window so we - * can keep the ACK clock ticking. + * do not exceed 1/tso_win_divisor of the congestion window + * so we can keep the ACK clock ticking and minimize + * bursting. */ - factor = large_mss / mss_now; - if (factor > (tp->snd_cwnd >> 2)) - factor = max(1, tp->snd_cwnd >> 2); + limit = tp->snd_cwnd; + if (sysctl_tcp_tso_win_divisor) + limit /= sysctl_tcp_tso_win_divisor; + limit = max(1U, limit); + if (factor > limit) + factor = limit; tp->mss_cache = mss_now * factor; ^ permalink raw reply [flat|nested] 97+ messages in thread
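(To make the divisor concrete, here is the arithmetic the patched tcp_current_mss() performs, pulled out into a stand-alone program. The inputs are assumed example values: a 1448-byte MSS with timestamps, and a congestion window of 40 packets. With the default divisor of 8, a TSO frame is capped at 5 segments instead of the 45 that would fit under 65535 bytes.)

#include <stdio.h>

int main(void)
{
        unsigned int mss_now   = 1448;  /* assumed: 1500 MTU, timestamps */
        unsigned int large_mss = 65483; /* assumed: 65535 - header costs */
        unsigned int snd_cwnd  = 40;    /* assumed congestion window     */
        unsigned int divisor   = 8;     /* tcp_tso_win_divisor default   */

        unsigned int factor = large_mss / mss_now;   /* 45 segments */
        unsigned int limit  = snd_cwnd / divisor;    /* 5 segments  */

        if (limit < 1)
                limit = 1;
        if (factor > limit)
                factor = limit;

        /* 5 * 1448 = 7240 bytes handed to the NIC per TSO frame */
        printf("TSO frame: %u segments, %u bytes\n",
               factor, factor * mss_now);
        return 0;
}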
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-10-01 3:40 ` David S. Miller
@ 2004-10-01 10:35 ` Andi Kleen
0 siblings, 0 replies; 97+ messages in thread
From: Andi Kleen @ 2004-10-01 10:35 UTC (permalink / raw)
To: David S. Miller; +Cc: herbert, jheffner, ak, niv, andy.grover, anton, netdev

> How do things look for you with this change Andi?
> If things are still out of whack, play around with
> different values of /proc/sys/net/ipv4/tcp_tso_win_divisor

Still slower, like previously reported. But I tried tweaking the
sysctl now. The result is that 2 is pretty good (only 3MB/s slower)
and >20 is also pretty good (2MB/s slower). Everything in between is
a lot slower, varying a bit.

I wasn't able to find a setting that gave the same results as TSO off
though, although the difference is not that dramatic anymore.

-Andi

^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-10-01 1:12 ` David S. Miller
2004-10-01 3:40 ` David S. Miller
@ 2004-10-01 10:23 ` Andi Kleen
1 sibling, 0 replies; 97+ messages in thread
From: Andi Kleen @ 2004-10-01 10:23 UTC (permalink / raw)
To: David S. Miller; +Cc: herbert, jheffner, ak, niv, andy.grover, anton, netdev

> With such small receive buffers, netperf simply can't clear
> the receive queue fast enough when a burst of TSO created
> frames come in.

I increased the receive buffers on the target and the difference
between TSO and non-TSO is much less now (only 5MB/s instead of
20MB/s). Your theory seems to make some sense.

-Andi

^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-29 23:51 ` John Heffner 2004-09-30 0:03 ` David S. Miller @ 2004-09-30 0:10 ` John Heffner 2004-09-30 17:25 ` John Heffner 1 sibling, 1 reply; 97+ messages in thread From: John Heffner @ 2004-09-30 0:10 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, niv, herbert, andy.grover, anton, netdev On Wed, 29 Sep 2004, John Heffner wrote: > Using iperf, I'm getting ~ the same speed to a slow p3 receiver (680 > Mbits) with TSO on or off right now. Haven't tried netperf. Netperf does not work well for me (350 Mbits). Something to investigate. -John ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-30 0:10 ` John Heffner @ 2004-09-30 17:25 ` John Heffner 2004-09-30 20:23 ` David S. Miller 0 siblings, 1 reply; 97+ messages in thread From: John Heffner @ 2004-09-30 17:25 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, niv, herbert, andy.grover, anton, netdev On Wed, 29 Sep 2004, John Heffner wrote: > On Wed, 29 Sep 2004, John Heffner wrote: > > > Using iperf, I'm getting ~ the same speed to a slow p3 receiver (680 > > Mbits) with TSO on or off right now. Haven't tried netperf. > > Netperf does not work well for me (350 Mbits). Something to investigate. I tried again, and it does not even work this well. The connection hangs at the end in state FIN_WAIT1 with 101361 bytes in the Send-Q. It seems that the whole interface dies. (Maybe it's sending something invalid to the TSO engine?) When I bring the interface down then back up again, the connection terminates normally. Don't have this problem with iperf, strange. -John ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-30 17:25 ` John Heffner @ 2004-09-30 20:23 ` David S. Miller 0 siblings, 0 replies; 97+ messages in thread From: David S. Miller @ 2004-09-30 20:23 UTC (permalink / raw) To: John Heffner; +Cc: ak, niv, herbert, andy.grover, anton, netdev On Thu, 30 Sep 2004 13:25:46 -0400 (EDT) John Heffner <jheffner@psc.edu> wrote: > On Wed, 29 Sep 2004, John Heffner wrote: > > > On Wed, 29 Sep 2004, John Heffner wrote: > > > > > Using iperf, I'm getting ~ the same speed to a slow p3 receiver (680 > > > Mbits) with TSO on or off right now. Haven't tried netperf. > > > > Netperf does not work well for me (350 Mbits). Something to investigate. > > I tried again, and it does not even work this well. The connection hangs > at the end in state FIN_WAIT1 with 101361 bytes in the Send-Q. It seems > that the whole interface dies. (Maybe it's sending something invalid to > the TSO engine?) When I bring the interface down then back up again, the > connection terminates normally. > > Don't have this problem with iperf, strange. Even stranger is that with current netperf tg3-->tg3 works perfectly fine for me with TSO enabled. I'm getting clean 700Mbit transfers through my D-Link DGS-1008T switch as long as netperf doesn't do something silly like use tiny send/receiver buffers. I've never had a transfer stall either. Please help debug this John since I'm not seeing this here. ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 23:29 ` David S. Miller
2004-09-29 23:51 ` John Heffner
@ 2004-09-30 0:05 ` Herbert Xu
2004-09-30 4:33 ` David S. Miller
1 sibling, 1 reply; 97+ messages in thread
From: Herbert Xu @ 2004-09-30 0:05 UTC (permalink / raw)
To: David S. Miller; +Cc: Andi Kleen, niv, jheffner, andy.grover, anton, netdev

On Wed, Sep 29, 2004 at 04:29:23PM -0700, David S. Miller wrote:
>
> @@ -567,12 +567,18 @@
> 	}
>
> 	skb->ip_summed = CHECKSUM_HW;
> +
> +	/* Any change of skb->len requires recalculation of tso
> +	 * factor and mss.
> +	 */
> +	tcp_set_skb_tso_factor(skb, tp->mss_cache_std);

Minor optimisation: __tcp_trim_head is only called directly when
tso_factor has already been adjusted by tcp_tso_acked. So you can
move this setting into tcp_trim_head.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-30 0:05 ` Herbert Xu @ 2004-09-30 4:33 ` David S. Miller 2004-09-30 5:47 ` Herbert Xu 2004-09-30 9:29 ` Andi Kleen 0 siblings, 2 replies; 97+ messages in thread From: David S. Miller @ 2004-09-30 4:33 UTC (permalink / raw) To: Herbert Xu; +Cc: ak, niv, jheffner, andy.grover, anton, netdev On Thu, 30 Sep 2004 10:05:15 +1000 Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Wed, Sep 29, 2004 at 04:29:23PM -0700, David S. Miller wrote: > > > > @@ -567,12 +567,18 @@ > > skb->ip_summed = CHECKSUM_HW; > > + > > + /* Any change of skb->len requires recalculation of tso > > + * factor and mss. > > + */ > > + tcp_set_skb_tso_factor(skb, tp->mss_cache_std); > > Minor optimsations: __tcp_trim_head is only called directly when > tso_factor has already been adjusted by tcp_tso_acked. So you can > move this setting into tcp_trim_head. Right. This patch below combines that with adjustment of socket send queue usage when we trim the head. I also added John Heffner's snd_cwnd TSO factor tweak. I adjusted it down to 1/4 of the congestion window because it gave the best ramp-up performance for a cross-continental transfer test. John, this might make your netperf case go better. Give it a try and let me know how it goes. # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2004/09/29 21:12:18-07:00 davem@nuts.davemloft.net # [TCP]: Smooth out TSO ack clocking. # # - Export tcp_trim_head() and call it directly from # tcp_tso_acked(). This also fixes URG handling. # # - Make tcp_trim_head() adjust the skb->truesize of # the packet and liberate that space from the socket # send buffer. # # - In tcp_current_mss(), limit TSO factor to 1/4 of # snd_cwnd. The idea is from John Heffner. # # Signed-off-by: David S. Miller <davem@davemloft.net> # # net/ipv4/tcp_output.c # 2004/09/29 21:11:53-07:00 davem@nuts.davemloft.net +15 -35 # [TCP]: Smooth out TSO ack clocking. # # - Export tcp_trim_head() and call it directly from # tcp_tso_acked(). This also fixes URG handling. # # - Make tcp_trim_head() adjust the skb->truesize of # the packet and liberate that space from the socket # send buffer. # # - In tcp_current_mss(), limit TSO factor to 1/4 of # snd_cwnd. The idea is from John Heffner. # # Signed-off-by: David S. Miller <davem@davemloft.net> # # net/ipv4/tcp_input.c # 2004/09/29 21:11:53-07:00 davem@nuts.davemloft.net +9 -13 # [TCP]: Smooth out TSO ack clocking. # # - Export tcp_trim_head() and call it directly from # tcp_tso_acked(). This also fixes URG handling. # # - Make tcp_trim_head() adjust the skb->truesize of # the packet and liberate that space from the socket # send buffer. # # - In tcp_current_mss(), limit TSO factor to 1/4 of # snd_cwnd. The idea is from John Heffner. # # Signed-off-by: David S. Miller <davem@davemloft.net> # # include/net/tcp.h # 2004/09/29 21:11:52-07:00 davem@nuts.davemloft.net +1 -0 # [TCP]: Smooth out TSO ack clocking. # # - Export tcp_trim_head() and call it directly from # tcp_tso_acked(). This also fixes URG handling. # # - Make tcp_trim_head() adjust the skb->truesize of # the packet and liberate that space from the socket # send buffer. # # - In tcp_current_mss(), limit TSO factor to 1/4 of # snd_cwnd. The idea is from John Heffner. # # Signed-off-by: David S. 
Miller <davem@davemloft.net> # diff -Nru a/include/net/tcp.h b/include/net/tcp.h --- a/include/net/tcp.h 2004-09-29 21:12:59 -07:00 +++ b/include/net/tcp.h 2004-09-29 21:12:59 -07:00 @@ -944,6 +944,7 @@ extern int tcp_retransmit_skb(struct sock *, struct sk_buff *); extern void tcp_xmit_retransmit_queue(struct sock *); extern void tcp_simple_retransmit(struct sock *); +extern int tcp_trim_head(struct sock *, struct sk_buff *, u32); extern void tcp_send_probe0(struct sock *); extern void tcp_send_partial(struct sock *); diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c --- a/net/ipv4/tcp_input.c 2004-09-29 21:12:59 -07:00 +++ b/net/ipv4/tcp_input.c 2004-09-29 21:12:59 -07:00 @@ -2364,13 +2364,14 @@ * then making a write space wakeup callback is a possible * future enhancement. WARNING: it is not trivial to make. */ -static int tcp_tso_acked(struct tcp_opt *tp, struct sk_buff *skb, +static int tcp_tso_acked(struct sock *sk, struct sk_buff *skb, __u32 now, __s32 *seq_rtt) { + struct tcp_opt *tp = tcp_sk(sk); struct tcp_skb_cb *scb = TCP_SKB_CB(skb); __u32 mss = scb->tso_mss; __u32 snd_una = tp->snd_una; - __u32 seq = scb->seq; + __u32 orig_seq, seq; __u32 packets_acked = 0; int acked = 0; @@ -2379,22 +2380,18 @@ */ BUG_ON(!after(scb->end_seq, snd_una)); + seq = orig_seq = scb->seq; while (!after(seq + mss, snd_una)) { packets_acked++; seq += mss; } + if (tcp_trim_head(sk, skb, (seq - orig_seq))) + return 0; + if (packets_acked) { __u8 sacked = scb->sacked; - /* We adjust scb->seq but we do not pskb_pull() the - * SKB. We let tcp_retransmit_skb() handle this case - * by checking skb->len against the data sequence span. - * This way, we avoid the pskb_pull() work unless we - * actually need to retransmit the SKB. - */ - scb->seq = seq; - acked |= FLAG_DATA_ACKED; if (sacked) { if (sacked & TCPCB_RETRANS) { @@ -2413,7 +2410,7 @@ packets_acked); if (sacked & TCPCB_URG) { if (tp->urg_mode && - !before(scb->seq, tp->snd_up)) + !before(orig_seq, tp->snd_up)) tp->urg_mode = 0; } } else if (*seq_rtt < 0) @@ -2425,7 +2422,6 @@ tcp_dec_pcount_explicit(&tp->fackets_out, dval); } tcp_dec_pcount_explicit(&tp->packets_out, packets_acked); - scb->tso_factor -= packets_acked; BUG_ON(scb->tso_factor == 0); BUG_ON(!before(scb->seq, scb->end_seq)); @@ -2455,7 +2451,7 @@ */ if (after(scb->end_seq, tp->snd_una)) { if (scb->tso_factor > 1) - acked |= tcp_tso_acked(tp, skb, + acked |= tcp_tso_acked(sk, skb, now, &seq_rtt); break; } diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c --- a/net/ipv4/tcp_output.c 2004-09-29 21:12:59 -07:00 +++ b/net/ipv4/tcp_output.c 2004-09-29 21:12:59 -07:00 @@ -525,7 +525,7 @@ * eventually). The difference is that pulled data not copied, but * immediately discarded. */ -unsigned char * __pskb_trim_head(struct sk_buff *skb, int len) +static unsigned char *__pskb_trim_head(struct sk_buff *skb, int len) { int i, k, eat; @@ -553,8 +553,10 @@ return skb->tail; } -static int __tcp_trim_head(struct tcp_opt *tp, struct sk_buff *skb, u32 len) +int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len) { + struct tcp_opt *tp = tcp_sk(sk); + if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) return -ENOMEM; @@ -566,8 +568,14 @@ return -ENOMEM; } + TCP_SKB_CB(skb)->seq += len; skb->ip_summed = CHECKSUM_HW; + skb->truesize -= len; + sk->sk_queue_shrunk = 1; + sk->sk_wmem_queued -= len; + sk->sk_forward_alloc += len; + /* Any change of skb->len requires recalculation of tso * factor and mss. 
*/ @@ -576,16 +584,6 @@ return 0; } -static inline int tcp_trim_head(struct tcp_opt *tp, struct sk_buff *skb, u32 len) -{ - int err = __tcp_trim_head(tp, skb, len); - - if (!err) - TCP_SKB_CB(skb)->seq += len; - - return err; -} - /* This function synchronize snd mss to current pmtu/exthdr set. tp->user_mss is mss set by user by TCP_MAXSEG. It does NOT counts @@ -686,11 +684,12 @@ 68U - tp->tcp_header_len); /* Always keep large mss multiple of real mss, but - * do not exceed congestion window. + * do not exceed 1/4 of the congestion window so we + * can keep the ACK clock ticking. */ factor = large_mss / mss_now; - if (factor > tp->snd_cwnd) - factor = tp->snd_cwnd; + if (factor > (tp->snd_cwnd >> 2)) + factor = max(1, tp->snd_cwnd >> 2); tp->mss_cache = mss_now * factor; @@ -1003,7 +1002,6 @@ { struct tcp_opt *tp = tcp_sk(sk); unsigned int cur_mss = tcp_current_mss(sk, 0); - __u32 data_seq, data_end_seq; int err; /* Do not sent more than we queued. 1/4 is reserved for possible @@ -1013,24 +1011,6 @@ min(sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf)) return -EAGAIN; - /* What is going on here? When TSO packets are partially ACK'd, - * we adjust the TCP_SKB_CB(skb)->seq value forward but we do - * not adjust the data area of the SKB. We defer that to here - * so that we can avoid the work unless we really retransmit - * the packet. - */ - data_seq = TCP_SKB_CB(skb)->seq; - data_end_seq = TCP_SKB_CB(skb)->end_seq; - if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) - data_end_seq--; - - if (skb->len > (data_end_seq - data_seq)) { - u32 to_trim = skb->len - (data_end_seq - data_seq); - - if (__tcp_trim_head(tp, skb, to_trim)) - return -ENOMEM; - } - if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) { if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) BUG(); @@ -1041,7 +1021,7 @@ tp->mss_cache = tp->mss_cache_std; } - if (tcp_trim_head(tp, skb, tp->snd_una - TCP_SKB_CB(skb)->seq)) + if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq)) return -ENOMEM; } ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-30 4:33 ` David S. Miller @ 2004-09-30 5:47 ` Herbert Xu 2004-09-30 7:39 ` David S. Miller 2004-09-30 9:29 ` Andi Kleen 1 sibling, 1 reply; 97+ messages in thread From: Herbert Xu @ 2004-09-30 5:47 UTC (permalink / raw) To: David S. Miller; +Cc: ak, niv, jheffner, andy.grover, anton, netdev [-- Attachment #1: Type: text/plain, Size: 753 bytes --] On Wed, Sep 29, 2004 at 09:33:10PM -0700, David S. Miller wrote: > > @@ -2413,7 +2410,7 @@ > packets_acked); > if (sacked & TCPCB_URG) { > if (tp->urg_mode && > - !before(scb->seq, tp->snd_up)) > + !before(orig_seq, tp->snd_up)) > tp->urg_mode = 0; That looks like a typo. We should check against the new starting sequence number, not the original. We should also change the !before to after since the original check applied to end_seq. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt [-- Attachment #2: p --] [-- Type: text/plain, Size: 375 bytes --] --- net-2.6/net/ipv4/tcp_input.c.orig 2004-09-30 15:41:46.000000000 +1000 +++ net-2.6/net/ipv4/tcp_input.c 2004-09-30 15:44:27.000000000 +1000 @@ -2410,7 +2410,7 @@ packets_acked); if (sacked & TCPCB_URG) { if (tp->urg_mode && - !before(orig_seq, tp->snd_up)) + after(seq, tp->snd_up)) tp->urg_mode = 0; } } else if (*seq_rtt < 0) ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-30 5:47 ` Herbert Xu @ 2004-09-30 7:39 ` David S. Miller 2004-09-30 8:09 ` Herbert Xu 0 siblings, 1 reply; 97+ messages in thread From: David S. Miller @ 2004-09-30 7:39 UTC (permalink / raw) To: Herbert Xu; +Cc: ak, niv, jheffner, andy.grover, anton, netdev On Thu, 30 Sep 2004 15:47:38 +1000 Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Wed, Sep 29, 2004 at 09:33:10PM -0700, David S. Miller wrote: > > > > @@ -2413,7 +2410,7 @@ > > packets_acked); > > if (sacked & TCPCB_URG) { > > if (tp->urg_mode && > > - !before(scb->seq, tp->snd_up)) > > + !before(orig_seq, tp->snd_up)) > > tp->urg_mode = 0; > > That looks like a typo. We should check against the new starting > sequence number, not the original. We should also change the !before > to after since the original check applied to end_seq. I agree about the first part, but the second I do not. The new 'seq' is equivalent to what end_seq would be of the TSO sub-packet. Therefore the correct test type would be !before(seq, tp->snd_up), right? ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-30 7:39 ` David S. Miller @ 2004-09-30 8:09 ` Herbert Xu 0 siblings, 0 replies; 97+ messages in thread From: Herbert Xu @ 2004-09-30 8:09 UTC (permalink / raw) To: David S. Miller; +Cc: ak, niv, jheffner, andy.grover, anton, netdev On Thu, Sep 30, 2004 at 12:39:22AM -0700, David S. Miller wrote: > > The new 'seq' is equivalent to what end_seq would be of the > TSO sub-packet. Therefore the correct test type would be > !before(seq, tp->snd_up), right? You're absolutely right. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 97+ messages in thread
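(The exchange above turns on TCP's wraparound-safe sequence arithmetic. before() and after() from include/net/tcp.h compare 32-bit sequence numbers via signed subtraction, so !before(seq, tp->snd_up) reads as "seq is at or beyond the urgent pointer, modulo 2^32". Stand-alone versions:)

#include <stdint.h>

/* As in include/net/tcp.h: wraparound-safe sequence comparisons. */
static int before(uint32_t seq1, uint32_t seq2)
{
        return (int32_t)(seq1 - seq2) < 0;
}

static int after(uint32_t seq1, uint32_t seq2)
{
        return before(seq2, seq1);
}
/* !before(seq, snd_up)  <=>  seq >= snd_up  (mod 2^32) */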
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-30 4:33 ` David S. Miller 2004-09-30 5:47 ` Herbert Xu @ 2004-09-30 9:29 ` Andi Kleen 2004-09-30 20:20 ` David S. Miller 1 sibling, 1 reply; 97+ messages in thread From: Andi Kleen @ 2004-09-30 9:29 UTC (permalink / raw) To: David S. Miller; +Cc: Herbert Xu, ak, niv, jheffner, andy.grover, anton, netdev > Right. This patch below combines that with adjustment of socket > send queue usage when we trim the head. I tested this patch on rc3. First it doesn't crash anymore. Thanks. Talking to another rc3 kernel with tg3 tso on is as fast as tso off now. Unfortunately the ACKs still look funny. Talking to a 2.6.5 kernel tso on is still about 20 MB/s slower than the non TSO case, so the problem seems to be still there. -Andi 2.6.9rc3/e1000/CSA/TSO -> 2.6.9rc3/tg3/33MhzPCI Fast. Still stretch ack. 11:13:25.721704 10.23.202.15.32777 > 10.23.202.31.32784: . ack 7241 win 5068 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.721709 10.23.202.15.32777 > 10.23.202.31.32784: . ack 8689 win 5792 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.721715 10.23.202.15.32777 > 10.23.202.31.32784: . ack 10137 win 6516 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.721726 10.23.202.15.32777 > 10.23.202.31.32784: . ack 11585 win 7240 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.721731 10.23.202.15.32777 > 10.23.202.31.32784: . ack 13033 win 7964 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.721882 10.23.202.31.32784 > 10.23.202.15.32777: . 13033:14481(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.721887 10.23.202.31.32784 > 10.23.202.15.32777: . 14481:15929(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.721893 10.23.202.31.32784 > 10.23.202.15.32777: P 15929:16385(456) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.721940 10.23.202.31.32784 > 10.23.202.15.32777: . 16385:17833(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.721943 10.23.202.31.32784 > 10.23.202.15.32777: . 17833:19281(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.721947 10.23.202.31.32784 > 10.23.202.15.32777: . 19281:20729(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.721951 10.23.202.31.32784 > 10.23.202.15.32777: . 20729:22177(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.721970 10.23.202.15.32777 > 10.23.202.31.32784: . ack 14481 win 8688 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.721975 10.23.202.15.32777 > 10.23.202.31.32784: . ack 15929 win 9412 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.721980 10.23.202.15.32777 > 10.23.202.31.32784: . ack 16385 win 9412 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.721985 10.23.202.15.32777 > 10.23.202.31.32784: . ack 17833 win 10136 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.721991 10.23.202.15.32777 > 10.23.202.31.32784: . ack 19281 win 10860 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.721996 10.23.202.15.32777 > 10.23.202.31.32784: . ack 20729 win 11584 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.722018 10.23.202.15.32777 > 10.23.202.31.32784: . ack 22177 win 12308 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.722127 10.23.202.31.32784 > 10.23.202.15.32777: . 22177:23625(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.722134 10.23.202.15.32777 > 10.23.202.31.32784: . 
ack 23625 win 13032 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.722139 10.23.202.31.32784 > 10.23.202.15.32777: . 23625:25073(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.722142 10.23.202.15.32777 > 10.23.202.31.32784: . ack 25073 win 13756 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.722147 10.23.202.31.32784 > 10.23.202.15.32777: . 25073:26521(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.722151 10.23.202.15.32777 > 10.23.202.31.32784: . ack 26521 win 14480 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.722156 10.23.202.31.32784 > 10.23.202.15.32777: . 26521:27969(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.722160 10.23.202.15.32777 > 10.23.202.31.32784: . ack 27969 win 14962 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.722228 10.23.202.31.32784 > 10.23.202.15.32777: . 27969:29417(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.722236 10.23.202.31.32784 > 10.23.202.15.32777: . 29417:30865(1448) ack 1 win 1460 <nop,nop,timestamp 4294879786 4294921018> (DF) 11:13:25.722243 10.23.202.31.32784 > 10.23.202.15.32777: . 30865:32313(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722249 10.23.202.31.32784 > 10.23.202.15.32777: . 32313:33761(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722273 10.23.202.31.32784 > 10.23.202.15.32777: P 33761:35209(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722277 10.23.202.31.32784 > 10.23.202.15.32777: . 35209:36657(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722282 10.23.202.31.32784 > 10.23.202.15.32777: . 36657:38105(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722339 10.23.202.15.32777 > 10.23.202.31.32784: . ack 38105 win 15204 <nop,nop,timestamp 4294921018 4294879786> (DF) 11:13:25.722415 10.23.202.31.32784 > 10.23.202.15.32777: . 38105:39553(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722419 10.23.202.31.32784 > 10.23.202.15.32777: . 39553:41001(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722425 10.23.202.31.32784 > 10.23.202.15.32777: . 41001:42449(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722429 10.23.202.31.32784 > 10.23.202.15.32777: . 42449:43897(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722433 10.23.202.31.32784 > 10.23.202.15.32777: . 43897:45345(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722436 10.23.202.31.32784 > 10.23.202.15.32777: . 45345:46793(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722459 10.23.202.31.32784 > 10.23.202.15.32777: . 46793:48241(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722463 10.23.202.31.32784 > 10.23.202.15.32777: . 48241:49689(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722472 10.23.202.15.32777 > 10.23.202.31.32784: . ack 39553 win 15928 <nop,nop,timestamp 4294921019 4294879787> (DF) 11:13:25.722504 10.23.202.15.32777 > 10.23.202.31.32784: . ack 49689 win 15928 <nop,nop,timestamp 4294921019 4294879787> (DF) 11:13:25.722616 10.23.202.31.32784 > 10.23.202.15.32777: . 49689:51137(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722622 10.23.202.31.32784 > 10.23.202.15.32777: . 
51137:52585(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722690 10.23.202.31.32784 > 10.23.202.15.32777: . 52585:54033(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722697 10.23.202.31.32784 > 10.23.202.15.32777: . 54033:55481(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921018> (DF) 11:13:25.722703 10.23.202.31.32784 > 10.23.202.15.32777: . 55481:56929(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921019> (DF) 11:13:25.722710 10.23.202.31.32784 > 10.23.202.15.32777: . 56929:58377(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921019> (DF) 11:13:25.722716 10.23.202.31.32784 > 10.23.202.15.32777: . 58377:59825(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921019> (DF) 11:13:25.722786 10.23.202.31.32784 > 10.23.202.15.32777: P 59825:61273(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921019> (DF) 11:13:25.722793 10.23.202.31.32784 > 10.23.202.15.32777: . 61273:62721(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921019> (DF) 11:13:25.722798 10.23.202.31.32784 > 10.23.202.15.32777: . 62721:64169(1448) ack 1 win 1460 <nop,nop,timestamp 4294879787 4294921019> (DF) 2.6.9rc3/e1000/CSA/TSO -> 2.6.5/tg3/33MhzPCI Slow. Extreme stretch ack: 11:19:45.486497 10.23.202.31.32782 > 10.23.202.10.32978: . 27969:29417(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486506 10.23.202.10.32978 > 10.23.202.31.32782: . ack 29417 win 62264 <nop,nop,timestamp 5148288 4294857431> (DF) 11:19:45.486510 10.23.202.31.32782 > 10.23.202.10.32978: P 29417:30865(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486536 10.23.202.31.32782 > 10.23.202.10.32978: . 30865:32313(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486543 10.23.202.31.32782 > 10.23.202.10.32978: . 32313:33761(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486642 10.23.202.10.32978 > 10.23.202.31.32782: . ack 33761 win 63712 <nop,nop,timestamp 5148288 4294857431> (DF) 11:19:45.486678 10.23.202.31.32782 > 10.23.202.10.32978: . 33761:35209(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486686 10.23.202.10.32978 > 10.23.202.31.32782: . ack 35209 win 63712 <nop,nop,timestamp 5148288 4294857431> (DF) 11:19:45.486691 10.23.202.31.32782 > 10.23.202.10.32978: . 35209:36657(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486697 10.23.202.31.32782 > 10.23.202.10.32978: . 36657:38105(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486704 10.23.202.31.32782 > 10.23.202.10.32978: . 38105:39553(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486710 10.23.202.31.32782 > 10.23.202.10.32978: . 39553:41001(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486716 10.23.202.31.32782 > 10.23.202.10.32978: . 41001:42449(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486722 10.23.202.31.32782 > 10.23.202.10.32978: . 42449:43897(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486747 10.23.202.31.32782 > 10.23.202.10.32978: . 43897:45345(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486754 10.23.202.31.32782 > 10.23.202.10.32978: . 45345:46793(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486761 10.23.202.31.32782 > 10.23.202.10.32978: . 
46793:48241(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486768 10.23.202.31.32782 > 10.23.202.10.32978: . 48241:49689(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486775 10.23.202.31.32782 > 10.23.202.10.32978: . 49689:51137(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486859 10.23.202.31.32782 > 10.23.202.10.32978: . 51137:52585(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486867 10.23.202.31.32782 > 10.23.202.10.32978: . 52585:54033(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486873 10.23.202.31.32782 > 10.23.202.10.32978: . 54033:55481(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.486880 10.23.202.31.32782 > 10.23.202.10.32978: P 55481:56929(1448) ack 1 win 1460 <nop,nop,timestamp 4294857431 5148288> (DF) 11:19:45.487018 10.23.202.10.32978 > 10.23.202.31.32782: . ack 56929 win 63712 <nop,nop,timestamp 5148289 4294857431> (DF) 11:19:45.487279 10.23.202.31.32782 > 10.23.202.10.32978: . 56929:58377(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487288 10.23.202.31.32782 > 10.23.202.10.32978: . 58377:59825(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487294 10.23.202.31.32782 > 10.23.202.10.32978: . 59825:61273(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487299 10.23.202.31.32782 > 10.23.202.10.32978: . 61273:62721(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487305 10.23.202.31.32782 > 10.23.202.10.32978: . 62721:64169(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487311 10.23.202.31.32782 > 10.23.202.10.32978: . 64169:65617(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487316 10.23.202.31.32782 > 10.23.202.10.32978: . 65617:67065(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487370 10.23.202.31.32782 > 10.23.202.10.32978: . 67065:68513(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487376 10.23.202.31.32782 > 10.23.202.10.32978: . 68513:69961(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487382 10.23.202.31.32782 > 10.23.202.10.32978: . 69961:71409(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487388 10.23.202.31.32782 > 10.23.202.10.32978: . 71409:72857(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487393 10.23.202.31.32782 > 10.23.202.10.32978: . 72857:74305(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487398 10.23.202.31.32782 > 10.23.202.10.32978: . 74305:75753(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487421 10.23.202.31.32782 > 10.23.202.10.32978: . 75753:77201(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487427 10.23.202.31.32782 > 10.23.202.10.32978: . 77201:78649(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487433 10.23.202.31.32782 > 10.23.202.10.32978: . 78649:80097(1448) ack 1 win 1460 <nop,nop,timestamp 4294857432 5148289> (DF) 11:19:45.487563 10.23.202.10.32978 > 10.23.202.31.32782: . ack 80097 win 63712 <nop,nop,timestamp 5148289 4294857432> (DF) 2.6.9rc3/e1000/CSA/noTSO -> 2.6.5/tg3/33MhzPCI Fast. Also stretch ACK, but less extreme. 11:26:12.886397 10.23.202.10.32979 > 10.23.202.31.32788: . 
ack 8689 win 23168 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886405 10.23.202.10.32979 > 10.23.202.31.32788: . ack 10137 win 26064 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886420 10.23.202.10.32979 > 10.23.202.31.32788: . ack 11585 win 28960 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886440 10.23.202.10.32979 > 10.23.202.31.32788: . ack 13033 win 31856 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886488 10.23.202.31.32788 > 10.23.202.10.32979: . 13033:14481(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886500 10.23.202.10.32979 > 10.23.202.31.32788: . ack 14481 win 34752 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886504 10.23.202.31.32788 > 10.23.202.10.32979: . 14481:15929(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886513 10.23.202.10.32979 > 10.23.202.31.32788: . ack 15929 win 37648 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886749 10.23.202.31.32788 > 10.23.202.10.32979: . 15929:17377(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886757 10.23.202.31.32788 > 10.23.202.10.32979: . 17377:18825(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886763 10.23.202.31.32788 > 10.23.202.10.32979: . 18825:20273(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886769 10.23.202.31.32788 > 10.23.202.10.32979: . 20273:21721(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886775 10.23.202.31.32788 > 10.23.202.10.32979: . 21721:23169(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886780 10.23.202.31.32788 > 10.23.202.10.32979: . 23169:24617(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886786 10.23.202.31.32788 > 10.23.202.10.32979: P 24617:26065(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886792 10.23.202.31.32788 > 10.23.202.10.32979: P 26065:27513(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886798 10.23.202.31.32788 > 10.23.202.10.32979: . 27513:28961(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886805 10.23.202.31.32788 > 10.23.202.10.32979: . 28961:30409(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886810 10.23.202.31.32788 > 10.23.202.10.32979: . 30409:31857(1448) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886816 10.23.202.31.32788 > 10.23.202.10.32979: P 31857:32769(912) ack 1 win 1460 <nop,nop,timestamp 277526 5535752> (DF) 11:26:12.886892 10.23.202.10.32979 > 10.23.202.31.32788: . ack 17377 win 40544 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886901 10.23.202.10.32979 > 10.23.202.31.32788: . ack 18825 win 43440 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886910 10.23.202.10.32979 > 10.23.202.31.32788: . ack 20273 win 46336 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886917 10.23.202.10.32979 > 10.23.202.31.32788: . ack 21721 win 49232 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886925 10.23.202.10.32979 > 10.23.202.31.32788: . ack 23169 win 52128 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886933 10.23.202.10.32979 > 10.23.202.31.32788: . ack 24617 win 55024 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886960 10.23.202.10.32979 > 10.23.202.31.32788: . ack 26065 win 57920 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886976 10.23.202.10.32979 > 10.23.202.31.32788: . ack 27513 win 60816 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886985 10.23.202.10.32979 > 10.23.202.31.32788: . 
ack 28961 win 63712 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.886995 10.23.202.10.32979 > 10.23.202.31.32788: . ack 30409 win 63712 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.887027 10.23.202.10.32979 > 10.23.202.31.32788: . ack 31857 win 63712 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.887035 10.23.202.10.32979 > 10.23.202.31.32788: . ack 32769 win 63712 <nop,nop,timestamp 5535752 277526> (DF) 11:26:12.887151 10.23.202.31.32788 > 10.23.202.10.32979: . 32769:34217(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887161 10.23.202.31.32788 > 10.23.202.10.32979: . 34217:35665(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887167 10.23.202.31.32788 > 10.23.202.10.32979: . 35665:37113(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887172 10.23.202.31.32788 > 10.23.202.10.32979: . 37113:38561(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887225 10.23.202.31.32788 > 10.23.202.10.32979: . 38561:40009(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887231 10.23.202.31.32788 > 10.23.202.10.32979: . 40009:41457(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887237 10.23.202.31.32788 > 10.23.202.10.32979: . 41457:42905(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887283 10.23.202.31.32788 > 10.23.202.10.32979: . 42905:44353(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887289 10.23.202.31.32788 > 10.23.202.10.32979: . 44353:45801(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887343 10.23.202.31.32788 > 10.23.202.10.32979: . 45801:47249(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887349 10.23.202.31.32788 > 10.23.202.10.32979: . 47249:48697(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887355 10.23.202.31.32788 > 10.23.202.10.32979: . 48697:50145(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887362 10.23.202.31.32788 > 10.23.202.10.32979: P 50145:51593(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887369 10.23.202.31.32788 > 10.23.202.10.32979: . 51593:53041(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887377 10.23.202.31.32788 > 10.23.202.10.32979: . 53041:54489(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887383 10.23.202.31.32788 > 10.23.202.10.32979: . 54489:55937(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887389 10.23.202.31.32788 > 10.23.202.10.32979: . 55937:57385(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887413 10.23.202.31.32788 > 10.23.202.10.32979: . 57385:58833(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887419 10.23.202.31.32788 > 10.23.202.10.32979: . 58833:60281(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887425 10.23.202.31.32788 > 10.23.202.10.32979: . 60281:61729(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887431 10.23.202.31.32788 > 10.23.202.10.32979: . 61729:63177(1448) ack 1 win 1460 <nop,nop,timestamp 277527 5535752> (DF) 11:26:12.887622 10.23.202.10.32979 > 10.23.202.31.32788: . ack 63177 win 63712 <nop,nop,timestamp 5535753 277527> (DF) 11:26:12.887765 10.23.202.31.32788 > 10.23.202.10.32979: . 63177:64625(1448) ack 1 win 1460 <nop,nop,timestamp 277528 5535753> (DF) 11:26:12.887774 10.23.202.31.32788 > 10.23.202.10.32979: . 
64625:66073(1448) ack 1 win 1460 <nop,nop,timestamp 277528 5535753> (DF) ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-30 9:29 ` Andi Kleen
@ 2004-09-30 20:20 ` David S. Miller
0 siblings, 0 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-30 20:20 UTC (permalink / raw)
To: Andi Kleen; +Cc: herbert, ak, niv, jheffner, andy.grover, anton, netdev
On Thu, 30 Sep 2004 11:29:26 +0200
Andi Kleen <ak@suse.de> wrote:
> Talking to another rc3 kernel with tg3 tso on is as fast as tso off now.
> Unfortunately the ACKs still look funny.
Can you help us debug this instead of just running tests over and over, Andi?
It's frustrating because I cannot reproduce the problem here, else I would
be adding checks to the receiver on my systems here. :-/
Please put some checks into the ACK tests on your 2.6.5 receiver to
determine why it does not want to ACK every other frame like it is supposed
to. We need to figure out where the stretch ACKs are coming from.
Thanks.
^ permalink raw reply [flat|nested] 97+ messages in thread
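One plausible way to add the checks being asked for, sketched against the 2.6-era __tcp_ack_snd_check() (the same function John's patch below touches); the printk and the temporaries are illustrative, not a tested patch:

/* Sketch only: log why the receiver chose an immediate vs. delayed ACK.
 * Field names follow the 2.6-era struct tcp_opt.
 */
static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
	struct tcp_opt *tp = tcp_sk(sk);
	/* More than one full frame received... */
	int more_than_one_mss = (tp->rcv_nxt - tp->rcv_wup) > tp->ack.rcv_mss;
	/* ... and right edge of window advances far enough. */
	int window_advances = __tcp_select_window(sk) >= tp->rcv_wnd;
	int quickack = tcp_in_quickack_mode(tp);

	printk(KERN_DEBUG "ack check: rcvd=%u rcv_mss=%u adv=%d quick=%d\n",
	       tp->rcv_nxt - tp->rcv_wup, tp->ack.rcv_mss,
	       window_advances, quickack);

	if ((more_than_one_mss && window_advances) || quickack ||
	    /* We have out of order data. */
	    (ofo_possible && skb_peek(&tp->out_of_order_queue) != NULL))
		tcp_send_ack(sk);		/* ACK it now */
	else
		tcp_send_delayed_ack(sk);	/* else, delayed ack */
}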
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28 22:33 ` Andi Kleen
2004-09-28 22:57 ` David S. Miller
@ 2004-09-29 3:27 ` John Heffner
2004-09-29 9:01 ` Andi Kleen
2004-09-29 21:00 ` David S. Miller
1 sibling, 2 replies; 97+ messages in thread
From: John Heffner @ 2004-09-29 3:27 UTC (permalink / raw)
To: Andi Kleen; +Cc: netdev
On Wed, 29 Sep 2004, Andi Kleen wrote:
> I'm afraid I must report it's still not completely solved for me yet.
> 10s netperf with TSO on with your patches gives now ~10MB/s less than
> with TSO off (57 vs 67). It's better than before, but not really
> fixed yet.
>
> Looking at my tcpdumps and comparing TSO on/off I see a quite
> strange effect. It only acks on every ~25th packet with TSO off
> but every ~16th packet with TSO on.
>
> Receiver is a 2.6.5 kernel, it's weird that it violates the
> ack every two MSS rule.
Does this help?
===== net/ipv4/tcp_input.c 1.73 vs edited =====
--- 1.73/net/ipv4/tcp_input.c	2004-09-12 20:30:58 -04:00
+++ edited/net/ipv4/tcp_input.c	2004-09-28 23:23:40 -04:00
@@ -3936,7 +4048,7 @@
 	     /* ... and right edge of window advances far enough.
 	      * (tcp_recvmsg() will send ACK otherwise). Or...
 	      */
-	     && __tcp_select_window(sk) >= tp->rcv_wnd) ||
+	     /* && __tcp_select_window(sk) >= tp->rcv_wnd */) ||
 	    /* We ACK each frame or... */
 	    tcp_in_quickack_mode(tp) ||
 	    /* We have out of order data. */
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 3:27 ` John Heffner
@ 2004-09-29 9:01 ` Andi Kleen
2004-09-29 19:56 ` David S. Miller
2004-09-29 21:00 ` David S. Miller
1 sibling, 1 reply; 97+ messages in thread
From: Andi Kleen @ 2004-09-29 9:01 UTC (permalink / raw)
To: John Heffner; +Cc: Andi Kleen, netdev
> Does this help?
Possibly. I don't have any plans to change the receiver, because it
works well for most cases.
The problem must still be in the sender.
-Andi
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 9:01 ` Andi Kleen
@ 2004-09-29 19:56 ` David S. Miller
2004-09-29 20:56 ` Andi Kleen
0 siblings, 1 reply; 97+ messages in thread
From: David S. Miller @ 2004-09-29 19:56 UTC (permalink / raw)
To: Andi Kleen; +Cc: jheffner, ak, netdev
On Wed, 29 Sep 2004 11:01:03 +0200
Andi Kleen <ak@suse.de> wrote:
> > Does this help?
>
> Possibly. I don't have any plans to change the receiver, because
> it works well for most cases.
>
> The problem must still be in the sender.
"Must"? So you've proven that the ack-every-22-packets behavior is due
entirely to the sender?
If you put 2.6.9-current on both ends and that makes things go smoothly,
that is an important data point; nobody else is seeing the behavior you
are at the moment.
Personally, all of my bets are on the 2.6.5 frankenstein kernel as the
culprit :-)
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 19:56 ` David S. Miller
@ 2004-09-29 20:56 ` Andi Kleen
2004-09-29 21:17 ` David S. Miller
0 siblings, 1 reply; 97+ messages in thread
From: Andi Kleen @ 2004-09-29 20:56 UTC (permalink / raw)
To: David S. Miller; +Cc: jheffner, netdev
On Wed, 29 Sep 2004 12:56:44 -0700
"David S. Miller" <davem@davemloft.net> wrote:
> On Wed, 29 Sep 2004 11:01:03 +0200
> Andi Kleen <ak@suse.de> wrote:
> > > Does this help?
> >
> > Possibly. I don't have any plans to change the receiver, because
> > it works well for most cases.
> >
> > The problem must still be in the sender.
>
> "Must"? So you've proven that the ack-every-22-packets behavior
> is due entirely to the sender?
Just saying that the kernel was tested in a very wide range of workloads
and environments, and there are no known TCP issues. Also, the test that
John wanted to disable has been in Linux practically forever (iirc it
dates back to Eric Schenk's work in 2.0). And other kernels talking to
it get ok performance; all the problems only started with the TSO changes
in 2.6.9rc.
> If you put 2.6.9-current on both ends and that makes things go smoothly,
> that is an important data point; nobody else is seeing the behavior you
> are at the moment.
>
> Personally, all of my bets are on the 2.6.5 frankenstein kernel as the
> culprit :-)
Umm, the TCP stack is pretty vanilla 2.6.5.
Ok, I ran it talking to a 2.6.9rc2bk11+instanto-oops-patchkit kernel,
band-aided to avoid the oops, with a tg3. Actually, I tried to: with TSO
on, the sender doesn't finish the normal 10s netperf standard test even
in 30s. Then the sender eventually crashed, even with unit-at-a-time
turned off. And the crash is starting to look a bit less like a compiler
issue. Does it really work for you?
Part of the trace below. It looks like the new kernel has really bad
problems with acking.
22:45:06.065031 10.23.202.15.32777 > 10.23.202.31.32775: . ack 11585 win 7240 <nop,nop,timestamp 352118 325947> (DF)
22:45:06.065042 10.23.202.15.32777 > 10.23.202.31.32775: . ack 13033 win 7964 <nop,nop,timestamp 352118 325947> (DF)
22:45:06.065047 10.23.202.15.32777 > 10.23.202.31.32775: . ack 14481 win 8688 <nop,nop,timestamp 352118 325947> (DF)
22:45:06.065052 10.23.202.15.32777 > 10.23.202.31.32775: . ack 15929 win 9412 <nop,nop,timestamp 352118 325947> (DF)
22:45:06.065061 10.23.202.15.32777 > 10.23.202.31.32775: . ack 17377 win 10136 <nop,nop,timestamp 352118 325947> (DF)
22:45:06.065066 10.23.202.15.32777 > 10.23.202.31.32775: . ack 18825 win 10860 <nop,nop,timestamp 352118 325947> (DF)
22:45:06.065091 10.23.202.15.32777 > 10.23.202.31.32775: . ack 20273 win 11584 <nop,nop,timestamp 352118 325947> (DF)
22:45:06.065254 10.23.202.31.32775 > 10.23.202.15.32777: . 20273:21721(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF)
22:45:06.065317 10.23.202.15.32777 > 10.23.202.31.32775: . ack 21721 win 12308 <nop,nop,timestamp 352118 325947> (DF)
22:45:06.065322 10.23.202.31.32775 > 10.23.202.15.32777: . 21721:23169(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF)
22:45:06.065327 10.23.202.31.32775 > 10.23.202.15.32777: . 23169:24617(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF)
22:45:06.065331 10.23.202.31.32775 > 10.23.202.15.32777: . 24617:26065(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF)
22:45:06.065335 10.23.202.31.32775 > 10.23.202.15.32777: .
26065:27513(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF) 22:45:06.065339 10.23.202.31.32775 > 10.23.202.15.32777: P 27513:28961(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF) 22:45:06.065348 10.23.202.15.32777 > 10.23.202.31.32775: . ack 23169 win 13032 <nop,nop,timestamp 352118 325947> (DF) 22:45:06.065353 10.23.202.15.32777 > 10.23.202.31.32775: . ack 24617 win 13756 <nop,nop,timestamp 352118 325947> (DF) 22:45:06.065358 10.23.202.15.32777 > 10.23.202.31.32775: . ack 26065 win 14480 <nop,nop,timestamp 352118 325947> (DF) 22:45:06.065368 10.23.202.15.32777 > 10.23.202.31.32775: . ack 27513 win 15204 <nop,nop,timestamp 352118 325947> (DF) 22:45:06.065372 10.23.202.15.32777 > 10.23.202.31.32775: . ack 28961 win 15928 <nop,nop,timestamp 352118 325947> (DF) 22:45:06.065529 10.23.202.31.32775 > 10.23.202.15.32777: . 28961:30409(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF) 22:45:06.065534 10.23.202.31.32775 > 10.23.202.15.32777: . 30409:31857(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF) 22:45:06.065540 10.23.202.31.32775 > 10.23.202.15.32777: . 31857:33305(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF) 22:45:06.065577 10.23.202.31.32775 > 10.23.202.15.32777: . 33305:34753(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF) 22:45:06.065581 10.23.202.31.32775 > 10.23.202.15.32777: . 34753:36201(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF) 22:45:06.065585 10.23.202.31.32775 > 10.23.202.15.32777: P 36201:37649(1448) ack 1 win 1460 <nop,nop,timestamp 325947 352118> (DF) 22:45:06.065603 10.23.202.15.32777 > 10.23.202.31.32775: . ack 30409 win 16022 <nop,nop,timestamp 352118 325947> (DF) 22:45:06.065608 10.23.202.15.32777 > 10.23.202.31.32775: . ack 31857 win 16022 <nop,nop,timestamp 352118 325947> (DF) 22:45:06.065613 10.23.202.15.32777 > 10.23.202.31.32775: . ack 33305 win 16022 <nop,nop,timestamp 352118 325947> (DF) 22:45:06.065622 10.23.202.15.32777 > 10.23.202.31.32775: . ack 37649 win 16022 <nop,nop,timestamp 352118 325947> (DF) 22:45:06.065771 10.23.202.31.32775 > 10.23.202.15.32777: . 37649:39097(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352118> (DF) 22:45:06.065778 10.23.202.31.32775 > 10.23.202.15.32777: . 39097:40545(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352118> (DF) 22:45:06.065784 10.23.202.31.32775 > 10.23.202.15.32777: . 40545:41993(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352118> (DF) 22:45:06.065847 10.23.202.31.32775 > 10.23.202.15.32777: . 41993:43441(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352118> (DF) 22:45:06.065851 10.23.202.31.32775 > 10.23.202.15.32777: . 43441:44889(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352118> (DF) 22:45:06.065855 10.23.202.31.32775 > 10.23.202.15.32777: . 44889:46337(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352118> (DF) 22:45:06.065858 10.23.202.31.32775 > 10.23.202.15.32777: . 46337:47785(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352118> (DF) 22:45:06.065862 10.23.202.31.32775 > 10.23.202.15.32777: . 47785:49233(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352118> (DF) 22:45:06.065900 10.23.202.15.32777 > 10.23.202.31.32775: . ack 49233 win 16022 <nop,nop,timestamp 352119 325948> (DF) 22:45:06.066152 10.23.202.31.32775 > 10.23.202.15.32777: . 49233:50681(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352119> (DF) 22:45:06.066159 10.23.202.31.32775 > 10.23.202.15.32777: . 50681:52129(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352119> (DF) 22:45:06.066165 10.23.202.31.32775 > 10.23.202.15.32777: . 
52129:53577(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352119> (DF) 22:45:06.066172 10.23.202.31.32775 > 10.23.202.15.32777: . 53577:55025(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352119> (DF) 22:45:06.066208 10.23.202.31.32775 > 10.23.202.15.32777: . 55025:56473(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352119> (DF) 22:45:06.066212 10.23.202.31.32775 > 10.23.202.15.32777: . 56473:57921(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352119> (DF) 22:45:06.066216 10.23.202.31.32775 > 10.23.202.15.32777: . 57921:59369(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352119> (DF) 22:45:06.066219 10.23.202.31.32775 > 10.23.202.15.32777: P 59369:60817(1448) ack 1 win 1460 <nop,nop,timestamp 325948 352119> (DF) 22:45:06.066263 10.23.202.15.32777 > 10.23.202.31.32775: . ack 60817 win 16022 <nop,nop,timestamp 352119 325948> (DF) -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 20:56 ` Andi Kleen
@ 2004-09-29 21:17 ` David S. Miller
0 siblings, 0 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-29 21:17 UTC (permalink / raw)
To: Andi Kleen; +Cc: jheffner, netdev
On Wed, 29 Sep 2004 22:56:35 +0200
Andi Kleen <ak@suse.de> wrote:
> Also, the test that John wanted to disable has been in Linux practically
> forever (iirc it dates back to Eric Schenk's work in 2.0)
Yes, but it could prove something about either the sender or the receiver,
namely that the receive window is not opening up fast enough for some
reason, thus delaying the ACKs.
> Part of the trace below. It looks like the new kernel has really
> bad problems with acking.
It looks, again, like the receiver's window is not opening up fast enough,
which controls ACK response decisions, and that is what jheffner's patch
is supposed to verify.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 3:27 ` John Heffner
2004-09-29 9:01 ` Andi Kleen
@ 2004-09-29 21:00 ` David S. Miller
2004-09-29 21:16 ` Nivedita Singhvi
1 sibling, 1 reply; 97+ messages in thread
From: David S. Miller @ 2004-09-29 21:00 UTC (permalink / raw)
To: John Heffner; +Cc: ak, netdev
On Tue, 28 Sep 2004 23:27:21 -0400 (EDT)
John Heffner <jheffner@psc.edu> wrote:
> On Wed, 29 Sep 2004, Andi Kleen wrote:
> > I'm afraid I must report it's still not completely solved for me yet.
> > 10s netperf with TSO on with your patches gives now ~10MB/s less than
> > with TSO off (57 vs 67). It's better than before, but not really
> > fixed yet.
> >
> > Looking at my tcpdumps and comparing TSO on/off I see a quite
> > strange effect. It only acks on every ~25th packet with TSO off
> > but every ~16th packet with TSO on.
> >
> > Receiver is a 2.6.5 kernel, it's weird that it violates the
> > ack every two MSS rule.
>
> Does this help?
I think you hit the jackpot, John... or at least you're on the right trail.
It seems I'll have to do some send buffer liberation when we partially ACK
TSO frames. Since that isn't happening currently, this window advancing
test never passes until the full TSO frame is freed up at the sender side.
Patch coming...
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 21:00 ` David S. Miller
@ 2004-09-29 21:16 ` Nivedita Singhvi
2004-09-29 21:22 ` David S. Miller
0 siblings, 1 reply; 97+ messages in thread
From: Nivedita Singhvi @ 2004-09-29 21:16 UTC (permalink / raw)
To: David S. Miller; +Cc: John Heffner, ak, netdev
David S. Miller wrote:
> I think you hit the jackpot, John... or at least you're
> on the right trail.
>
> It seems I'll have to do some send buffer liberation when
> we partially ACK TSO frames. Since that isn't happening
> currently, this window advancing test never passes until
> the full TSO frame is freed up at the sender side.
>
> Patch coming...
That was my point to Herbert, Dave - that we can't rely on Nagle: either
we trigger too early and don't utilize the TSO MTU, or we trigger too late
(waiting for the full TSO frame), depending on whether we use the standard
or the TSO mss.
We need some heuristic to do partial sends under TSO. Is that what you
are addressing?
thanks,
Nivedita
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 21:16 ` Nivedita Singhvi
@ 2004-09-29 21:22 ` David S. Miller
2004-09-29 21:43 ` Andi Kleen
0 siblings, 1 reply; 97+ messages in thread
From: David S. Miller @ 2004-09-29 21:22 UTC (permalink / raw)
To: Nivedita Singhvi; +Cc: jheffner, ak, netdev
On Wed, 29 Sep 2004 14:16:55 -0700
Nivedita Singhvi <niv@us.ibm.com> wrote:
> That was my point to Herbert, Dave - that we can't
> rely on Nagle: either we trigger too early and don't
> utilize the TSO MTU, or we trigger too late (waiting
> for the full TSO frame), depending on whether we use
> the standard or the TSO mss.
>
> We need some heuristic to do partial sends under
> TSO. Is that what you are addressing?
It's the partial ACK of a TSO frame that causes congestion window and
socket buffer allocation issues. We handle partial sends by just limiting
the TSO mss to never be larger than the congestion window.
We need to handle ACK'ing issues by:
1) Keeping track of "real mss" packet ACKs of TSO frames. This is
implemented currently by advancing the sequence number of the TSO SKB in
the retransmit queue.
2) What I'm working on now, which is to liberate send buffer socket space
when we do partial acking in #1.
I need to see the crash folks are seeing. Does it only happen when you
have tcpdump attached to the interface running the test? Does it happen
only on the sender or the receiver in the netperf test? Those would be
good clues, indicating some SKB sharing bug I've created or similar.
Thanks.
^ permalink raw reply [flat|nested] 97+ messages in thread
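A self-contained sketch of point 1 above: counting how many real-MSS sub-packets of a single TSO frame an incoming ACK covers, and advancing the frame's start sequence accordingly. The numbers are made up, and the actual kernel patch appears later in the thread:

#include <stdio.h>
#include <stdint.h>

/* Kernel-style wraparound-safe comparison: seq1 > seq2 mod 2^32. */
static int after(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq2 - seq1) < 0;
}

int main(void)
{
	uint32_t seq = 1, end_seq = 1 + 44 * 1448;	/* one ~64K TSO frame */
	uint32_t mss = 1448, snd_una = 1 + 5 * 1448;	/* ACK covers 5 sub-packets */
	unsigned int packets_acked = 0;

	/* Count whole real-MSS sub-packets at or below snd_una. */
	while (!after(seq + mss, snd_una)) {
		seq += mss;
		packets_acked++;
	}
	/* packets_acked feeds packets_out/cwnd accounting; seq is the
	 * new start of the still-unacked remainder of the frame.
	 */
	printf("acked %u sub-packets, seq advanced to %u (end_seq %u)\n",
	       packets_acked, seq, end_seq);
	return 0;
}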
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 21:22 ` David S. Miller
@ 2004-09-29 21:43 ` Andi Kleen
2004-09-29 21:51 ` John Heffner
0 siblings, 1 reply; 97+ messages in thread
From: Andi Kleen @ 2004-09-29 21:43 UTC (permalink / raw)
To: David S. Miller; +Cc: Nivedita Singhvi, jheffner, ak, netdev
> I need to see the crash folks are seeing. Does it only happen
> when you have tcpdump attached to the interface running the
> test? Does it happen only on the sender or the receiver
It happens without tcpdump too.
> in the netperf test? Those would be good clues, indicating
The sender first gets slow, then eventually crashes.
-Andi
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 21:43 ` Andi Kleen
@ 2004-09-29 21:51 ` John Heffner
2004-09-29 21:52 ` David S. Miller
0 siblings, 1 reply; 97+ messages in thread
From: John Heffner @ 2004-09-29 21:51 UTC (permalink / raw)
To: Andi Kleen; +Cc: David S. Miller, Nivedita Singhvi, netdev
On Wed, 29 Sep 2004, Andi Kleen wrote:
> > I need to see the crash folks are seeing. Does it only happen
> > when you have tcpdump attached to the interface running the
> > test? Does it happen only on the sender or the receiver
>
> It happens without tcpdump too.
Ditto.
> > in the netperf test? Those would be good clues, indicating
>
> The sender first gets slow, then eventually crashes.
I managed to get a tcpdump of one flow. I accidentally deleted it, but
from memory what I saw was: it ramped up to large virtual segments, then
got a partial ack for one (all but the last real segment -- a delayed ack
here), then later, after the full ack got back, sent one small segment.
This segment was acked, but it never seemed to recognize the ack and
repeatedly retransmitted the segment after timeouts. (That generated the
"retrans_out leaked" messages when the D-SACKs came back.)
I can't conveniently get a backtrace of the crash right now.
-John
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-29 21:51 ` John Heffner
@ 2004-09-29 21:52 ` David S. Miller
0 siblings, 0 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-29 21:52 UTC (permalink / raw)
To: John Heffner; +Cc: ak, niv, netdev
On Wed, 29 Sep 2004 17:51:22 -0400 (EDT)
John Heffner <jheffner@psc.edu> wrote:
> I managed to get a tcpdump of one flow. I accidentally deleted it, but
> from memory what I saw was: it ramped up to large virtual segments, then
> got a partial ack for one (all but the last real segment -- a delayed ack
> here), then later, after the full ack got back, sent one small segment.
> This segment was acked, but it never seemed to recognize the ack and
> repeatedly retransmitted the segment after timeouts. (That generated the
> "retrans_out leaked" messages when the D-SACKs came back.)
>
> I can't conveniently get a backtrace of the crash right now.
I can get one now; thanks for the trace description, it's a good clue.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-23 23:11 ` David S. Miller
2004-09-23 23:41 ` Herbert Xu
@ 2004-09-24 8:30 ` Andi Kleen
2004-09-27 22:38 ` John Heffner
2 siblings, 0 replies; 97+ messages in thread
From: Andi Kleen @ 2004-09-24 8:30 UTC (permalink / raw)
To: David S. Miller; +Cc: Andi Kleen, niv, andy.grover, anton, netdev
> Anyways, for testing, something like the patch below. If things
> still stink a bit, try using a limit of "2" in this patch instead
> of "4".
Doesn't help, unfortunately. Anything >= 2 gives 22MB/s or even less
(e.g. 3 gives 16MB/s). The only factor that works is 1, but that
effectively turns TSO off, doesn't it?
-Andi
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-23 23:11 ` David S. Miller
2004-09-23 23:41 ` Herbert Xu
2004-09-24 8:30 ` Andi Kleen
@ 2004-09-27 22:38 ` John Heffner
2004-09-27 23:04 ` David S. Miller
2004-09-28 7:23 ` Nivedita Singhvi
2 siblings, 2 replies; 97+ messages in thread
From: John Heffner @ 2004-09-27 22:38 UTC (permalink / raw)
To: David S. Miller; +Cc: Andi Kleen, niv, andy.grover, anton, netdev
On Thu, 23 Sep 2004, David S. Miller wrote:
> I think I know what may be going on here.
>
> Let's say that we even get the congestion window opened up
> so that we can build 64K TSO frames, that's around 43 or 44
> 1500 mtu frames.
>
> That means as the window fills up, we have to see 44 ACKs
> before we are able to send the next TSO frame. Needless to
> say that breaks ACK clocking completely.
More specifically, I think it is an interaction with delayed ack (acking
less than 1 virtual segment) and the small cwnd. This works for me, but
I'm not sure there aren't some lurking problems still.
===== net/ipv4/tcp_output.c 1.58 vs edited =====
--- 1.58/net/ipv4/tcp_output.c	2004-09-14 00:39:17 -04:00
+++ edited/net/ipv4/tcp_output.c	2004-09-27 18:26:43 -04:00
@@ -642,8 +657,8 @@
 	 * do not exceed congestion window.
 	 */
 	factor = large_mss / mss_now;
-	if (factor > tp->snd_cwnd)
-		factor = tp->snd_cwnd;
+	if (factor > (tp->snd_cwnd>>3))
+		factor = max(tp->snd_cwnd>>3, 1);
 	tp->mss_cache = mss_now * factor;
 }
-John
^ permalink raw reply [flat|nested] 97+ messages in thread
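To see what the clamp in the hunk above does numerically, a quick user-space rendering of the arithmetic (mss 1448 as in the traces; the max() is spelled out since this is plain C):

#include <stdio.h>

int main(void)
{
	unsigned int mss_now = 1448, large_mss = 64 * 1024;
	unsigned int cwnds[] = { 4, 16, 40, 400 };

	for (int i = 0; i < 4; i++) {
		unsigned int snd_cwnd = cwnds[i];
		unsigned int factor = large_mss / mss_now;	/* 45 */

		/* The snd_cwnd>>3 clamp from the patch, floor of 1. */
		if (factor > (snd_cwnd >> 3))
			factor = (snd_cwnd >> 3) > 1 ? (snd_cwnd >> 3) : 1;
		printf("cwnd %3u -> tso factor %2u (mss_cache %u)\n",
		       snd_cwnd, factor, mss_now * factor);
	}
	return 0;
}

So a 64K virtual segment only comes back once the congestion window has grown to several hundred real packets; small windows use near-normal segments, which is what keeps the ACK clock alive early in the connection.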
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-27 22:38 ` John Heffner
@ 2004-09-27 23:04 ` David S. Miller
2004-09-27 23:25 ` Andi Kleen
2004-09-27 23:36 ` Herbert Xu
1 sibling, 2 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-27 23:04 UTC (permalink / raw)
To: John Heffner; +Cc: ak, niv, andy.grover, anton, netdev
On Mon, 27 Sep 2004 18:38:42 -0400 (EDT)
John Heffner <jheffner@psc.edu> wrote:
> On Thu, 23 Sep 2004, David S. Miller wrote:
> > I think I know what may be going on here.
> >
> > Let's say that we even get the congestion window opened up
> > so that we can build 64K TSO frames, that's around 43 or 44
> > 1500 mtu frames.
> >
> > That means as the window fills up, we have to see 44 ACKs
> > before we are able to send the next TSO frame. Needless to
> > say that breaks ACK clocking completely.
>
> More specifically, I think it is an interaction with delayed ack
> (acking less than 1 virtual segment) and the small cwnd. This works
> for me, but I'm not sure there aren't some lurking problems still.
Yes, this is supposed to work around the problem, but:
1) It is a hack :-)
2) It doesn't help Andi's case, and I think I know why.
The reason Andi Kleen didn't see any improvements from limiting 'factor'
is that he is using short lived connections. If you have a connection up
for long enough, this allows the congestion window to grow and then it
doesn't matter.
Something like the following is what I have been talking about. I am
able to reproduce the problem here locally and the following makes it go
away.
Andi, Anton, and niv, can you confirm it does so for you too?
If tcp_clean_rtx_queue() doesn't return DATA acked, then no congestion
window growth is allowed to occur. So we only get a snd_cwnd bump once
for every tso_factor frames, and that stinks :)
This is not the final fix. I need to do something like record the
upper-most virtual ACK within a TSO frame so we don't say DATA acked for
dup-acks that happen to fall in the middle of a TSO frame.
===== net/ipv4/tcp_input.c 1.75 vs edited =====
--- 1.75/net/ipv4/tcp_input.c	2004-09-27 12:00:32 -07:00
+++ edited/net/ipv4/tcp_input.c	2004-09-27 15:35:12 -07:00
@@ -2373,8 +2373,12 @@
 	 * discard it as it's confirmed to have arrived at
 	 * the other end. Or...
 	 */
-	if (after(scb->end_seq, tp->snd_una))
+	if (after(scb->end_seq, tp->snd_una)) {
+		if (scb->tso_factor &&
+		    after(tp->snd_una, scb->seq))
+			acked |= FLAG_DATA_ACKED;
 		break;
+	}
 
 	/* We ACK each frame or... */
^ permalink raw reply [flat|nested] 97+ messages in thread
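For context, the dependency this patch works around, roughly as it looks in 2.6-era tcp_ack(); this is a simplified sketch with the dubious-ACK and error handling elided, not a verbatim excerpt:

/* Sketch of the cwnd-growth gate: snd_cwnd only grows on ACKs for
 * which tcp_clean_rtx_queue() reported acked data.
 */
static void tcp_ack_sketch(struct sock *sk, u32 ack, u32 prior_in_flight)
{
	struct tcp_opt *tp = tcp_sk(sk);
	__s32 seq_rtt;
	int flag = 0;

	/* Purge acked data. Unmodified, this sets FLAG_DATA_ACKED only
	 * when snd_una moves past the end_seq of a whole queued skb --
	 * for a 64K TSO frame, once per ~44 acks.
	 */
	flag |= tcp_clean_rtx_queue(sk, &seq_rtt);

	if ((flag & FLAG_DATA_ACKED) && prior_in_flight >= tp->snd_cwnd)
		tcp_cong_avoid(tp, ack, seq_rtt);	/* grow snd_cwnd */
}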
* Re: bad TSO performance in 2.6.9-rc2-BK 2004-09-27 23:04 ` David S. Miller @ 2004-09-27 23:25 ` Andi Kleen 2004-09-27 23:37 ` David S. Miller 2004-09-27 23:36 ` Herbert Xu 1 sibling, 1 reply; 97+ messages in thread From: Andi Kleen @ 2004-09-27 23:25 UTC (permalink / raw) To: David S. Miller; +Cc: John Heffner, ak, niv, andy.grover, anton, netdev > The reason Andi Kleen didn't see any improvements from limiting > 'factor' is that he is using short lived connections. If you The netperf test is about 10s - not really short lived and should be enough for slow start. > have a connection up for long enough, this allows the congestion > window to grow and then it doesn't matter. > > Something like the following is what I have been talking about. > I am able to reproduce the problem here locally and the following Cool. > makes it go away. > > Andi, Anton, and niv, can you confirm it does so for you too? Unfortunately not - with the patch applied I still get 27MB/s Looking at the tcpdump the ack clock goes completely out of sync, the ratio is 4 packets per ack. The sender seems to send packets faster than the target can ack them. It eventually stops for a short time until it gets the ack 01:16:26.034669 10.23.202.31.32777 > 10.23.202.10.34491: P 1983065:1984513(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.034713 10.23.202.10.34491 > 10.23.202.31.32777: . ack 1984513 win 63712 <nop,nop,timestamp 53583052 4294889148> (DF) 01:16:26.034902 10.23.202.31.32777 > 10.23.202.10.34491: . 1984513:1985961(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.034910 10.23.202.31.32777 > 10.23.202.10.34491: . 1985961:1987409(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.034916 10.23.202.31.32777 > 10.23.202.10.34491: . 1987409:1988857(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.034921 10.23.202.31.32777 > 10.23.202.10.34491: . 1988857:1990305(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.034926 10.23.202.31.32777 > 10.23.202.10.34491: . 1990305:1991753(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.034998 10.23.202.10.34491 > 10.23.202.31.32777: . ack 1991753 win 63712 <nop,nop,timestamp 53583052 4294889148> (DF) 01:16:26.035144 10.23.202.31.32777 > 10.23.202.10.34491: . 1991753:1993201(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.035152 10.23.202.31.32777 > 10.23.202.10.34491: . 1993201:1994649(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.035158 10.23.202.31.32777 > 10.23.202.10.34491: . 1994649:1996097(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.035163 10.23.202.31.32777 > 10.23.202.10.34491: . 1996097:1997545(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.035169 10.23.202.31.32777 > 10.23.202.10.34491: P 1997545:1998993(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.035214 10.23.202.10.34491 > 10.23.202.31.32777: . ack 1998993 win 63712 <nop,nop,timestamp 53583052 4294889148> (DF) 01:16:26.035393 10.23.202.31.32777 > 10.23.202.10.34491: . 1998993:2000441(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.035401 10.23.202.31.32777 > 10.23.202.10.34491: . 2000441:2001889(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) 01:16:26.035406 10.23.202.31.32777 > 10.23.202.10.34491: . 
2001889:2003337(1448) ack 1 win 1460 <nop,nop,timestamp 4294889148 53583052> (DF) -Andi ^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-27 23:25 ` Andi Kleen
@ 2004-09-27 23:37 ` David S. Miller
2004-09-27 23:51 ` Andi Kleen
0 siblings, 1 reply; 97+ messages in thread
From: David S. Miller @ 2004-09-27 23:37 UTC (permalink / raw)
To: Andi Kleen; +Cc: jheffner, ak, niv, andy.grover, anton, netdev
On Tue, 28 Sep 2004 01:25:55 +0200
Andi Kleen <ak@suse.de> wrote:
> Unfortunately not - with the patch applied I still get 27MB/s
And without TSO you get? How exactly are you running netperf
and are you going through a switch? I want to reproduce your
test case exactly here although using tg3 instead of e1000 :)
> Looking at the tcpdump the ack clock goes completely out of sync,
> the ratio is 4 packets per ack. The sender seems to send packets
> faster than the target can ack them. It eventually stops for a
> short time until it gets the ack
Hmmm, thanks for the trace.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-27 23:37 ` David S. Miller
@ 2004-09-27 23:51 ` Andi Kleen
2004-09-28 0:15 ` David S. Miller
0 siblings, 1 reply; 97+ messages in thread
From: Andi Kleen @ 2004-09-27 23:51 UTC (permalink / raw)
To: David S. Miller; +Cc: Andi Kleen, jheffner, niv, andy.grover, anton, netdev
On Mon, Sep 27, 2004 at 04:37:51PM -0700, David S. Miller wrote:
> On Tue, 28 Sep 2004 01:25:55 +0200
> Andi Kleen <ak@suse.de> wrote:
> > Unfortunately not - with the patch applied I still get 27MB/s
>
> And without TSO you get? How exactly are you running netperf
~66MB/s
> and are you going through a switch? I want to reproduce your
Yes, a buffalo gigabit switch.
> test case exactly here although using tg3 instead of e1000 :)
I'm running the test from an e1000 through the switch to a tg3 (the tg3
machine runs an older kernel). I also tested it now from a tg3 machine
to the other tg3 machine. That's 48MB/s (with your patch). The tg3
machine is slower though, because the tg3 sits in a 33MHz PCI slot, no
CSA etc.
From the tg3 machine with your patch the numbers are the same both with
TSO on and off. Looks like the e1000 is too fast...
-Andi
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-27 23:51 ` Andi Kleen
@ 2004-09-28 0:15 ` David S. Miller
0 siblings, 0 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-28 0:15 UTC (permalink / raw)
To: Andi Kleen; +Cc: ak, jheffner, niv, andy.grover, anton, netdev
On Tue, 28 Sep 2004 01:51:17 +0200
Andi Kleen <ak@suse.de> wrote:
> Looks like the e1000 is too fast...
Yes, the e1000 does TSO expansion much faster than the MIPS CPUs on the
tg3. I believe the e1000 implementation is in the ASIC, instead of being
implemented as firmware running on a general-purpose on-board processor.
In the 5705 and later revisions, the tg3 implements TSO in the ASIC as
well.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-27 23:04 ` David S. Miller
2004-09-27 23:25 ` Andi Kleen
@ 2004-09-27 23:36 ` Herbert Xu
2004-09-28 0:13 ` David S. Miller
1 sibling, 1 reply; 97+ messages in thread
From: Herbert Xu @ 2004-09-27 23:36 UTC (permalink / raw)
To: David S. Miller; +Cc: John Heffner, ak, niv, andy.grover, anton, netdev
On Mon, Sep 27, 2004 at 11:04:11PM +0000, David S. Miller wrote:
>
> If tcp_clean_rtx_queue() doesn't return DATA acked, then no congestion
> window growth is allowed to occur. So we only get a snd_cwnd bump once
> for every tso_factor frames, and that stinks :)
Yes, that'll do it :)
> ===== net/ipv4/tcp_input.c 1.75 vs edited =====
> --- 1.75/net/ipv4/tcp_input.c	2004-09-27 12:00:32 -07:00
> +++ edited/net/ipv4/tcp_input.c	2004-09-27 15:35:12 -07:00
> @@ -2373,8 +2373,12 @@
>  	 * discard it as it's confirmed to have arrived at
>  	 * the other end.
>  	 */
> -	if (after(scb->end_seq, tp->snd_una))
> +	if (after(scb->end_seq, tp->snd_una)) {
> +		if (scb->tso_factor &&
> +		    after(tp->snd_una, scb->seq))
> +			acked |= FLAG_DATA_ACKED;
>  		break;
I think you need to at least decrement packets_out here. Otherwise the
prior_in_flight >= tp->snd_cwnd check in tcp_ack() might become incorrect
for the next ack.
Even better, you could move the skb->data pointer forward and forget
about that segment altogether.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 97+ messages in thread
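What "move the skb->data pointer forward" amounts to, as a sketch; the helper name and shape here are illustrative only (the next patch in the thread does the real version, calling pskb_pull() once per acked sub-packet):

/* Illustrative only: discard n fully-acked sub-packets from the head
 * of a TSO skb so later code never sees them again. pskb_pull()
 * advances skb->data even when the data is paged, failing only on
 * allocation error. tcp_dec_pcount_explicit() as used elsewhere in
 * this thread's patches.
 */
static void tso_discard_acked_head(struct tcp_opt *tp, struct sk_buff *skb,
				   unsigned int packets_acked,
				   unsigned int mss)
{
	BUG_ON(pskb_pull(skb, packets_acked * mss) == NULL);
	TCP_SKB_CB(skb)->seq += packets_acked * mss;
	TCP_SKB_CB(skb)->tso_factor -= packets_acked;
	/* Keep in-flight accounting consistent for the next ACK. */
	tcp_dec_pcount_explicit(&tp->packets_out, packets_acked);
}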
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-27 23:36 ` Herbert Xu
@ 2004-09-28 0:13 ` David S. Miller
2004-09-28 0:34 ` Herbert Xu
2004-09-28 7:20 ` Nivedita Singhvi
0 siblings, 2 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-28 0:13 UTC (permalink / raw)
To: Herbert Xu; +Cc: jheffner, ak, niv, andy.grover, anton, netdev
On Tue, 28 Sep 2004 09:36:39 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:
> I think you need to at least decrement packets_out here. Otherwise the
> prior_in_flight >= tp->snd_cwnd check in tcp_ack() might become incorrect
> for the next ack.
>
> Even better, you could move the skb->data pointer forward and forget
> about that segment altogether.
Bright minds think alike. :-) We have to keep all the other packet counts
in sync as well.
Andi, others, forget my previous hack patch to tcp_clean_rtx_queue() and
give this more complete patch a try.
I'm getting really good results here on my tg3<-->tg3 setup using this
patch.
===== include/net/tcp.h 1.89 vs edited =====
--- 1.89/include/net/tcp.h	2004-09-27 11:57:52 -07:00
+++ edited/include/net/tcp.h	2004-09-27 15:56:24 -07:00
@@ -1180,7 +1180,8 @@
 	__u16 urg_ptr;	/* Valid w/URG flags is set. */
 	__u32 ack_seq;	/* Sequence number ACK'd */
-	__u32 tso_factor;
+	__u16 tso_factor;	/* If > 1, TSO frame */
+	__u16 tso_offset;	/* ACK within TSO frame */
 };
 #define TCP_SKB_CB(__skb) ((struct tcp_skb_cb *)&((__skb)->cb[0]))
===== net/ipv4/tcp_input.c 1.75 vs edited =====
--- 1.75/net/ipv4/tcp_input.c	2004-09-27 12:00:32 -07:00
+++ edited/net/ipv4/tcp_input.c	2004-09-27 16:36:29 -07:00
@@ -2355,6 +2355,60 @@
 	}
 }
 
+static int tcp_tso_acked(struct tcp_opt *tp, struct sk_buff *skb,
+			 __u32 now, __s32 *seq_rtt)
+{
+	struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
+	__u32 tso_seq = scb->seq + scb->tso_offset;
+	__u32 mss = tp->mss_cache_std;
+	__u32 snd_una = tp->snd_una;
+	int acked = 0;
+
+	/* If we get here, the whole TSO packet has not been
+	 * acked.
+	 */
+	BUG_ON(!after(scb->end_seq, snd_una) ||
+	       tso_seq == scb->end_seq);
+
+	while (!after(tso_seq + mss, snd_una)) {
+		__u8 sacked = scb->sacked;
+
+		tso_seq += mss;
+		acked |= FLAG_DATA_ACKED;
+		BUG_ON(pskb_pull(skb, mss) == NULL);
+		if (sacked) {
+			if (sacked & TCPCB_RETRANS) {
+				if(sacked & TCPCB_SACKED_RETRANS)
+					tcp_dec_pcount_explicit(&tp->retrans_out, 1);
+				acked |= FLAG_RETRANS_DATA_ACKED;
+				*seq_rtt = -1;
+			} else if (*seq_rtt < 0)
+				*seq_rtt = now - scb->when;
+			if (sacked & TCPCB_SACKED_ACKED)
+				tcp_dec_pcount_explicit(&tp->sacked_out, 1);
+			if (sacked & TCPCB_LOST)
+				tcp_dec_pcount_explicit(&tp->lost_out, 1);
+			/* We can only exit URG mode for full TSO packet ack,
+			 * by definition, so we need not do that check here.
+			 */
+		} else if (*seq_rtt < 0)
+			*seq_rtt = now - scb->when;
+
+		if (tcp_get_pcount(&tp->fackets_out))
+			tcp_dec_pcount_explicit(&tp->fackets_out, 1);
+		tcp_dec_pcount_explicit(&tp->packets_out, 1);
+		scb->tso_factor--;
+
+		BUG_ON(scb->tso_factor == 0);
+		BUG_ON(!before(tso_seq, scb->end_seq));
+	}
+
+	scb->tso_offset = (tso_seq - scb->seq);
+
+	return acked;
+}
+
 /* Remove acknowledged frames from the retransmission queue. */
 static int tcp_clean_rtx_queue(struct sock *sk, __s32 *seq_rtt_p)
 {
@@ -2373,8 +2427,12 @@
 	 * discard it as it's confirmed to have arrived at
 	 * the other end.
 	 */
-	if (after(scb->end_seq, tp->snd_una))
+	if (after(scb->end_seq, tp->snd_una)) {
+		if (scb->tso_factor)
+			acked |= tcp_tso_acked(tp, skb,
+					       now, &seq_rtt);
 		break;
+	}
 
 	/* Initial outgoing SYN's get put onto the write_queue
 	 * just like anything else we transmit. It is not
===== net/ipv4/tcp_output.c 1.59 vs edited =====
--- 1.59/net/ipv4/tcp_output.c	2004-09-27 11:57:52 -07:00
+++ edited/net/ipv4/tcp_output.c	2004-09-27 15:52:15 -07:00
@@ -436,6 +436,7 @@
 		factor /= mss_std;
 		TCP_SKB_CB(skb)->tso_factor = factor;
 	}
+	TCP_SKB_CB(skb)->tso_offset = 0;
 }
 
 /* Function to create two new TCP segments. Shrinks the given segment
@@ -1191,6 +1192,7 @@
 	TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_FIN);
 	TCP_SKB_CB(skb)->sacked = 0;
 	TCP_SKB_CB(skb)->tso_factor = 1;
+	TCP_SKB_CB(skb)->tso_offset = 1;
 
 	/* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
 	TCP_SKB_CB(skb)->seq = tp->write_seq;
@@ -1223,6 +1225,7 @@
 	TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_RST);
 	TCP_SKB_CB(skb)->sacked = 0;
 	TCP_SKB_CB(skb)->tso_factor = 1;
+	TCP_SKB_CB(skb)->tso_offset = 1;
 
 	/* Send it off. */
 	TCP_SKB_CB(skb)->seq = tcp_acceptable_seq(sk, tp);
@@ -1304,6 +1307,7 @@
 	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + 1;
 	TCP_SKB_CB(skb)->sacked = 0;
 	TCP_SKB_CB(skb)->tso_factor = 1;
+	TCP_SKB_CB(skb)->tso_offset = 1;
 	th->seq = htonl(TCP_SKB_CB(skb)->seq);
 	th->ack_seq = htonl(req->rcv_isn + 1);
 	if (req->rcv_wnd == 0) { /* ignored for retransmitted syns */
@@ -1406,6 +1410,7 @@
 	TCP_ECN_send_syn(sk, tp, buff);
 	TCP_SKB_CB(buff)->sacked = 0;
 	TCP_SKB_CB(buff)->tso_factor = 1;
+	TCP_SKB_CB(buff)->tso_offset = 1;
 	buff->csum = 0;
 	TCP_SKB_CB(buff)->seq = tp->write_seq++;
 	TCP_SKB_CB(buff)->end_seq = tp->write_seq;
@@ -1506,6 +1511,7 @@
 	TCP_SKB_CB(buff)->flags = TCPCB_FLAG_ACK;
 	TCP_SKB_CB(buff)->sacked = 0;
 	TCP_SKB_CB(buff)->tso_factor = 1;
+	TCP_SKB_CB(buff)->tso_offset = 1;
 
 	/* Send it off, this clears delayed acks for us. */
 	TCP_SKB_CB(buff)->seq = TCP_SKB_CB(buff)->end_seq = tcp_acceptable_seq(sk, tp);
@@ -1541,6 +1547,7 @@
 	TCP_SKB_CB(skb)->flags = TCPCB_FLAG_ACK;
 	TCP_SKB_CB(skb)->sacked = urgent;
 	TCP_SKB_CB(skb)->tso_factor = 1;
+	TCP_SKB_CB(skb)->tso_offset = 1;
 
 	/* Use a previous sequence. This should cause the other
 	 * end to send an ack. Don't queue or clone SKB, just
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28 0:13 ` David S. Miller
@ 2004-09-28 0:34 ` Herbert Xu
2004-09-28 4:59 ` David S. Miller
2004-09-28 7:20 ` Nivedita Singhvi
1 sibling, 1 reply; 97+ messages in thread
From: Herbert Xu @ 2004-09-28 0:34 UTC (permalink / raw)
To: David S. Miller; +Cc: jheffner, ak, niv, andy.grover, anton, netdev
On Mon, Sep 27, 2004 at 05:13:56PM -0700, David S. Miller wrote:
>
> I'm getting really good results here on my tg3<-->tg3 setup using this
> patch.
Yes, this looks very good. Just a few minor things below.
> ===== net/ipv4/tcp_input.c 1.75 vs edited =====
> --- 1.75/net/ipv4/tcp_input.c	2004-09-27 12:00:32 -07:00
> +++ edited/net/ipv4/tcp_input.c	2004-09-27 16:36:29 -07:00
> @@ -2355,6 +2355,60 @@
>  	}
>  }
>
> +static int tcp_tso_acked(struct tcp_opt *tp, struct sk_buff *skb,
> +			 __u32 now, __s32 *seq_rtt)
> +{
> +	struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
> +	__u32 tso_seq = scb->seq + scb->tso_offset;
> +	__u32 mss = tp->mss_cache_std;
In future we should probably record the MSS in the scb, just in case it
changes.
> @@ -2373,8 +2427,12 @@
>  	 * discard it as it's confirmed to have arrived at
>  	 * the other end.
>  	 */
> -	if (after(scb->end_seq, tp->snd_una))
> +	if (after(scb->end_seq, tp->snd_una)) {
> +		if (scb->tso_factor)
tso_factor > 1
> ===== net/ipv4/tcp_output.c 1.59 vs edited =====
> --- 1.59/net/ipv4/tcp_output.c	2004-09-27 11:57:52 -07:00
> +++ edited/net/ipv4/tcp_output.c	2004-09-27 15:52:15 -07:00
> @@ -1191,6 +1192,7 @@
>  	TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_FIN);
>  	TCP_SKB_CB(skb)->sacked = 0;
>  	TCP_SKB_CB(skb)->tso_factor = 1;
> +	TCP_SKB_CB(skb)->tso_offset = 1;
Is this a clever trick that I don't understand? :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28  0:34 ` Herbert Xu
@ 2004-09-28  4:59 ` David S. Miller
2004-09-28  5:15 ` Herbert Xu
2004-09-28  6:45 ` Nivedita Singhvi
0 siblings, 2 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-28 4:59 UTC (permalink / raw)
To: Herbert Xu; +Cc: jheffner, ak, niv, andy.grover, anton, netdev

On Tue, 28 Sep 2004 10:34:12 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Yes this looks very good.

It's got a nasty bug though: I'm not updating TCP_SKB_CB(skb)->seq,
so retransmits have corrupted sequence numbers, doh!  And hey, if I
update that then I do not need this tso_offset thingy.

> > +static int tcp_tso_acked(struct tcp_opt *tp, struct sk_buff *skb,
> > +			 __u32 now, __s32 *seq_rtt)
> > +{
> > +	struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
> > +	__u32 tso_seq = scb->seq + scb->tso_offset;
> > +	__u32 mss = tp->mss_cache_std;
>
> In future we should probably record the MSS in the scb just in case
> it changes.

Fixed in my updated patch below; the above bug made me consider this
exact issue.

> > -	if (after(scb->end_seq, tp->snd_una))
> > +	if (after(scb->end_seq, tp->snd_una)) {
> > +		if (scb->tso_factor)
>
> tso_factor > 1

Good catch, fixed below as well.

> >  	TCP_SKB_CB(skb)->tso_factor = 1;
> > +	TCP_SKB_CB(skb)->tso_offset = 1;
>
> Is this a clever trick that I don't understand? :)

What a dumb bug, also fixed below.

This should do it for RTX queue purging and congestion window growth.
Next I'll work on the other issues.  In particular:

1) Moving TSO mss calculations to tcp_current_mss() as you suggested.
   Also integrating the 'large' usage bug fixes you sent a patch for
   earlier.

   I'm making this thing non-inline too, it gets expanded a lot.

2) The tcp_init_metrics() consistency fix you made.

3) Considering limiting 'factor', as jheffner and I have suggested
   over the past few days.  It might not be necessary; I'll do some
   tests.

Note the trick below wrt. adjusting the SKB data area lazily.  I defer
it to tcp_retransmit_skb(), otherwise we copy the data area around in
the packet several times, and that would undo the gain of zerocopy +
TSO, wouldn't it? :-)

I mention that we might want to mess with skb->truesize and liberate
space from sk->sk_wmem_alloc... but on further reflection that might
not really gain us anything.  We'll make less work for the process by
waking it up only on full TSO packet liberation.

Please scream loudly if you can find a bug in this current patch.

===== include/net/tcp.h 1.89 vs edited =====
--- 1.89/include/net/tcp.h	2004-09-27 11:57:52 -07:00
+++ edited/include/net/tcp.h	2004-09-27 18:02:21 -07:00
@@ -1180,7 +1180,8 @@
 	__u16	urg_ptr;	/* Valid w/URG flags is set. */
 	__u32	ack_seq;	/* Sequence number ACK'd */

-	__u32	tso_factor;
+	__u16	tso_factor;	/* If > 1, TSO frame */
+	__u16	tso_mss;	/* MSS that FACTOR's in terms of */
 };

 #define TCP_SKB_CB(__skb)	((struct tcp_skb_cb *)&((__skb)->cb[0]))
===== net/ipv4/tcp_input.c 1.75 vs edited =====
--- 1.75/net/ipv4/tcp_input.c	2004-09-27 12:00:32 -07:00
+++ edited/net/ipv4/tcp_input.c	2004-09-27 21:27:08 -07:00
@@ -2355,6 +2355,86 @@
 	}
 }

+/* There is one downside to this scheme.  Although we keep the
+ * ACK clock ticking, adjusting packet counters and advancing
+ * congestion window, we do not liberate socket send buffer
+ * space.
+ *
+ * Mucking with skb->truesize and sk->sk_wmem_alloc et al.
+ * then making a write space wakeup callback is a possible
+ * future enhancement.  WARNING: it is not trivial to make.
+ */
+static int tcp_tso_acked(struct tcp_opt *tp, struct sk_buff *skb,
+			 __u32 now, __s32 *seq_rtt)
+{
+	struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
+	__u32 mss = scb->tso_mss;
+	__u32 snd_una = tp->snd_una;
+	__u32 seq = scb->seq;
+	__u32 packets_acked = 0;
+	int acked = 0;
+
+	/* If we get here, the whole TSO packet has not been
+	 * acked.
+	 */
+	BUG_ON(!after(scb->end_seq, snd_una));
+
+	while (!after(seq + mss, snd_una)) {
+		packets_acked++;
+		seq += mss;
+	}
+
+	if (packets_acked) {
+		__u8 sacked = scb->sacked;
+
+		/* We adjust scb->seq but we do not pskb_pull() the
+		 * SKB.  We let tcp_retransmit_skb() handle this case
+		 * by checking skb->len against the data sequence span.
+		 * This way, we avoid the pskb_pull() work unless we
+		 * actually need to retransmit the SKB.
+		 */
+		scb->seq = seq;
+
+		acked |= FLAG_DATA_ACKED;
+		if (sacked) {
+			if (sacked & TCPCB_RETRANS) {
+				if (sacked & TCPCB_SACKED_RETRANS)
+					tcp_dec_pcount_explicit(&tp->retrans_out,
+								packets_acked);
+				acked |= FLAG_RETRANS_DATA_ACKED;
+				*seq_rtt = -1;
+			} else if (*seq_rtt < 0)
+				*seq_rtt = now - scb->when;
+			if (sacked & TCPCB_SACKED_ACKED)
+				tcp_dec_pcount_explicit(&tp->sacked_out,
+							packets_acked);
+			if (sacked & TCPCB_LOST)
+				tcp_dec_pcount_explicit(&tp->lost_out,
+							packets_acked);
+			if (sacked & TCPCB_URG) {
+				if (tp->urg_mode &&
+				    !before(scb->seq, tp->snd_up))
+					tp->urg_mode = 0;
+			}
+		} else if (*seq_rtt < 0)
+			*seq_rtt = now - scb->when;
+
+		if (tcp_get_pcount(&tp->fackets_out)) {
+			__u32 dval = min(tcp_get_pcount(&tp->fackets_out),
+					 packets_acked);
+			tcp_dec_pcount_explicit(&tp->fackets_out, dval);
+		}
+		tcp_dec_pcount_explicit(&tp->packets_out, packets_acked);
+		scb->tso_factor -= packets_acked;
+
+		BUG_ON(scb->tso_factor == 0);
+		BUG_ON(!before(scb->seq, scb->end_seq));
+	}
+
+	return acked;
+}
+
+
 /* Remove acknowledged frames from the retransmission queue. */
 static int tcp_clean_rtx_queue(struct sock *sk, __s32 *seq_rtt_p)
 {
@@ -2373,8 +2453,12 @@
 		 * discard it as it's confirmed to have arrived at
 		 * the other end.
 		 */
-		if (after(scb->end_seq, tp->snd_una))
+		if (after(scb->end_seq, tp->snd_una)) {
+			if (scb->tso_factor > 1)
+				acked |= tcp_tso_acked(tp, skb,
+						       now, &seq_rtt);
 			break;
+		}

 		/* Initial outgoing SYN's get put onto the write_queue
 		 * just like anything else we transmit.  It is not
===== net/ipv4/tcp_output.c 1.59 vs edited =====
--- 1.59/net/ipv4/tcp_output.c	2004-09-27 11:57:52 -07:00
+++ edited/net/ipv4/tcp_output.c	2004-09-27 21:27:50 -07:00
@@ -436,6 +436,7 @@
 		factor /= mss_std;
 		TCP_SKB_CB(skb)->tso_factor = factor;
 	}
+	TCP_SKB_CB(skb)->tso_mss = mss_std;
 }

 /* Function to create two new TCP segments. Shrinks the given segment
@@ -552,7 +553,7 @@
 	return skb->tail;
 }

-static int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+static int __tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 {
 	if (skb_cloned(skb) &&
 	    pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
@@ -565,11 +566,20 @@
 		return -ENOMEM;
 	}

-	TCP_SKB_CB(skb)->seq += len;
 	skb->ip_summed = CHECKSUM_HW;
 	return 0;
 }

+static inline int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+{
+	int err = __tcp_trim_head(sk, skb, len);
+
+	if (!err)
+		TCP_SKB_CB(skb)->seq += len;
+
+	return err;
+}
+
 /* This function synchronize snd mss to current pmtu/exthdr set.

    tp->user_mss is mss set by user by TCP_MAXSEG. It does NOT counts
@@ -949,6 +959,7 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
 	unsigned int cur_mss = tcp_current_mss(sk, 0);
+	__u32 data_seq, data_end_seq;
 	int err;

 	/* Do not sent more than we queued. 1/4 is reserved for possible
@@ -958,6 +969,22 @@
 	    min(sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf))
 		return -EAGAIN;

+	/* What is going on here?  When TSO packets are partially ACK'd,
+	 * we adjust the TCP_SKB_CB(skb)->seq value forward but we do
+	 * not adjust the data area of the SKB.  We defer that to here
+	 * so that we can avoid the work unless we really retransmit
+	 * the packet.
+	 */
+	data_seq = TCP_SKB_CB(skb)->seq;
+	data_end_seq = TCP_SKB_CB(skb)->end_seq;
+	if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
+		data_end_seq--;
+
+	if (skb->len != (data_end_seq - data_seq)) {
+		if (__tcp_trim_head(sk, skb, data_end_seq - data_seq))
+			return -ENOMEM;
+	}
+
 	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
 		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
 			BUG();
@@ -1191,6 +1218,7 @@
 	TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_FIN);
 	TCP_SKB_CB(skb)->sacked = 0;
 	TCP_SKB_CB(skb)->tso_factor = 1;
+	TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;

 	/* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
 	TCP_SKB_CB(skb)->seq = tp->write_seq;
@@ -1223,6 +1251,7 @@
 	TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_RST);
 	TCP_SKB_CB(skb)->sacked = 0;
 	TCP_SKB_CB(skb)->tso_factor = 1;
+	TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;

 	/* Send it off. */
 	TCP_SKB_CB(skb)->seq = tcp_acceptable_seq(sk, tp);
@@ -1304,6 +1333,7 @@
 	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + 1;
 	TCP_SKB_CB(skb)->sacked = 0;
 	TCP_SKB_CB(skb)->tso_factor = 1;
+	TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
 	th->seq = htonl(TCP_SKB_CB(skb)->seq);
 	th->ack_seq = htonl(req->rcv_isn + 1);
 	if (req->rcv_wnd == 0) { /* ignored for retransmitted syns */
@@ -1406,6 +1436,7 @@
 	TCP_ECN_send_syn(sk, tp, buff);
 	TCP_SKB_CB(buff)->sacked = 0;
 	TCP_SKB_CB(buff)->tso_factor = 1;
+	TCP_SKB_CB(buff)->tso_mss = tp->mss_cache_std;
 	buff->csum = 0;
 	TCP_SKB_CB(buff)->seq = tp->write_seq++;
 	TCP_SKB_CB(buff)->end_seq = tp->write_seq;
@@ -1506,6 +1537,7 @@
 	TCP_SKB_CB(buff)->flags = TCPCB_FLAG_ACK;
 	TCP_SKB_CB(buff)->sacked = 0;
 	TCP_SKB_CB(buff)->tso_factor = 1;
+	TCP_SKB_CB(buff)->tso_mss = tp->mss_cache_std;

 	/* Send it off, this clears delayed acks for us. */
 	TCP_SKB_CB(buff)->seq = TCP_SKB_CB(buff)->end_seq = tcp_acceptable_seq(sk, tp);
@@ -1541,6 +1573,7 @@
 	TCP_SKB_CB(skb)->flags = TCPCB_FLAG_ACK;
 	TCP_SKB_CB(skb)->sacked = urgent;
 	TCP_SKB_CB(skb)->tso_factor = 1;
+	TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;

 	/* Use a previous sequence.  This should cause the other
 	 * end to send an ack.  Don't queue or clone SKB, just

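A worked example of the accounting in tcp_tso_acked() above, with
hypothetical numbers: suppose scb->seq = 1000, tso_mss = 1448 and
tso_factor = 44 (so end_seq = 1000 + 44*1448), and an ACK arrives
carrying snd_una = 1000 + 10*1448 + 300.  The while loop counts
packets_acked = 10: ten full MSS chunks sit at or below snd_una, and
the 300-byte partial eleventh chunk does not count.  scb->seq advances
to 1000 + 10*1448, tso_factor drops to 34, and packets_out and the
other pcounts are decremented by 10.  The SKB's data area is
deliberately left untouched; only if the remainder is later
retransmitted does tcp_retransmit_skb() trim the acked bytes off the
head.
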
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28  4:59 ` David S. Miller
@ 2004-09-28  5:15 ` Herbert Xu
2004-09-28  5:58 ` David S. Miller
2004-09-28  6:45 ` Nivedita Singhvi
1 sibling, 1 reply; 97+ messages in thread
From: Herbert Xu @ 2004-09-28 5:15 UTC (permalink / raw)
To: David S. Miller; +Cc: jheffner, ak, niv, andy.grover, anton, netdev

On Mon, Sep 27, 2004 at 09:59:01PM -0700, David S. Miller wrote:
>
> It's got a nasty bug though: I'm not updating TCP_SKB_CB(skb)->seq,
> so retransmits have corrupted sequence numbers, doh!  And hey, if I
> update that then I do not need this tso_offset thingy.

Nice work.

> +	if (skb->len != (data_end_seq - data_seq)) {

Please make that > so that I can sleep at night :)

> +		if (__tcp_trim_head(sk, skb, data_end_seq - data_seq))

The argument to __tcp_trim_head should be

	skb->len - (data_end_seq - data_seq)

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28  5:15 ` Herbert Xu
@ 2004-09-28  5:58 ` David S. Miller
0 siblings, 0 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-28 5:58 UTC (permalink / raw)
To: Herbert Xu; +Cc: jheffner, ak, niv, andy.grover, anton, netdev

On Tue, 28 Sep 2004 15:15:39 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> > +	if (skb->len != (data_end_seq - data_seq)) {
>
> Please make that > so that I can sleep at night :)
>
> > +		if (__tcp_trim_head(sk, skb, data_end_seq - data_seq))
>
> The argument to __tcp_trim_head should be
>
> 	skb->len - (data_end_seq - data_seq)

Good catch, fixed as follows:

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/09/27 22:37:27-07:00 davem@nuts.davemloft.net
#   [TCP]: Fix third arg to __tcp_trim_head().
#
#   Noted by Herbert Xu <herbert@gondor.apana.org.au>
#
#   Signed-off-by: David S. Miller <davem@davemloft.net>
#
# net/ipv4/tcp_output.c
#   2004/09/27 22:36:41-07:00 davem@nuts.davemloft.net +4 -2
#   [TCP]: Fix third arg to __tcp_trim_head().
#
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2004-09-27 22:37:55 -07:00
+++ b/net/ipv4/tcp_output.c	2004-09-27 22:37:55 -07:00
@@ -980,8 +980,10 @@
 	if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
 		data_end_seq--;

-	if (skb->len != (data_end_seq - data_seq)) {
-		if (__tcp_trim_head(sk, skb, data_end_seq - data_seq))
+	if (skb->len > (data_end_seq - data_seq)) {
+		u32 to_trim = skb->len - (data_end_seq - data_seq);
+
+		if (__tcp_trim_head(sk, skb, to_trim))
 			return -ENOMEM;
 	}

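To make the fix concrete with hypothetical numbers: if a 44-segment
TSO frame (skb->len = 44 * 1448 = 63712 bytes) has had its first 10
segments acked, the remaining sequence span data_end_seq - data_seq is
34 * 1448 = 49232 bytes, so to_trim = 63712 - 49232 = 14480 bytes,
exactly the 10 acked segments that must come off the head before
retransmitting.  The earlier version passed the 49232-byte span itself
to __tcp_trim_head(), which would have trimmed away the very data that
still needs to go out.
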
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28  4:59 ` David S. Miller
2004-09-28  5:15 ` Herbert Xu
@ 2004-09-28  6:45 ` Nivedita Singhvi
1 sibling, 0 replies; 97+ messages in thread
From: Nivedita Singhvi @ 2004-09-28 6:45 UTC (permalink / raw)
To: David S. Miller; +Cc: Herbert Xu, jheffner, ak, andy.grover, anton, netdev

David S. Miller wrote:

> It's got a nasty bug though: I'm not updating TCP_SKB_CB(skb)->seq,
> so retransmits have corrupted sequence numbers, doh!  And hey, if I
> update that then I do not need this tso_offset thingy.

Hmm, I got sidetracked by seeing corrupted sequence numbers in my
tcpdump traces from before, although that shouldn't have been the case
(sorry, I don't have access to the boxes at the moment; I'll post the
traces tomorrow).  I was seeing the sequence numbers go backwards
(simple scp testing, bk10).

I'll redo the testing in the morning on the latest patches.

thanks,
Nivedita

* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28  0:13 ` David S. Miller
2004-09-28  0:34 ` Herbert Xu
@ 2004-09-28  7:20 ` Nivedita Singhvi
2004-09-28 20:38 ` David S. Miller
1 sibling, 1 reply; 97+ messages in thread
From: Nivedita Singhvi @ 2004-09-28 7:20 UTC (permalink / raw)
To: David S. Miller; +Cc: Herbert Xu, jheffner, ak, andy.grover, anton, netdev

David S. Miller wrote:

> Bright minds think alike. :-)  We have to keep all the other
> packet counts in sync as well.
>
> Andi, others, forget my previous hack patch to tcp_clean_rtx_queue()
> and give this more complete patch a try.
>
> I'm getting really good results here on my tg3<-->tg3 setup using
> this patch.

Dave, were you seeing a significant number of retransmissions
and sacks in your tests?

thanks,
Nivedita

* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28  7:20 ` Nivedita Singhvi
@ 2004-09-28 20:38 ` David S. Miller
0 siblings, 0 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-28 20:38 UTC (permalink / raw)
To: Nivedita Singhvi; +Cc: herbert, jheffner, ak, andy.grover, anton, netdev

On Tue, 28 Sep 2004 00:20:50 -0700
Nivedita Singhvi <niv@us.ibm.com> wrote:

> David S. Miller wrote:
>
> > Bright minds think alike. :-)  We have to keep all the other
> > packet counts in sync as well.
> >
> > Andi, others, forget my previous hack patch to tcp_clean_rtx_queue()
> > and give this more complete patch a try.
> >
> > I'm getting really good results here on my tg3<-->tg3 setup using
> > this patch.
>
> Dave, were you seeing a significant number of retransmissions
> and sacks in your tests?

None.  I am working on a local network through a gigabit switch.

I will work on making sure cases involving loss work correctly, via
the netem module, before I submit these fixes upstream to Linus.
Likely I will complete this work today.

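For anyone reproducing the loss tests, a netem setup looks something
like this (illustrative only; exact capabilities depend on the kernel
and iproute2 versions in use):

	# add 1% random packet loss on the outgoing interface
	tc qdisc add dev eth0 root netem loss 1%

	# restore normal queueing when done
	tc qdisc del dev eth0 root
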
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-27 22:38 ` John Heffner
2004-09-27 23:04 ` David S. Miller
@ 2004-09-28  7:23 ` Nivedita Singhvi
2004-09-28  8:23 ` Herbert Xu
2004-09-28 12:53 ` John Heffner
1 sibling, 2 replies; 97+ messages in thread
From: Nivedita Singhvi @ 2004-09-28 7:23 UTC (permalink / raw)
To: John Heffner; +Cc: David S. Miller, Andi Kleen, andy.grover, anton, netdev

John Heffner wrote:
> On Thu, 23 Sep 2004, David S. Miller wrote:
>
> > I think I know what may be going on here.
> >
> > Let's say that we even get the congestion window opened up
> > so that we can build 64K TSO frames, that's around 43 or 44
> > 1500 mtu frames.
> >
> > That means as the window fills up, we have to see 44 ACKs
> > before we are able to send the next TSO frame.  Needless to
> > say that breaks ACK clocking completely.
>
> More specifically, I think it is an interaction with delayed ack
> (acking less than 1 virtual segment), and the small cwnd.  This works
> for me, but I'm not sure there aren't some lurking problems still.

In terms of what goes out over the wire from the sender, there is (or
should be) no difference between the TSO and non-TSO case.  The
sequence of regular-sized packets should be the same; at most, the
delays between frames might differ.

So the sequence of acks coming back from the receiver should be the
same in the TSO and non-TSO cases.  If we've sent out, say, 44
1500-MTU frames, we should see roughly 22 acks back (acking every
second packet, if delayed acks are on) in both cases.

In terms of overall throughput, assuming we were doing no work other
than this connection, we would see a gain in the TSO case only if, by
the time the congestion window had opened fully for us to send another
virtual-MTU frame, the application had written another frame's worth
of data (minus the extra delta that driver handoff and send would take
at that point).

In the non-TSO case, the finer granularity helps us utilize the
channel more efficiently (although not the path down the stack, or the
CPU), though I think that is just another way of saying that ack
clocking is bumpy.

But I guess my question is: don't we need some heuristics to figure
out when we should send a partial (i.e. abandon waiting for a full TSO
frame)?

thanks,
Nivedita

* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28  7:23 ` Nivedita Singhvi
@ 2004-09-28  8:23 ` Herbert Xu
2004-09-28 12:53 ` John Heffner
0 siblings, 0 replies; 97+ messages in thread
From: Herbert Xu @ 2004-09-28 8:23 UTC (permalink / raw)
To: Nivedita Singhvi; +Cc: jheffner, davem, ak, andy.grover, anton, netdev

Nivedita Singhvi <niv@us.ibm.com> wrote:
>
> But I guess my question is: don't we need some heuristics to figure
> out when we should send a partial (i.e. abandon waiting for a full
> TSO frame)?

Currently we're relying on Nagle to do that.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

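For readers following along, the rule Herbert refers to behaves
roughly as sketched below.  This is a simplification for illustration
(the function name and parameters are made up, and the kernel's real
test also handles TCP_CORK, urgent data and FIN):

static int nagle_allows_send(unsigned int len, unsigned int mss,
			     unsigned int packets_in_flight,
			     int nodelay)
{
	if (len >= mss)
		return 1;	/* full-sized segment: never held back */
	if (nodelay)
		return 1;	/* TCP_NODELAY: send partials at once */
	/* sub-MSS tail: send only once the pipe has drained */
	return packets_in_flight == 0;
}

In other words, a partial TSO frame is not held forever: once
everything in flight has been acked, Nagle lets the tail go out.
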
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-28  7:23 ` Nivedita Singhvi
2004-09-28  8:23 ` Herbert Xu
@ 2004-09-28 12:53 ` John Heffner
1 sibling, 0 replies; 97+ messages in thread
From: John Heffner @ 2004-09-28 12:53 UTC (permalink / raw)
To: Nivedita Singhvi; +Cc: David S. Miller, Andi Kleen, andy.grover, anton, netdev

On Tue, 28 Sep 2004, Nivedita Singhvi wrote:

> John Heffner wrote:
> >
> > More specifically, I think it is an interaction with delayed ack
> > (acking less than 1 virtual segment), and the small cwnd.  This
> > works for me, but I'm not sure there aren't some lurking problems
> > still.
>
> In terms of what goes out over the wire from the sender, there is (or
> should be) no difference between the TSO and non-TSO case.  The
> sequence of regular-sized packets should be the same; at most, the
> delays between frames might differ.
>
> So the sequence of acks coming back from the receiver should be the
> same in the TSO and non-TSO cases.  If we've sent out, say, 44
> 1500-MTU frames, we should see roughly 22 acks back (acking every
> second packet, if delayed acks are on) in both cases.

I was referring to a problem I saw that had really terrible
performance (around 1 Mbit).  The sender would send out one virtual
segment, and all but the last of its real segments would be acked.
The receiver then waits for the delayed ack timer to go off before
acking the last segment, and the sender waits for that last segment
to be acked before sending out the next virtual segment, since the
cwnd is equal to 1 virtual segment.

Dave's patch seems to correct this problem for me, but I'm not
convinced this state could never occur.

  -John

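A back-of-envelope bound on the stall John describes, with
illustrative numbers (Linux's delayed-ACK timeout varies, roughly
between 40 ms and 200 ms): if cwnd admits exactly one 64 KB virtual
segment, and each burst ends with one real segment that sits out the
delayed-ACK timer, the connection moves about one 64 KB burst per
timeout:

	64 KB / 200 ms  ~  320 KB/s  (~2.6 Mbit/s)
	64 KB /  40 ms  ~  1.6 MB/s  (~13 Mbit/s)

Either way, that is orders of magnitude below what the hardware can
carry.
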
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-09-22 19:55 ` Andi Kleen
2004-09-22 20:07 ` Nivedita Singhvi
2004-09-22 20:12 ` Andrew Grover
@ 2004-09-22 20:28 ` David S. Miller
2 siblings, 0 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-22 20:28 UTC (permalink / raw)
To: Andi Kleen; +Cc: ak, anton, netdev

On Wed, 22 Sep 2004 21:55:15 +0200
Andi Kleen <ak@suse.de> wrote:

> I must add that this is a CSA e1000 (directly integrated into
> the chipset and doesn't use PCI) and TSO doesn't seem to bring
> any advantage.  On 2.6.5 the performance is the same with TSO
> either on or off.

That's expected.  TSO decreases the cpu and bus load, but if those are
already capable of running the card at full tilt, TSO buys you nothing
except spare cpu and memory cycles for other work on your system.

[parent not found: <Pine.NEB.4.33.0409301625560.13549-100000@dexter.psc.edu>]
* Re: bad TSO performance in 2.6.9-rc2-BK
[not found] <Pine.NEB.4.33.0409301625560.13549-100000@dexter.psc.edu>
@ 2004-10-02  1:32 ` John Heffner
2004-10-04 20:07 ` David S. Miller
0 siblings, 1 reply; 97+ messages in thread
From: John Heffner @ 2004-10-02 1:32 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev

> On Thu, 30 Sep 2004, David S. Miller wrote:
>
> > Please help debug this John since I'm not seeing this here.

Here's what I've observed so far: on one machine I get interface hangs
after a non-deterministic amount of time (usually within a few
seconds) with TSO on, if I'm sending at a reasonably high rate with a
1500-byte MTU.  I get no hangs with a 9000-byte MTU, or at a low rate
(sending to a FastE host).  Interestingly, if the interface MTU is
9000 but I'm sending 1500-byte packets, it does not seem to hang.  On
a different machine (both are P4's with e1000's) I get no hangs.

The hangs happen even with 2.6.8.1, before the recent TSO changes, so
it's not a bug introduced there.  I looked at some tcpdumps and saw no
strangeness in the last segments sent.

The good machine has an e7500 chipset and an 82544EI.  The bad machine
has a ServerWorks chipset and an 82545GM.

Any suggestions on where to look for what's going wrong with the
interface?

Thanks,
  -John

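A practical note for anyone bisecting hangs like this: TSO can be
toggled per interface at runtime, which makes it easy to confirm the
offload is the trigger.  Illustrative commands (the -K offload
interface is standard ethtool; per-device support varies):

	ethtool -k eth0          # show current offload settings
	ethtool -K eth0 tso off  # disable TSO
	ethtool -K eth0 tso on   # re-enable TSO
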
* Re: bad TSO performance in 2.6.9-rc2-BK
2004-10-02  1:32 ` John Heffner
@ 2004-10-04 20:07 ` David S. Miller
0 siblings, 0 replies; 97+ messages in thread
From: David S. Miller @ 2004-10-04 20:07 UTC (permalink / raw)
To: John Heffner; +Cc: netdev, jgarzik

On Fri, 1 Oct 2004 21:32:25 -0400 (EDT)
John Heffner <jheffner@psc.edu> wrote:

> The good machine has an e7500 chipset and an 82544EI.  The bad
> machine has a ServerWorks chipset and an 82545GM.
>
> Any suggestions on where to look for what's going wrong with the
> interface?

Time to consult the e1000 driver maintainers, probably.  There are
three folks mentioned in linux/MAINTAINERS as contacts.

Thread overview: 97+ messages
2004-09-20 6:30 bad TSO performance in 2.6.9-rc2-BK Anton Blanchard
2004-09-20 15:54 ` Nivedita Singhvi
2004-09-21 15:55 ` Anton Blanchard
2004-09-20 20:30 ` Andi Kleen
2004-09-21 22:58 ` David S. Miller
2004-09-22 14:00 ` Andi Kleen
2004-09-22 18:12 ` David S. Miller
2004-09-22 19:55 ` Andi Kleen
2004-09-22 20:07 ` Nivedita Singhvi
2004-09-22 20:30 ` David S. Miller
2004-09-22 20:56 ` Nivedita Singhvi
2004-09-22 21:56 ` Andi Kleen
2004-09-22 22:04 ` David S. Miller
2004-09-22 20:12 ` Andrew Grover
2004-09-22 20:39 ` David S. Miller
2004-09-22 22:06 ` Andi Kleen
2004-09-22 22:25 ` David S. Miller
2004-09-22 22:47 ` Andi Kleen
2004-09-22 22:50 ` David S. Miller
2004-09-23 23:11 ` David S. Miller
2004-09-23 23:41 ` Herbert Xu
2004-09-23 23:41 ` David S. Miller
2004-09-24 0:12 ` Herbert Xu
2004-09-24 0:40 ` Herbert Xu
2004-09-24 1:07 ` Herbert Xu
2004-09-24 1:17 ` David S. Miller
2004-09-27 1:27 ` Herbert Xu
2004-09-27 2:50 ` Herbert Xu
2004-09-27 4:00 ` David S. Miller
2004-09-27 5:45 ` Herbert Xu
2004-09-27 19:01 ` David S. Miller
2004-09-27 21:32 ` Herbert Xu
2004-09-28 21:10 ` David S. Miller
2004-09-28 21:34 ` Andi Kleen
2004-09-28 21:53 ` David S. Miller
2004-09-28 22:33 ` Andi Kleen
2004-09-28 22:57 ` David S. Miller
2004-09-28 23:27 ` Andi Kleen
2004-09-28 23:35 ` David S. Miller
2004-09-28 23:55 ` Andi Kleen
2004-09-29 0:04 ` David S. Miller
2004-09-29 20:58 ` John Heffner
2004-09-29 21:10 ` Nivedita Singhvi
2004-09-29 21:50 ` David S. Miller
2004-09-29 21:56 ` Andi Kleen
2004-09-29 23:29 ` David S. Miller
2004-09-29 23:51 ` John Heffner
2004-09-30 0:03 ` David S. Miller
2004-09-30 0:10 ` Herbert Xu
2004-10-01 0:34 ` David S. Miller
2004-10-01 1:12 ` David S. Miller
2004-10-01 3:40 ` David S. Miller
2004-10-01 10:35 ` Andi Kleen
2004-10-01 10:23 ` Andi Kleen
2004-09-30 0:10 ` John Heffner
2004-09-30 17:25 ` John Heffner
2004-09-30 20:23 ` David S. Miller
2004-09-30 0:05 ` Herbert Xu
2004-09-30 4:33 ` David S. Miller
2004-09-30 5:47 ` Herbert Xu
2004-09-30 7:39 ` David S. Miller
2004-09-30 8:09 ` Herbert Xu
2004-09-30 9:29 ` Andi Kleen
2004-09-30 20:20 ` David S. Miller
2004-09-29 3:27 ` John Heffner
2004-09-29 9:01 ` Andi Kleen
2004-09-29 19:56 ` David S. Miller
2004-09-29 20:56 ` Andi Kleen
2004-09-29 21:17 ` David S. Miller
2004-09-29 21:00 ` David S. Miller
2004-09-29 21:16 ` Nivedita Singhvi
2004-09-29 21:22 ` David S. Miller
2004-09-29 21:43 ` Andi Kleen
2004-09-29 21:51 ` John Heffner
2004-09-29 21:52 ` David S. Miller
2004-09-24 8:30 ` Andi Kleen
2004-09-27 22:38 ` John Heffner
2004-09-27 23:04 ` David S. Miller
2004-09-27 23:25 ` Andi Kleen
2004-09-27 23:37 ` David S. Miller
2004-09-27 23:51 ` Andi Kleen
2004-09-28 0:15 ` David S. Miller
2004-09-27 23:36 ` Herbert Xu
2004-09-28 0:13 ` David S. Miller
2004-09-28 0:34 ` Herbert Xu
2004-09-28 4:59 ` David S. Miller
2004-09-28 5:15 ` Herbert Xu
2004-09-28 5:58 ` David S. Miller
2004-09-28 6:45 ` Nivedita Singhvi
2004-09-28 7:20 ` Nivedita Singhvi
2004-09-28 20:38 ` David S. Miller
2004-09-28 7:23 ` Nivedita Singhvi
2004-09-28 8:23 ` Herbert Xu
2004-09-28 12:53 ` John Heffner
2004-09-22 20:28 ` David S. Miller
[not found] <Pine.NEB.4.33.0409301625560.13549-100000@dexter.psc.edu>
2004-10-02 1:32 ` John Heffner
2004-10-04 20:07 ` David S. Miller