Re: 2.6.10 TCP troubles -- suggested patch

* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-11 21:55 Hubert Tonneau
  2005-02-11 22:54 ` Rick Jones
  2005-02-11 23:04 ` Stephen Hemminger
  0 siblings, 2 replies; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-11 21:55 UTC (permalink / raw)
  To: David S. Miller
  Cc: shemminger, romieu, kuznet, Nivedita Singhvi, Rick Jones, netdev

Sorry, it still does not work, unless I made a mistake:
Linux 2.6.9 takes 15 seconds to copy 105 MB to Mac OSX
Linux 2.6.10 with the TCP patch below still takes 325 seconds to do the same.
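
For scale, here is a minimal sketch of the average throughput those two timings imply. It assumes "105 MB" means 105 * 10**6 bytes (the exact file size is not given); the function name is illustrative only.

```python
# Average throughput implied by the two copy times reported above.
# Assumption: "105 MB" is taken as 105 * 10**6 bytes.

def throughput_mbit_per_s(size_bytes: int, seconds: float) -> float:
    """Average goodput in megabits per second."""
    return size_bytes * 8 / seconds / 1e6

SIZE_BYTES = 105 * 10**6

fast = throughput_mbit_per_s(SIZE_BYTES, 15)    # Linux 2.6.9
slow = throughput_mbit_per_s(SIZE_BYTES, 325)   # Linux 2.6.10 with the patch

print(f"2.6.9:    {fast:.1f} Mbit/s")
print(f"2.6.10:   {slow:.2f} Mbit/s")
print(f"slowdown: {fast / slow:.1f}x")
```

Under that assumption the fast case runs at about 56 Mbit/s, a plausible rate through the 100 Mbps switch, while the slow case averages about 2.6 Mbit/s, roughly 22 times slower.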

You can pick the new tcpdump report, created through:
tcpdump -i eth1 ip host 10.107.96.230 -w /tmp/dump-2.6.10-tcp2
at http://fullpliant.org/pliant/browse/file/archive/dump-2.6.10-tcp2.gz

Here is the connection summary:

Dell PowerEdge 2600 (dual Xeon with hyper threading) running libsmbclient
on Linux 2.6.x, IP for eth1 (Intel pro 1000) is 10.107.96.7 (full
duplex, flow control is enabled)
     |
     |
gigabit switch
     |
     |
100 Mbps switch
     |
     |
Mac running Samba server on OSX,
IP is 10.107.96.230


David S. Miller wrote:
>
> Hubert, try this patch instead.
> 
> ===== net/ipv4/tcp_output.c 1.77 vs edited =====
> --- 1.77/net/ipv4/tcp_output.c	2005-01-18 12:23:36 -08:00
> +++ edited/net/ipv4/tcp_output.c	2005-02-10 16:42:42 -08:00
> @@ -408,6 +408,16 @@
>  		sk->sk_send_head = skb;
>  }
>  
> +static inline void tcp_tso_set_push(struct sk_buff *skb)
> +{
> +	/* Force push to be on for any TSO frames to work around
> +	 * problems with busted implementations like Mac OS-X that
> +	 * hold off socket receive wakeups until push is seen.
> +	 */
> +	if (tcp_skb_pcount(skb) > 1)
> +		TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
> +}
> +
>  /* Send _single_ skb sitting at the send head. This function requires
>   * true push pending frames to setup probe timer etc.
>   */
> @@ -419,6 +429,7 @@
>  	if (tcp_snd_test(tp, skb, cur_mss, TCP_NAGLE_PUSH)) {
>  		/* Send it out now. */
>  		TCP_SKB_CB(skb)->when = tcp_time_stamp;
> +		tcp_tso_set_push(skb);
>  		if (!tcp_transmit_skb(sk, skb_clone(skb, sk->sk_allocation))) {
>  			sk->sk_send_head = NULL;
>  			tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
> @@ -755,6 +766,7 @@
>  			}
>  
>  			TCP_SKB_CB(skb)->when = tcp_time_stamp;
> +			tcp_tso_set_push(skb);
>  			if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
>  				break;
>  
> @@ -1096,6 +1108,7 @@
>  	 * is still in somebody's hands, else make a clone.
>  	 */
>  	TCP_SKB_CB(skb)->when = tcp_time_stamp;
> +	tcp_tso_set_push(skb);
>  
>  	err = tcp_transmit_skb(sk, (skb_cloned(skb) ?
>  				    pskb_copy(skb, GFP_ATOMIC):
> @@ -1668,6 +1681,7 @@
>  
>  			TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
>  			TCP_SKB_CB(skb)->when = tcp_time_stamp;
> +			tcp_tso_set_push(skb);
>  			err = tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC));
>  			if (!err) {
>  				update_send_head(sk, tp, skb);

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-11 21:55 2.6.10 TCP troubles -- suggested patch Hubert Tonneau
@ 2005-02-11 22:54 ` Rick Jones
  2005-02-11 23:09   ` Nivedita Singhvi
  2005-02-12  1:09   ` David S. Miller
  2005-02-11 23:04 ` Stephen Hemminger
  1 sibling, 2 replies; 40+ messages in thread
From: Rick Jones @ 2005-02-11 22:54 UTC (permalink / raw)
  To: Hubert Tonneau; +Cc: David S. Miller, shemminger, romieu, kuznet, netdev

Hubert Tonneau wrote:
> Sorry, it still does not work, unless I made a mistake:
> Linux 2.6.9 takes 15 seconds to copy 105 MB to Mac OSX
> Linux 2.6.10 with the TCP patch below still takes 325 seconds to do the same.
> 
> You can pick the new tcpdump report, created through:
> tcpdump -i eth1 ip host 10.107.96.230 -w /tmp/dump-2.6.10-tcp2
> at http://fullpliant.org/pliant/browse/file/archive/dump-2.6.10-tcp2.gz
> 
> Here is the connection summary:
> 
> Dell PowerEdge 2600 (dual Xeon with hyper threading) running libsmbclient
> on Linux 2.6.x, IP for eth1 (Intel pro 1000) is 10.107.96.7 (full
> duplex, flow control is enabled)
>      |
>      |
> gigabit switch
>      |
>      |
> 100 Mbps switch
>      |
>      |
> Mac running Samba server on OSX,
> IP is 10.107.96.230

"Cooking" the trace with tcpdump -ttt to give the relative timestamps makes 
things look like Mac OSX has an ACK avoidance heuristic in it?  I figured there 
was one in their OS <= 9 stack that came from a third party, wasn't sure if they 
put that into their OSX stack - IIRC that one is not from the third party.

FWIW, there are two or three other stacks that have ACK avoidance heuristics as 
well, it isn't an OSX only thing.

000780 10.107.96.230.139 > 10.107.96.7.32801: P 753:822(69) ack 1556 win 65535 
<nop,nop,timestamp 1709240657 534173> NBT Packet (DF)
000579 10.107.96.7.32801 > 10.107.96.230.139: . 1556:3004(1448) ack 822 win 1460 
<nop,nop,timestamp 534175 1709240657> NBT Packet (DF)
000027 10.107.96.7.32801 > 10.107.96.230.139: . 3004:4452(1448) ack 822 win 1460 
<nop,nop,timestamp 534175 1709240657> NBT Packet (DF)
000005 10.107.96.7.32801 > 10.107.96.230.139: . 4452:5900(1448) ack 822 win 1460 
<nop,nop,timestamp 534175 1709240657> NBT Packet (DF)
074685 10.107.96.230.139 > 10.107.96.7.32801: . ack 5900 win 62268 
<nop,nop,timestamp 1709240657 534175> (DF)

delack above

000012 10.107.96.7.32801 > 10.107.96.230.139: . 5900:7348(1448) ack 822 win 1460 
<nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
000003 10.107.96.7.32801 > 10.107.96.230.139: . 7348:8796(1448) ack 822 win 1460 
<nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
000002 10.107.96.7.32801 > 10.107.96.230.139: . 8796:10244(1448) ack 822 win 
1460 <nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
000002 10.107.96.7.32801 > 10.107.96.230.139: . 10244:11692(1448) ack 822 win 
1460 <nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
200024 10.107.96.230.139 > 10.107.96.7.32801: . ack 11692 win 56476 
<nop,nop,timestamp 1709240658 534249> (DF)

and again above.

000010 10.107.96.7.32801 > 10.107.96.230.139: . 11692:13140(1448) ack 822 win 
1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000004 10.107.96.7.32801 > 10.107.96.230.139: . 13140:14588(1448) ack 822 win 
1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000002 10.107.96.7.32801 > 10.107.96.230.139: P 14588:16036(1448) ack 822 win 
1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000022 10.107.96.7.32801 > 10.107.96.230.139: . 16036:17484(1448) ack 822 win 
1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000004 10.107.96.7.32801 > 10.107.96.230.139: P 17484:18192(708) ack 822 win 
1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000994 10.107.96.230.139 > 10.107.96.7.32801: . ack 18192 win 65535 
<nop,nop,timestamp 1709240658 534449> (DF)

And then other cases where the ACK seems to take a rather long time to arrive, 
seems to correlate a bit with slowly increasing numbers of segments before the 
ACK is sent, and something along the lines of a 200 millisecond delayed ACK timer.

In some cases at least if the sender does not completely fill cwnd the ACKs will 
be delayed.  And IIRC under 2.6.10 with TSO enabled, the sender does not always 
fill cwnd.
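
The long gaps can be picked out of a "cooked" trace mechanically. The following is a minimal sketch, assuming the first column is the inter-packet delta in microseconds (as this tcpdump -ttt output appears to print it) and that each packet has been joined onto one line. The function name and the 50 ms threshold are illustrative choices, not anything from tcpdump itself.

```python
import re

# Start of a "tcpdump -ttt" line as it appears in the trace above:
# a six-digit delta (assumed microseconds), then "src > dst: flags rest".
LINE_RE = re.compile(r"^(\d{6}) (\S+) > (\S+): (\S+) (.*)$")

def long_gaps(lines, threshold_us=50_000):
    """Return (delta_ms, src, summary) for packets that follow a long gap."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip commentary and wrapped continuation lines
        delta_us = int(m.group(1))
        if delta_us >= threshold_us:
            hits.append((delta_us / 1000.0, m.group(2), m.group(5)))
    return hits

# A few packets copied from the trace above, one per line.
trace = [
    "000005 10.107.96.7.32801 > 10.107.96.230.139: . 4452:5900(1448) ack 822 win 1460",
    "074685 10.107.96.230.139 > 10.107.96.7.32801: . ack 5900 win 62268",
    "000002 10.107.96.7.32801 > 10.107.96.230.139: . 10244:11692(1448) ack 822 win 1460",
    "200024 10.107.96.230.139 > 10.107.96.7.32801: . ack 11692 win 56476",
    "000994 10.107.96.230.139 > 10.107.96.7.32801: . ack 18192 win 65535",
]

for delta_ms, src, summary in long_gaps(trace):
    print(f"{delta_ms:8.3f} ms  {src}  {summary}")
```

On these sample lines only the two ACKs from 10.107.96.230 are flagged, at about 75 ms and 200 ms; the ACK that follows the PSH-terminated burst arrives in under 1 ms and is not.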

hth,

rick jones

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-11 22:54 ` Rick Jones
@ 2005-02-11 23:09   ` Nivedita Singhvi
  2005-02-11 23:40     ` Rick Jones
  2005-02-12  1:08     ` David S. Miller
  2005-02-12  1:09   ` David S. Miller
  1 sibling, 2 replies; 40+ messages in thread
From: Nivedita Singhvi @ 2005-02-11 23:09 UTC (permalink / raw)
  To: Rick Jones
  Cc: Hubert Tonneau, David S. Miller, shemminger, romieu, kuznet,
	netdev

Rick Jones wrote:

> 000010 10.107.96.7.32801 > 10.107.96.230.139: . 11692:13140(1448) ack 
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000004 10.107.96.7.32801 > 10.107.96.230.139: . 13140:14588(1448) ack 
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000002 10.107.96.7.32801 > 10.107.96.230.139: P 14588:16036(1448) ack 
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000022 10.107.96.7.32801 > 10.107.96.230.139: . 16036:17484(1448) ack 
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000004 10.107.96.7.32801 > 10.107.96.230.139: P 17484:18192(708) ack 822 
> win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000994 10.107.96.230.139 > 10.107.96.7.32801: . ack 18192 win 65535 
> <nop,nop,timestamp 1709240658 534449> (DF)
> 
> And then other cases where the ACK seems to take a rather long time to 
> arrive, seems to correlate a bit with slowly increasing numbers of 
> segments before the ACK is sent, and something along the lines of a 200 
> millisecond delayed ACK timer.
> 
> In some cases at least if the sender does not completely fill cwnd the 
> ACKs will be delayed.  And IIRC under 2.6.10 with TSO enabled, the 
> sender does not always fill cwnd.

Er, how is this compliant with 2581 (yes, I know, it's only
a SHOULD, not a MUST)  - an ACK should be generated for at
least every second full-sized segment received? Don't see
that happening. In many cases it's receiving quite a few
more packets. It should not be waiting for the delayed
ack timer to go off, surely?

thanks,
Nivedita

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-11 23:09   ` Nivedita Singhvi
@ 2005-02-11 23:40     ` Rick Jones
  2005-02-12  1:08     ` David S. Miller
  1 sibling, 0 replies; 40+ messages in thread
From: Rick Jones @ 2005-02-11 23:40 UTC (permalink / raw)
  To: netdev; +Cc: Hubert Tonneau, shemminger, romieu, kuznet

> Er, how is this compliant with 2581 (yes, I know, it's only a SHOULD, not a
> MUST)  - an ACK should be generated for at least every second full-sized
> segment received? Don't see that happening. In many cases it's receiving
> quite a few more packets. It should not be waiting for the delayed ack timer
> to go off, surely?

Certainly it would make for an interesting discussion.  Indeed it is a
SHOULD, which leaves the door open for other ACK policies to be compliant.  Those
might result in an ACK for more than two segments, or even an ACK for fewer than
two segments, and there are folks in either camp/faction/sect/pick your favorite
term.

I would say that it is still compliant with 2581.  The MUST there is that, no 
matter what, an ACK must be generated within 500 milliseconds.
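
To make the SHOULD and the MUST concrete, here is a minimal sketch of a receiver that ACKs every N full-sized segments or when a delayed-ACK timer expires, whichever comes first. The function, its parameters, and the 200 ms timer are assumptions for illustration; this is not the Mac OS X or Linux implementation.

```python
def ack_times(arrivals_ms, segs_per_ack=2, delack_ms=200):
    """Times (ms) at which a simple delayed-ACK receiver would send ACKs.

    arrivals_ms: sorted arrival times of full-sized, in-order segments.
    An ACK goes out as soon as segs_per_ack segments are unacknowledged,
    or delack_ms after the first unacknowledged segment, whichever is first.
    RFC 2581 requires the timer bound to be at most 500 ms.
    """
    acks = []
    pending = 0        # segments received but not yet acknowledged
    deadline = None    # timer expiry for the oldest pending segment
    for t in arrivals_ms:
        if pending and t >= deadline:
            acks.append(deadline)   # timer fired before this segment arrived
            pending = 0
        if pending == 0:
            deadline = t + delack_ms
        pending += 1
        if pending >= segs_per_ack:
            acks.append(t)          # enough segments: ACK immediately
            pending = 0
    if pending:
        acks.append(deadline)       # timer fires for the trailing segment(s)
    return acks

# Four back-to-back segments, as in the second burst of the trace.
print(ack_times([0, 1, 2, 3]))                  # ACK every second segment
print(ack_times([0, 1, 2, 3], segs_per_ack=8))  # stretch-ACK policy
```

With ACK-every-other the sender hears back within the burst. With the hypothetical stretch policy of 8, a four-segment burst is only acknowledged when the timer fires, which resembles the 200 ms gaps in the trace while still staying inside the 500 ms MUST.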

I suspect that had a full cwnd's worth of data been sent there would have been 
no lengthy delay in ACKs even with fewer than ACK-every-other.  I suspect that 
had TSO been disabled the full cwnd would have been sent and these delayed ACKs 
would not have happened and the transfer speed would have been happiness and joy.

FWIW, as the industry has added features such as CKO (ChecKsum Offload), 
copy-avoidance, and now TSO, the pie chart of time spent has been shifting more 
and more to ACK processing.  If we go back far enough, the writeups talk about 
how delayed ACK to increase ACK piggybacking was added in the first place - 
specifically (IIRC) for the purpose of minimizing ACK overhead.

rick jones

BTW, I'd be happy to trim emails that are already on netdev to avoid message 
duplications, is netdev a "closed" list?

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-11 23:09   ` Nivedita Singhvi
  2005-02-11 23:40     ` Rick Jones
@ 2005-02-12  1:08     ` David S. Miller
  1 sibling, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12  1:08 UTC (permalink / raw)
  To: Nivedita Singhvi
  Cc: rick.jones2, hubert.tonneau, shemminger, romieu, kuznet, netdev

On Fri, 11 Feb 2005 15:09:11 -0800
Nivedita Singhvi <niv@us.ibm.com> wrote:

> Er, how is this compliant with 2581 (yes, I know, it's only
> a SHOULD, not a MUST)  - an ACK should be generated for at
> least every second full-sized segment received?

It's compliant but stupid.

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-11 22:54 ` Rick Jones
  2005-02-11 23:09   ` Nivedita Singhvi
@ 2005-02-12  1:09   ` David S. Miller
  2005-02-12 14:31     ` Alexey Kuznetsov
  1 sibling, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-12  1:09 UTC (permalink / raw)
  To: Rick Jones; +Cc: hubert.tonneau, shemminger, romieu, kuznet, netdev

On Fri, 11 Feb 2005 14:54:27 -0800
Rick Jones <rick.jones2@hp.com> wrote:

> In some cases at least if the sender does not completely fill cwnd the
> ACKs will  be delayed.  And IIRC under 2.6.10 with TSO enabled, the
> sender does not always  fill cwnd.

At a maximum, "1/tcp_tso_win_divisor" of the cwnd will ever be left
empty.

By default, this is 1/8 of the cwnd.

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12  1:09   ` David S. Miller
@ 2005-02-12 14:31     ` Alexey Kuznetsov
  2005-02-12 19:28       ` David S. Miller
  2005-02-12 20:19       ` rick jones
  0 siblings, 2 replies; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 14:31 UTC (permalink / raw)
  To: David S. Miller
  Cc: Rick Jones, hubert.tonneau, shemminger, romieu, kuznet, netdev

Hello!

> > In some cases at least if the sender does not completely fill cwnd the
> > ACKs will  be delayed.  And IIRC under 2.6.10 with TSO enabled, the
> > sender does not always  fill cwnd.
> 
> At a maximum, "1/tcp_tso_win_divisor" of the cwnd will ever be left
> empty.
> 
> By default, this is 1/8 of the cwnd.

In any case, receiver cannot know sender cwnd, so that "fill" or "not fill"
is not a question.

What is broken in that implementation is that it does not feel slow start.
ACK avoidance while slow start is certain disaster. Current theory is that
MacOS X thinks that we do not do slow start.

Alexey

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 14:31     ` Alexey Kuznetsov
@ 2005-02-12 19:28       ` David S. Miller
  2005-02-12 19:44         ` Leonid Grossman
  2005-02-12 19:52         ` Alexey Kuznetsov
  2005-02-12 20:19       ` rick jones
  1 sibling, 2 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12 19:28 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: rick.jones2, hubert.tonneau, shemminger, romieu, kuznet, netdev

On Sat, 12 Feb 2005 17:31:05 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:

> In any case, receiver cannot know sender cwnd, so that "fill" or "not fill"
> is not a question.
>
> What is broken in that implementation is that it does not feel slow start.
> ACK avoidance while slow start is certain disaster. Current theory is that
> MacOS X thinks that we do not do slow start.

It is correct.  Although, I am still believing that setting PSH
is the avenue of investigation.

* RE: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 19:28       ` David S. Miller
@ 2005-02-12 19:44         ` Leonid Grossman
  2005-02-12 19:52         ` Alexey Kuznetsov
  1 sibling, 0 replies; 40+ messages in thread
From: Leonid Grossman @ 2005-02-12 19:44 UTC (permalink / raw)
  To: 'David S. Miller', 'Alexey Kuznetsov'
  Cc: rick.jones2, hubert.tonneau, shemminger, romieu, kuznet, netdev

Typically, a TSO engine sets PSH in the last packet that it builds for the
TSO+PSH request.
Leonid
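
A minimal sketch of that behaviour, assuming a TSO engine that simply cuts the super-frame into MSS-sized pieces and carries the requested PSH only on the final one. The function name and tuple layout are hypothetical and do not describe any particular NIC.

```python
def tso_segment(start_seq, payload_len, mss, psh):
    """Split one TSO super-frame into MSS-sized segments.

    Returns a list of (seq, length, psh_flag). The PSH bit requested for
    the super-frame is set only on the last segment produced.
    """
    segments = []
    offset = 0
    while offset < payload_len:
        length = min(mss, payload_len - offset)
        last = offset + length == payload_len
        segments.append((start_seq + offset, length, psh and last))
        offset += length
    return segments

# A 6500-byte frame starting at sequence 11692 with MSS 1448, sizes
# chosen to match one burst in the trace earlier in the thread.
for seq, length, flag in tso_segment(11692, 6500, 1448, True):
    print(f"{'P' if flag else '.'} {seq}:{seq + length}({length})")
```

Under this model a receiver that waits for PSH sees it only once per super-frame, however many wire segments the frame becomes.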

> -----Original Message-----
> From: netdev-bounce@oss.sgi.com 
> [mailto:netdev-bounce@oss.sgi.com] On Behalf Of David S. Miller
> Sent: Saturday, February 12, 2005 11:28 AM
> To: Alexey Kuznetsov
> Cc: rick.jones2@hp.com; hubert.tonneau@fullpliant.org; 
> shemminger@osdl.org; romieu@fr.zoreil.com; 
> kuznet@ms2.inr.ac.ru; netdev@oss.sgi.com
> Subject: Re: 2.6.10 TCP troubles -- suggested patch
> 
> On Sat, 12 Feb 2005 17:31:05 +0300
> Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:
> 
> > In any case, receiver cannot know sender cwnd, so that 
> "fill" or "not fill"
> > is not a question.
> >
> > What is broken in that implementation is that it does not 
> feel slow start.
> > ACK avoidance while slow start is certain disaster. 
> Current theory is 
> > that MacOS X thinks that we do not do slow start.
> 
> It is correct.  Although, I am still believing that setting 
> PSH is the avenue of investigation.
> 
> 

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 19:28       ` David S. Miller
  2005-02-12 19:44         ` Leonid Grossman
@ 2005-02-12 19:52         ` Alexey Kuznetsov
  2005-02-15 23:25           ` David S. Miller
  1 sibling, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 19:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Alexey Kuznetsov, rick.jones2, hubert.tonneau, shemminger, romieu,
	netdev

Hello!

> It is correct.  Although, I am still believing that setting PSH
> is the avenue of investigation.

Exactly. That's why the next test should be with disabled TSO in 2.6.9.
If too rare PSHs were a problem, it will show as the same disaster there.

[ And, to be honest, in this case, I daresay MacOS X may be left with its bugs
  alone. Or we could help it with something like setting PSH when we are in slow
  start and each half of CWND after completion of slow start. Or just set
  PSH on each frame. ]

Alexey

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 19:52         ` Alexey Kuznetsov
@ 2005-02-15 23:25           ` David S. Miller
  0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-15 23:25 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: kuznet, rick.jones2, hubert.tonneau, shemminger, romieu, netdev

On Sat, 12 Feb 2005 22:52:46 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:

> Exactly. That's why the next test should be with disabled TSO in 2.6.9.
> If too rare PSHs were a problem, it will show as the same disaster there.
> 
> [ And, to be honest, in this case, I daresay MacOS X may be left with its bugs
>   alone. Or we could help it with something like setting PSH when we are in slow
>   start and each half of CWND after completion of slow start. Or just set
>   PSH on each frame. ]

Setting it every other frame would fix the problem, just forcing it to
miss header prediction path is what is needed to avoid the silly delayed
ACK behavior.  And PSH is one way to do that.

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 14:31     ` Alexey Kuznetsov
  2005-02-12 19:28       ` David S. Miller
@ 2005-02-12 20:19       ` rick jones
  2005-02-12 20:28         ` David S. Miller
  2005-02-12 20:56         ` Alexey Kuznetsov
  1 sibling, 2 replies; 40+ messages in thread
From: rick jones @ 2005-02-12 20:19 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: netdev, romieu, hubert.tonneau, shemminger

On Feb 12, 2005, at 6:31 AM, Alexey Kuznetsov wrote:

> Hello!
>
>>> In some cases at least if the sender does not completely fill cwnd 
>>> the
>>> ACKs will  be delayed.  And IIRC under 2.6.10 with TSO enabled, the
>>> sender does not always  fill cwnd.
>>
>> At a maximum, "1/tcp_tso_win_divisor" of the cwnd will ever be left
>> empty.
>>
>> By default, this is 1/8 of the cwnd.
>
> In any case, receiver cannot know sender cwnd, so that "fill" or "not 
> fill"
> is not a question.

How is that?  Isn't cwnd based on the ACKs the sender receives from the 
receiver?

> What is broken in that implementation is that it does not feel slow 
> start.
> ACK avoidance while slow start is certain disaster. Current theory is 
> that
> MacOS X thinks that we do not do slow start.

Actually, it may think slow start is being done - there was enough 
small packet back and forth on the connection before the "heavy 
transfer" to get cwnd opened - I just didn't quote that in the "cooked" 
output.  All the stacks with ACK avoidance with which I am familiar do 
not make the assumption that the sender is not doing slow-start.  They 
make sure to send enough ACKs at the beginning (or after packet loss) 
to allow the sender's cwnd to grow.

rick jones
wisdom teeth are impacted, people are affected by the effects of events

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 20:19       ` rick jones
@ 2005-02-12 20:28         ` David S. Miller
  2005-02-12 20:56         ` Alexey Kuznetsov
  1 sibling, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12 20:28 UTC (permalink / raw)
  To: rick jones; +Cc: kuznet, netdev, romieu, hubert.tonneau, shemminger

On Sat, 12 Feb 2005 12:19:35 -0800
rick jones <rick.jones2@hp.com> wrote:

> How is that?  Isn't cwnd based on the ACKs the sender receives from the 
> receiver?

ACKs go from sender to receiver, first of all.

It is based upon congestion as seen "by receiver", something which is
impossible for sender.

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 20:19       ` rick jones
  2005-02-12 20:28         ` David S. Miller
@ 2005-02-12 20:56         ` Alexey Kuznetsov
  2005-02-12 21:27           ` Nivedita Singhvi
  2005-02-12 21:43           ` rick jones
  1 sibling, 2 replies; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 20:56 UTC (permalink / raw)
  To: rick jones; +Cc: Alexey Kuznetsov, netdev, romieu, hubert.tonneau, shemminger

Hello!

> Actually, it may think slow start is being done - there was enough 
> small packet back and forth on the connection before the "heavy 
> transfer" to get cwnd opened

If receiver sent an ACK it still does not mean that sender used it
to increase its cwnd. Particularly, small packet exchange definitely
does not inflate cwnd.

> output.  All the stacks with ACK avoidance with which I am familiar do 
> not make the assumption that the sender is not doing slow-start.  They 
> make sure to send enough ACKs at the beginning (or after packet loss) 
> to allow the sender's cwnd to grow.

Well, we do similar thing with delayed ACKs. And it took a few of runs
of testing to understand that we cannot detect even packet loss reliably
enough. :-)

Actually, those receivers could use the first delayed ACK event as
a sign of failure of their heuristics and block stretching acks for
this connection.

Alexey

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 20:56         ` Alexey Kuznetsov
@ 2005-02-12 21:27           ` Nivedita Singhvi
  2005-02-12 21:43           ` rick jones
  1 sibling, 0 replies; 40+ messages in thread
From: Nivedita Singhvi @ 2005-02-12 21:27 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: rick jones, netdev, romieu, hubert.tonneau, shemminger

Alexey Kuznetsov wrote:

> If receiver sent an ACK it still does not mean that sender used it
> to increase its cwnd. Particularly, small packet exchange definitely
> does not inflate cwnd.

Simplest way to go about this is simply compare it to the
trace of the "good/fast" connection - Hubert, could you
provide the "good" trace as well? That would show where
the differences in time are taken up..

thanks,
Nivedita

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 20:56         ` Alexey Kuznetsov
  2005-02-12 21:27           ` Nivedita Singhvi
@ 2005-02-12 21:43           ` rick jones
  2005-02-12 22:00             ` Alexey Kuznetsov
  1 sibling, 1 reply; 40+ messages in thread
From: rick jones @ 2005-02-12 21:43 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: netdev, romieu, hubert.tonneau, shemminger


> If receiver sent an ACK it still does not mean that sender used it
> to increase its cwnd. Particularly, small packet exchange definitely
> does not inflate cwnd.

Is that in general, or in Linux?

>> output.  All the stacks with ACK avoidance with which I am familiar do
>> not make the assumption that the sender is not doing slow-start.  They
>> make sure to send enough ACKs at the beginning (or after packet loss)
>> to allow the sender's cwnd to grow.
>
> Well, we do similar thing with delayed ACKs. And it took a few of runs
> of testing to understand that we cannot detect even packet loss 
> reliably
> enough. :-)

I never claimed it was easy :)

> Actually, those receivers could use the first delayed ACK event as
> a sign of failure of their heuristics and block stretching acks for
> this connection.

The ones with which I am familiar do - after N delayed ACK events where 
N is something other than one though.  And they still send immediate 
ACKs to the senders upon out of order data and all that.

rick jones
Wisdom teeth are impacted, people are affected by the effects of events

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 21:43           ` rick jones
@ 2005-02-12 22:00             ` Alexey Kuznetsov
  2005-02-13  1:29               ` rick jones
  0 siblings, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 22:00 UTC (permalink / raw)
  To: rick jones; +Cc: Alexey Kuznetsov, netdev, romieu, hubert.tonneau, shemminger

Hello!

> Is that in general, or in Linux?

Any which follows some of congestion window validation recommendations.
Even canonical bsd restarts slow start after rtt.


> N is something other than one though.

Well, 1 is quite enough to be sure that something is very wrong.
You see a proof here.

Alexey

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 22:00             ` Alexey Kuznetsov
@ 2005-02-13  1:29               ` rick jones
  0 siblings, 0 replies; 40+ messages in thread
From: rick jones @ 2005-02-13  1:29 UTC (permalink / raw)
  To: netdev; +Cc: romieu, hubert.tonneau, shemminger

On Feb 12, 2005, at 2:00 PM, Alexey Kuznetsov wrote:
> Any which follows some of congestion window validation recommendations.

If you could point me at the chapter and verse that would be great.

> Even canonical bsd restarts slow start after rtt.

Did we have >= one RTT of idle in the packet trace?

>> N is something other than one though.
>
> Well, 1 is quite enough to be sure that something is very wrong.
> You see a proof here.

The debate of course is what :)

In and of _itself_, a delayed ACK does not guarantee something is very 
wrong.  For example, in a request/response situation when the response 
takes longer than the delayed ACK interval to generate.  And if it was 
not request/response, and the sender simply didn't have any more to 
send at that point.

Going back to the quantity of cwnd which may be left unused when TSO is 
enabled.  If when TSO is enabled, the sender does not take full 
advantage of the cwnd doesn't that then mean that to deal with the same 
bandwidth delay product, one needs a larger TCP window when TSO is 
enabled than when it is not?  In the default case of 
tcp_tso_win_divisor being 8 that would be another 12.5% right?
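
A quick check of that arithmetic, as a sketch under the worst-case assumption stated earlier in the thread that 1/tcp_tso_win_divisor of the window stays unused. The function name is illustrative only.

```python
from fractions import Fraction

def window_scale_needed(divisor: int) -> Fraction:
    """Factor by which the window must grow so its usable part is unchanged.

    Assumes the worst case: 1/divisor of the window is left unused.
    """
    usable = 1 - Fraction(1, divisor)   # e.g. 7/8 for divisor 8
    return 1 / usable                   # e.g. 8/7

scale = window_scale_needed(8)
print(f"scale factor: {scale}")
print(f"extra window: {float(scale - 1) * 100:.1f}%")
```

With a divisor of 8 the usable share is 7/8, so covering the same bandwidth delay product would take 8/7 of the original window, about 14.3% more. The 12.5% figure is the share of the window left idle, not the increase needed to compensate for it.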

rick jones
there is no rest for the wicked, yet the virtuous have no pillows

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-11 21:55 2.6.10 TCP troubles -- suggested patch Hubert Tonneau
  2005-02-11 22:54 ` Rick Jones
@ 2005-02-11 23:04 ` Stephen Hemminger
  2005-02-12  1:07   ` David S. Miller
  2005-02-15 23:23   ` David S. Miller
  1 sibling, 2 replies; 40+ messages in thread
From: Stephen Hemminger @ 2005-02-11 23:04 UTC (permalink / raw)
  To: Hubert Tonneau
  Cc: David S. Miller, romieu, kuznet, Nivedita Singhvi, Rick Jones,
	netdev

On Fri, 11 Feb 2005 21:55:49 GMT
Hubert Tonneau <hubert.tonneau@fullpliant.org> wrote:

> Sorry, it still does not work, unless I made a mistake:
> Linux 2.6.9 takes 15 seconds to copy 105 MB to Mac OSX
> Linux 2.6.10 with the TCP patch below still takes 325 seconds to do the same.
> 
> You can pick the new tcpdump report, created through:
> tcpdump -i eth1 ip host 10.107.96.230 -w /tmp/dump-2.6.10-tcp2
> at http://fullpliant.org/pliant/browse/file/archive/dump-2.6.10-tcp2.gz

Still not setting Push sufficiently to keep MacOSX happy.

13:40:35.027124 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 924:975(51) ack 67344 win 50728 
13:40:35.027186 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 67344 win 65535 
13:40:35.027328 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 975:1026(51) ack 67344 win 65535 
13:40:35.027363 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 67344:68792(1448) ack 1026 win 1460 
13:40:35.027367 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 68792:70240(1448) ack 1026 win 1460 
13:40:35.027370 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 70240:71688(1448) ack 1026 win 1460 
13:40:35.027373 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 71688:73136(1448) ack 1026 win 1460 
13:40:35.027375 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 73136:74584(1448) ack 1026 win 1460 
13:40:35.027378 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 74584:76032(1448) ack 1026 win 1460 
13:40:35.027381 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 76032:77480(1448) ack 1026 win 1460 
13:40:35.027384 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 77480:78928(1448) ack 1026 win 1460 
13:40:35.027387 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 78928:80376(1448) ack 1026 win 1460 
13:40:35.027390 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 80376:81824(1448) ack 1026 win 1460 
13:40:35.027393 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 81824:83272(1448) ack 1026 win 1460 
13:40:35.027397 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: P 83272:83980(708) ack 1026 win 1460 

okay burst with push

13:40:35.034930 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 1179:1230(51) ack 133132 win 65535 
13:40:35.035304 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 133132:134580(1448) ack 1230 win 1460 
13:40:35.035312 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 134580:136028(1448) ack 1230 win 1460

Big gap... because of missing P

13:40:35.219175 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 136028 win 63716 
13:40:35.219193 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 136028:137476(1448) ack 1230 win 1460 
13:40:35.219197 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 137476:138924(1448) ack 1230 win 1460 
13:40:35.419193 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 138924 win 60820 
13:40:35.419202 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 138924:140372(1448) ack 1230 win 1460 
13:40:35.419205 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 140372:141820(1448) ack 1230 win 1460 
13:40:35.419207 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 141820:143268(1448) ack 1230 win 1460 

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-11 23:04 ` Stephen Hemminger
@ 2005-02-12  1:07   ` David S. Miller
  2005-02-12 12:11     ` Andi Kleen
  2005-02-12 14:16     ` Alexey Kuznetsov
  2005-02-15 23:23   ` David S. Miller
  1 sibling, 2 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12  1:07 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

On Fri, 11 Feb 2005 15:04:20 -0800
Stephen Hemminger <shemminger@osdl.org> wrote:

> Still not setting Push sufficiently to keep MacOSX happy.

I don't think it's the kernel's fault in this case.

This set of data frames you quoted are all full, and
are tightly interspaced.  It looks exactly like a TSO
frame, which we certainly set PSH on, but the TSO
engine is dropping it, apparently.

I guess this is e1000.  Any e1000 internals experts reading
here who can comment on how e1000's TSO engine treats the
PSH flag?

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12  1:07   ` David S. Miller
@ 2005-02-12 12:11     ` Andi Kleen
  2005-02-12 19:23       ` David S. Miller
  2005-02-12 14:16     ` Alexey Kuznetsov
  1 sibling, 1 reply; 40+ messages in thread
From: Andi Kleen @ 2005-02-12 12:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

"David S. Miller" <davem@davemloft.net> writes:
>
> I guess this is e1000.  Any e1000 internals experts reading
> here who can comment on how e1000's TSO engine treats the
> PSH flag?

If that is the problem it should be easy to test for. Just
disable TSO with ethtool -K ethX tso off

Hubert, does that make the problem go away?

-Andi

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 12:11     ` Andi Kleen
@ 2005-02-12 19:23       ` David S. Miller
  2005-02-12 21:30         ` Andi Kleen
  0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-12 19:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

On Sat, 12 Feb 2005 13:11:43 +0100
Andi Kleen <ak@muc.de> wrote:

> "David S. Miller" <davem@davemloft.net> writes:
> >
> > I guess this is e1000.  Any e1000 internals experts reading
> > here who can comment on how e1000's TSO engine treats the
> > PSH flag?
> 
> If that is the problem it should be easy to test for. Just
> disable TSO with ethtool -K ethX tso off
> 
> Hubert, does that make the problem go away?

We're testing the new code that sets PSH on every TSO frame.
If we disable TSO, the new code won't be exercised nor tested.
:-)

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 19:23       ` David S. Miller
@ 2005-02-12 21:30         ` Andi Kleen
  0 siblings, 0 replies; 40+ messages in thread
From: Andi Kleen @ 2005-02-12 21:30 UTC (permalink / raw)
  To: David S. Miller; +Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

> We're testing the new code that sets PSH on every TSO frame.
> If we disable TSO, the new code won't be exercised nor tested.
> :-)

Sorry, I read the thread out of order (shouldn't do that). Ignore my mail.

-Andi

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12  1:07   ` David S. Miller
  2005-02-12 12:11     ` Andi Kleen
@ 2005-02-12 14:16     ` Alexey Kuznetsov
  2005-02-12 19:41       ` David S. Miller
  1 sibling, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 14:16 UTC (permalink / raw)
  To: David S. Miller
  Cc: Stephen Hemminger, hubert.tonneau, romieu, kuznet, niv,
	rick.jones2, netdev

Hello!

> This set of data frames you quoted are all full, and
> are tightly interspaced.  It looks exactly like a TSO
> frame, which we certainly set PSH on, but the TSO
> engine is dropping it, apparently.
> 
> I guess this is e1000.  Any e1000 internals experts reading
> here who can comment on how e1000's TSO engine treats the
> PSH flag?

Or it was two one-segment frames.

Before blaming e1000 it would be easier to confirm that
Linux never worked with MacOS X, except for those kernels which
had congestion avoidance mostly suppressed.

I.e. let's disable TSO in 2.6.9 and look.

Alexey

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 14:16     ` Alexey Kuznetsov
@ 2005-02-12 19:41       ` David S. Miller
  2005-02-12 20:03         ` Alexey Kuznetsov
  0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-12 19:41 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: shemminger, hubert.tonneau, romieu, kuznet, niv, rick.jones2,
	netdev

On Sat, 12 Feb 2005 17:16:41 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:

> > This set of data frames you quoted are all full, and
> > are tightly interspaced.  It looks exactly like a TSO
> > frame, which we certainly set PSH on, but the TSO
> > engine is dropping it, apparently.
 ...
> Or it was two one-segment frames.

Even ignoring my TSO changes, we should be seeing at a minimum
1/2 window PSH settings which we're not as far as I can tell.
(this is due to the forced_push() test in net/ipv4/tcp.c)

This also points out a bug in my TSO PSH patch, I should be
updating tp->pushed_seq shouldn't I?  Question is, what to
set it to?  I think correct value is TCP_SKB_CB(skb)->end_seq.
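
For readers following along, the half-window rule mentioned here can be sketched in userspace C. This is a simplified model of the check as described above, not the kernel source: `struct push_model`, `seq_after`, `forced_push_model`, and `mark_push_model` are hypothetical names, and only the three fields the rule needs are modeled.

```c
#include <stdint.h>

/* Hypothetical stand-in for the few per-connection fields the rule uses. */
struct push_model {
    uint32_t write_seq;   /* sequence number just past the last queued byte */
    uint32_t pushed_seq;  /* write_seq at the time PSH was last marked */
    uint32_t max_window;  /* largest window the peer has ever advertised */
};

/* Wraparound-safe "a is after b" comparison on 32-bit sequence numbers. */
int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* True once more than half a window has been queued since the last PSH. */
int forced_push_model(const struct push_model *tp)
{
    return seq_after(tp->write_seq, tp->pushed_seq + (tp->max_window >> 1));
}

/* Record that PSH was set on the segment ending at write_seq. */
void mark_push_model(struct push_model *tp)
{
    tp->pushed_seq = tp->write_seq;
}
```

In this model, if `pushed_seq` is never advanced, the test stays true once the half-window threshold is first crossed, so PSH would be forced more often rather than less.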

> I.e. let's disable TSO in 2.6.9 and look.

I believe this experiment had been performed already.  Stephen,
isn't that the case?

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 19:41       ` David S. Miller
@ 2005-02-12 20:03         ` Alexey Kuznetsov
  2005-02-15 23:26           ` David S. Miller
  0 siblings, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 20:03 UTC (permalink / raw)
  To: David S. Miller
  Cc: Alexey Kuznetsov, shemminger, hubert.tonneau, romieu, niv,
	rick.jones2, netdev

Hello!

> set it to?  I think correct value is TCP_SKB_CB(skb)->end_seq.

Yup. But it does not matter. When it is not advanced, it does not make
PSHs more rare.

Actually, that anti-MacOS never worked well. If segment with forced PSH
was not transmitted in time, even forced PSHs could be deleted.
Your patch with setting PSH right before (or in) tcp_transmit_skb() must
work. Unless these segments are not tso.

> > I.e. let's disable TSO in 2.6.9 and look.
> 
> I believe this experiment had been performed already.

I saw only tests with TSO. And 2.6.9 showed exactly the same weird
behaviour. Only 2.6.9 did not slow start, and we had only a few 200 msec
gaps.

Alexey

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-12 20:03         ` Alexey Kuznetsov
@ 2005-02-15 23:26           ` David S. Miller
  2005-02-15 23:42             ` Rick Jones
  0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-15 23:26 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: kuznet, shemminger, hubert.tonneau, romieu, niv, rick.jones2,
	netdev

On Sat, 12 Feb 2005 23:03:18 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:

> Actually, that anti-MacOS never worked well. If segment with forced PSH
> was not transmitted in time, even forced PSHs could be deleted.
> Your patch with setting PSH right before (or in) tcp_transmit_skb() must
> work. Unless these segments are not tso.

Yes, it never did work well.  But now that we understand the nature
of this beast more deeply, we can probably refine it.

In short, for properly working TCP stream with no drops and no
reordering, Darwin delays ACKs until delack timer fires or PSH
is seen :-)

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-15 23:26           ` David S. Miller
@ 2005-02-15 23:42             ` Rick Jones
  0 siblings, 0 replies; 40+ messages in thread
From: Rick Jones @ 2005-02-15 23:42 UTC (permalink / raw)
  To: netdev

> In short, for properly working TCP stream with no drops and no
> reordering, Darwin delays ACKs until delack timer fires or PSH
> is seen :-)

As a supporter of ACK avoidance heuristics in general, I will come out and say
that the heuristic above does indeed sound quite broken.  It is not the 
heuristic with which I am familiar, which has a configurable maximum number of 
segments for which to delay the ACK.

rick jones

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-11 23:04 ` Stephen Hemminger
  2005-02-12  1:07   ` David S. Miller
@ 2005-02-15 23:23   ` David S. Miller
  2005-02-16  9:13     ` Alexey Kuznetsov
  1 sibling, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-15 23:23 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

On Fri, 11 Feb 2005 15:04:20 -0800
Stephen Hemminger <shemminger@osdl.org> wrote:

> Still not setting Push sufficiently to keep MacOSX happy.
 ...
> 13:40:35.034930 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 1179:1230(51) ack 133132 win 65535 
> 13:40:35.035304 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 133132:134580(1448) ack 1230 win 1460 
> 13:40:35.035312 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 134580:136028(1448) ack 1230 win 1460
> 
> Big gap... because of missing P
> 
> 13:40:35.219175 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 136028 win 63716 

I am starting to understand Darwin's logic.  If header prediction fast path
is hit, ACK is always delayed when delack sysctl is enabled.

One way to miss fast path is for PSH to be set.

This will make ACK not get delayed if ACK is pending already.

At least that is how it looks, and it makes sense given this trace.

How mind boggling a heuristic.  I bet it works by accident rather
than intention and purposeful design.
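
The receiver behaviour inferred above can be written down as a toy model in C. This is only a sketch of the heuristic as guessed from the trace, not Darwin source code. The names `struct seg`, `DELACK_USEC`, and `first_ack_time` are hypothetical, and the model assumes a one-shot 200 ms timer armed by the first unacknowledged segment, which is a simplification.

```c
#include <stddef.h>

/* One received data segment: arrival time and whether PSH was set. */
struct seg {
    long t_usec;
    int  psh;
};

#define DELACK_USEC 200000L  /* assumed delayed-ACK timeout */

/* Return the time at which the modeled receiver sends its first ACK
 * for in-order segments, or -1 if no segment arrives at all. */
long first_ack_time(const struct seg *segs, size_t n)
{
    int  ack_pending = 0;
    long deadline = 0;

    for (size_t i = 0; i < n; i++) {
        /* The delack timer fired before this segment showed up. */
        if (ack_pending && segs[i].t_usec >= deadline)
            return deadline;

        /* PSH misses the fast path; with an ACK pending, ACK now. */
        if (segs[i].psh && ack_pending)
            return segs[i].t_usec;

        /* Otherwise the ACK is delayed; arm the timer once. */
        if (!ack_pending) {
            ack_pending = 1;
            deadline = segs[i].t_usec + DELACK_USEC;
        }
    }
    return ack_pending ? deadline : -1;
}
```

Under these assumptions, a pair of full segments without PSH is acknowledged only when the timer expires, while the same pair with PSH on the second segment is acknowledged as soon as that segment arrives.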

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-15 23:23   ` David S. Miller
@ 2005-02-16  9:13     ` Alexey Kuznetsov
  2005-02-16 17:50       ` David S. Miller
  0 siblings, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-16  9:13 UTC (permalink / raw)
  To: David S. Miller
  Cc: Stephen Hemminger, hubert.tonneau, romieu, kuznet, niv,
	rick.jones2, netdev

Hello!

> How mind boggling a heuristic.  I bet it works by accident rather
> than intention and purposeful design.

Yup. It is definitely not an "ack avoidance algorithm" :-) :-)

BTW it is still a puzzle why 2.6.9 works. With disabled TSO it should
insert PSHs quite rarely, similarly to tso.

And it is still a puzzle how that bunch of PSHless segments not followed
by PSH appeared in TSO case.

Alexey

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-16  9:13     ` Alexey Kuznetsov
@ 2005-02-16 17:50       ` David S. Miller
  0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-16 17:50 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: shemminger, hubert.tonneau, romieu, kuznet, niv, rick.jones2,
	netdev

[-- Attachment #1: Type: text/plain, Size: 661 bytes --]

On Wed, 16 Feb 2005 12:13:23 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:

> BTW it is still a puzzle why 2.6.9 works. With disabled TSO it should
> insert PSHs quite rarely, similarly to tso.

Yes.

Hubert, do you have netfilter enabled in the 2.6.10 kernel you are running?

I'm asking because the TCP changes in 2.6.10 are pretty benign
(attached for the curious who want to review along), whereas
netfilter had major updates particularly in the TCP connection
tracking code.

I also reviewed 2.6.10-ac11 for anything interesting wrt. TCP, and the
only thing in there is the check that was missing from
tcp_retrans_try_collapse() to avoid collapsing TSO segments.


[-- Attachment #2: tcp-2.6.10 --]
[-- Type: application/octet-stream, Size: 35185 bytes --]

diff -Nru a/include/linux/tcp.h b/include/linux/tcp.h
--- a/include/linux/tcp.h	2004-12-24 13:36:49 -08:00
+++ b/include/linux/tcp.h	2004-12-24 13:36:49 -08:00
@@ -186,6 +186,8 @@
 
 	__u32	tcpi_rcv_rtt;
 	__u32	tcpi_rcv_space;
+
+	__u32	tcpi_total_retrans;
 };
 
 #ifdef __KERNEL__
@@ -363,6 +365,8 @@
 	__u8	pending;	/* Scheduled timer event	*/
 	__u8	urg_mode;	/* In urgent mode		*/
 	__u32	snd_up;		/* Urgent pointer		*/
+
+	__u32	total_retrans;	/* Total retransmits for entire connection */
 
 	/* The syn_wait_lock is necessary only to avoid proc interface having
 	 * to grab the main lock sock while browsing the listening hash
diff -Nru a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h	2004-12-24 13:36:18 -08:00
+++ b/include/net/tcp.h	2004-12-24 13:36:18 -08:00
@@ -159,7 +159,6 @@
 extern void tcp_bucket_destroy(struct tcp_bind_bucket *tb);
 extern void tcp_bucket_unlock(struct sock *sk);
 extern int tcp_port_rover;
-extern struct sock *tcp_v4_lookup_listener(u32 addr, unsigned short hnum, int dif);
 
 /* These are AF independent. */
 static __inline__ int tcp_bhashfn(__u16 lport)
@@ -362,8 +361,8 @@
 #define TCP_IPV6_MATCH(__sk, __saddr, __daddr, __ports, __dif)	   \
 	(((*((__u32 *)&(inet_sk(__sk)->dport)))== (__ports))   	&& \
 	 ((__sk)->sk_family		== AF_INET6)		&& \
-	 !ipv6_addr_cmp(&inet6_sk(__sk)->daddr, (__saddr))	&& \
-	 !ipv6_addr_cmp(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&& \
+	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&& \
+	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&& \
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
 
 /* These can have wildcards, don't try too hard. */
@@ -961,12 +960,14 @@
 extern void tcp_init_xmit_timers(struct sock *);
 extern void tcp_clear_xmit_timers(struct sock *);
 
-extern void tcp_delete_keepalive_timer (struct sock *);
-extern void tcp_reset_keepalive_timer (struct sock *, unsigned long);
+extern void tcp_delete_keepalive_timer(struct sock *);
+extern void tcp_reset_keepalive_timer(struct sock *, unsigned long);
 extern unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu);
 extern unsigned int tcp_current_mss(struct sock *sk, int large);
 
-extern const char timer_bug_msg[];
+#ifdef TCP_DEBUG
+extern const char tcp_timer_bug_msg[];
+#endif
 
 /* tcp_diag.c */
 extern void tcp_get_info(struct sock *, struct tcp_info *);
@@ -999,7 +1000,9 @@
 #endif
 		break;
 	default:
-		printk(timer_bug_msg);
+#ifdef TCP_DEBUG
+		printk(tcp_timer_bug_msg);
+#endif
 		return;
 	};
 
@@ -1034,7 +1037,9 @@
 		break;
 
 	default:
-		printk(timer_bug_msg);
+#ifdef TCP_DEBUG
+		printk(tcp_timer_bug_msg);
+#endif
 	};
 }
 
@@ -1083,7 +1088,7 @@
  * Rcv_nxt can be after the window if our peer push more data
  * than the offered window.
  */
-static __inline__ u32 tcp_receive_window(struct tcp_opt *tp)
+static __inline__ u32 tcp_receive_window(const struct tcp_opt *tp)
 {
 	s32 win = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt;
 
@@ -1161,18 +1166,19 @@
 /* Due to TSO, an SKB can be composed of multiple actual
  * packets.  To keep these tracked properly, we use this.
  */
-static inline int tcp_skb_pcount(struct sk_buff *skb)
+static inline int tcp_skb_pcount(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->tso_segs;
 }
 
 /* This is valid iff tcp_skb_pcount() > 1. */
-static inline int tcp_skb_mss(struct sk_buff *skb)
+static inline int tcp_skb_mss(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->tso_size;
 }
 
-static inline void tcp_inc_pcount(tcp_pcount_t *count, struct sk_buff *skb)
+static inline void tcp_inc_pcount(tcp_pcount_t *count,
+				  const struct sk_buff *skb)
 {
 	count->val += tcp_skb_pcount(skb);
 }
@@ -1187,13 +1193,14 @@
 	count->val -= amt;
 }
 
-static inline void tcp_dec_pcount(tcp_pcount_t *count, struct sk_buff *skb)
+static inline void tcp_dec_pcount(tcp_pcount_t *count, 
+				  const struct sk_buff *skb)
 {
 	count->val -= tcp_skb_pcount(skb);
 }
 
 static inline void tcp_dec_pcount_approx(tcp_pcount_t *count,
-					 struct sk_buff *skb)
+					 const struct sk_buff *skb)
 {
 	if (count->val) {
 		count->val -= tcp_skb_pcount(skb);
@@ -1202,7 +1209,7 @@
 	}
 }
 
-static inline __u32 tcp_get_pcount(tcp_pcount_t *count)
+static inline __u32 tcp_get_pcount(const tcp_pcount_t *count)
 {
 	return count->val;
 }
@@ -1212,8 +1219,9 @@
 	count->val = val;
 }
 
-static inline void tcp_packets_out_inc(struct sock *sk, struct tcp_opt *tp,
-				       struct sk_buff *skb)
+static inline void tcp_packets_out_inc(struct sock *sk, 
+				       struct tcp_opt *tp,
+				       const struct sk_buff *skb)
 {
 	int orig = tcp_get_pcount(&tp->packets_out);
 
@@ -1222,7 +1230,8 @@
 		tcp_reset_xmit_timer(sk, TCP_TIME_RETRANS, tp->rto);
 }
 
-static inline void tcp_packets_out_dec(struct tcp_opt *tp, struct sk_buff *skb)
+static inline void tcp_packets_out_dec(struct tcp_opt *tp, 
+				       const struct sk_buff *skb)
 {
 	tcp_dec_pcount(&tp->packets_out, skb);
 }
@@ -1241,7 +1250,7 @@
  *	"Packets left network, but not honestly ACKed yet" PLUS
  *	"Packets fast retransmitted"
  */
-static __inline__ unsigned int tcp_packets_in_flight(struct tcp_opt *tp)
+static __inline__ unsigned int tcp_packets_in_flight(const struct tcp_opt *tp)
 {
 	return (tcp_get_pcount(&tp->packets_out) -
 		tcp_get_pcount(&tp->left_out) +
@@ -1408,18 +1417,19 @@
 /* Slow start with delack produces 3 packets of burst, so that
  * it is safe "de facto".
  */
-static __inline__ __u32 tcp_max_burst(struct tcp_opt *tp)
+static __inline__ __u32 tcp_max_burst(const struct tcp_opt *tp)
 {
 	return 3;
 }
 
-static __inline__ int tcp_minshall_check(struct tcp_opt *tp)
+static __inline__ int tcp_minshall_check(const struct tcp_opt *tp)
 {
 	return after(tp->snd_sml,tp->snd_una) &&
 		!after(tp->snd_sml, tp->snd_nxt);
 }
 
-static __inline__ void tcp_minshall_update(struct tcp_opt *tp, int mss, struct sk_buff *skb)
+static __inline__ void tcp_minshall_update(struct tcp_opt *tp, int mss, 
+					   const struct sk_buff *skb)
 {
 	if (skb->len < mss)
 		tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
@@ -1434,7 +1444,8 @@
  */
 
 static __inline__ int
-tcp_nagle_check(struct tcp_opt *tp, struct sk_buff *skb, unsigned mss_now, int nonagle)
+tcp_nagle_check(const struct tcp_opt *tp, const struct sk_buff *skb, 
+		unsigned mss_now, int nonagle)
 {
 	return (skb->len < mss_now &&
 		!(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) &&
@@ -1449,7 +1460,8 @@
 /* This checks if the data bearing packet SKB (usually sk->sk_send_head)
  * should be put on the wire right now.
  */
-static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
+static __inline__ int tcp_snd_test(const struct tcp_opt *tp, 
+				   struct sk_buff *skb,
 				   unsigned cur_mss, int nonagle)
 {
 	int pkts = tcp_skb_pcount(skb);
@@ -1496,7 +1508,8 @@
 		tcp_reset_xmit_timer(sk, TCP_TIME_PROBE0, tp->rto);
 }
 
-static __inline__ int tcp_skb_is_last(struct sock *sk, struct sk_buff *skb)
+static __inline__ int tcp_skb_is_last(const struct sock *sk, 
+				      const struct sk_buff *skb)
 {
 	return skb->next == (struct sk_buff *)&sk->sk_write_queue;
 }
@@ -1547,7 +1560,7 @@
 	tp->snd_wl1 = seq;
 }
 
-extern void			tcp_destroy_sock(struct sock *sk);
+extern void tcp_destroy_sock(struct sock *sk);
 
 
 /*
@@ -1621,7 +1634,7 @@
 #undef STATE_TRACE
 
 #ifdef STATE_TRACE
-static char *statename[]={
+static const char *statename[]={
 	"Unused","Established","Syn Sent","Syn Recv",
 	"Fin Wait 1","Fin Wait 2","Time Wait", "Close",
 	"Close Wait","Last ACK","Listen","Closing"
@@ -1892,17 +1905,17 @@
 		wake_up(&tcp_lhash_wait);
 }
 
-static inline int keepalive_intvl_when(struct tcp_opt *tp)
+static inline int keepalive_intvl_when(const struct tcp_opt *tp)
 {
 	return tp->keepalive_intvl ? : sysctl_tcp_keepalive_intvl;
 }
 
-static inline int keepalive_time_when(struct tcp_opt *tp)
+static inline int keepalive_time_when(const struct tcp_opt *tp)
 {
 	return tp->keepalive_time ? : sysctl_tcp_keepalive_time;
 }
 
-static inline int tcp_fin_time(struct tcp_opt *tp)
+static inline int tcp_fin_time(const struct tcp_opt *tp)
 {
 	int fin_timeout = tp->linger2 ? : sysctl_tcp_fin_timeout;
 
@@ -1912,7 +1925,7 @@
 	return fin_timeout;
 }
 
-static inline int tcp_paws_check(struct tcp_opt *tp, int rst)
+static inline int tcp_paws_check(const struct tcp_opt *tp, int rst)
 {
 	if ((s32)(tp->rcv_tsval - tp->ts_recent) >= 0)
 		return 0;
diff -Nru a/net/ipv4/tcp.c b/net/ipv4/tcp.c
--- a/net/ipv4/tcp.c	2004-12-24 13:36:31 -08:00
+++ b/net/ipv4/tcp.c	2004-12-24 13:36:31 -08:00
@@ -467,7 +467,7 @@
 	sk->sk_max_ack_backlog = 0;
 	sk->sk_ack_backlog = 0;
 	tp->accept_queue = tp->accept_queue_tail = NULL;
-	tp->syn_wait_lock = RW_LOCK_UNLOCKED;
+	rwlock_init(&tp->syn_wait_lock);
 	tcp_delack_init(tp);
 
 	lopt = kmalloc(sizeof(struct tcp_listen_opt), GFP_KERNEL);
@@ -2095,6 +2095,65 @@
 	return err;
 }
 
+/* Return information about state of tcp endpoint in API format. */
+void tcp_get_info(struct sock *sk, struct tcp_info *info)
+{
+	struct tcp_opt *tp = tcp_sk(sk);
+	u32 now = tcp_time_stamp;
+
+	memset(info, 0, sizeof(*info));
+
+	info->tcpi_state = sk->sk_state;
+	info->tcpi_ca_state = tp->ca_state;
+	info->tcpi_retransmits = tp->retransmits;
+	info->tcpi_probes = tp->probes_out;
+	info->tcpi_backoff = tp->backoff;
+
+	if (tp->tstamp_ok)
+		info->tcpi_options |= TCPI_OPT_TIMESTAMPS;
+	if (tp->sack_ok)
+		info->tcpi_options |= TCPI_OPT_SACK;
+	if (tp->wscale_ok) {
+		info->tcpi_options |= TCPI_OPT_WSCALE;
+		info->tcpi_snd_wscale = tp->snd_wscale;
+		info->tcpi_rcv_wscale = tp->rcv_wscale;
+	} 
+
+	if (tp->ecn_flags&TCP_ECN_OK)
+		info->tcpi_options |= TCPI_OPT_ECN;
+
+	info->tcpi_rto = jiffies_to_usecs(tp->rto);
+	info->tcpi_ato = jiffies_to_usecs(tp->ack.ato);
+	info->tcpi_snd_mss = tp->mss_cache_std;
+	info->tcpi_rcv_mss = tp->ack.rcv_mss;
+
+	info->tcpi_unacked = tcp_get_pcount(&tp->packets_out);
+	info->tcpi_sacked = tcp_get_pcount(&tp->sacked_out);
+	info->tcpi_lost = tcp_get_pcount(&tp->lost_out);
+	info->tcpi_retrans = tcp_get_pcount(&tp->retrans_out);
+	info->tcpi_fackets = tcp_get_pcount(&tp->fackets_out);
+
+	info->tcpi_last_data_sent = jiffies_to_msecs(now - tp->lsndtime);
+	info->tcpi_last_data_recv = jiffies_to_msecs(now - tp->ack.lrcvtime);
+	info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp);
+
+	info->tcpi_pmtu = tp->pmtu_cookie;
+	info->tcpi_rcv_ssthresh = tp->rcv_ssthresh;
+	info->tcpi_rtt = jiffies_to_usecs(tp->srtt)>>3;
+	info->tcpi_rttvar = jiffies_to_usecs(tp->mdev)>>2;
+	info->tcpi_snd_ssthresh = tp->snd_ssthresh;
+	info->tcpi_snd_cwnd = tp->snd_cwnd;
+	info->tcpi_advmss = tp->advmss;
+	info->tcpi_reordering = tp->reordering;
+
+	info->tcpi_rcv_rtt = jiffies_to_usecs(tp->rcv_rtt_est.rtt)>>3;
+	info->tcpi_rcv_space = tp->rcvq_space.space;
+
+	info->tcpi_total_retrans = tp->total_retrans;
+}
+
+EXPORT_SYMBOL_GPL(tcp_get_info);
+
 int tcp_getsockopt(struct sock *sk, int level, int optname, char __user *optval,
 		   int __user *optlen)
 {
@@ -2250,7 +2309,7 @@
 	if (!tcp_ehash)
 		panic("Failed to allocate TCP established hash table\n");
 	for (i = 0; i < (tcp_ehash_size << 1); i++) {
-		tcp_ehash[i].lock = RW_LOCK_UNLOCKED;
+		rwlock_init(&tcp_ehash[i].lock);
 		INIT_HLIST_HEAD(&tcp_ehash[i].chain);
 	}
 
@@ -2266,7 +2325,7 @@
 	if (!tcp_bhash)
 		panic("Failed to allocate TCP bind hash table\n");
 	for (i = 0; i < tcp_bhash_size; i++) {
-		tcp_bhash[i].lock = SPIN_LOCK_UNLOCKED;
+		spin_lock_init(&tcp_bhash[i].lock);
 		INIT_HLIST_HEAD(&tcp_bhash[i].chain);
 	}
 
@@ -2301,13 +2360,10 @@
 	printk(KERN_INFO "TCP: Hash tables configured "
 	       "(established %d bind %d)\n",
 	       tcp_ehash_size << 1, tcp_bhash_size);
-
-	tcpdiag_init();
 }
 
 EXPORT_SYMBOL(tcp_accept);
 EXPORT_SYMBOL(tcp_close);
-EXPORT_SYMBOL(tcp_close_state);
 EXPORT_SYMBOL(tcp_destroy_sock);
 EXPORT_SYMBOL(tcp_disconnect);
 EXPORT_SYMBOL(tcp_getsockopt);
diff -Nru a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
--- a/net/ipv4/tcp_diag.c	2004-12-24 13:36:17 -08:00
+++ b/net/ipv4/tcp_diag.c	2004-12-24 13:36:17 -08:00
@@ -18,6 +18,7 @@
 #include <linux/random.h>
 #include <linux/cache.h>
 #include <linux/init.h>
+#include <linux/time.h>
 
 #include <net/icmp.h>
 #include <net/tcp.h>
@@ -29,6 +30,16 @@
 
 #include <linux/tcp_diag.h>
 
+struct tcpdiag_entry
+{
+	u32 *saddr;
+	u32 *daddr;
+	u16 sport;
+	u16 dport;
+	u16 family;
+	u16 userlocks;
+};
+
 static struct sock *tcpnl;
 
 
@@ -41,63 +52,8 @@
    rta->rta_len = rtalen;                   \
    RTA_DATA(rta); })
 
-/* Return information about state of tcp endpoint in API format. */
-void tcp_get_info(struct sock *sk, struct tcp_info *info)
-{
-	struct tcp_opt *tp = tcp_sk(sk);
-	u32 now = tcp_time_stamp;
-
-	memset(info, 0, sizeof(*info));
-
-	info->tcpi_state = sk->sk_state;
-	info->tcpi_ca_state = tp->ca_state;
-	info->tcpi_retransmits = tp->retransmits;
-	info->tcpi_probes = tp->probes_out;
-	info->tcpi_backoff = tp->backoff;
-
-	if (tp->tstamp_ok)
-		info->tcpi_options |= TCPI_OPT_TIMESTAMPS;
-	if (tp->sack_ok)
-		info->tcpi_options |= TCPI_OPT_SACK;
-	if (tp->wscale_ok) {
-		info->tcpi_options |= TCPI_OPT_WSCALE;
-		info->tcpi_snd_wscale = tp->snd_wscale;
-		info->tcpi_rcv_wscale = tp->rcv_wscale;
-	} 
-
-	if (tp->ecn_flags&TCP_ECN_OK)
-		info->tcpi_options |= TCPI_OPT_ECN;
-
-	info->tcpi_rto = jiffies_to_usecs(tp->rto);
-	info->tcpi_ato = jiffies_to_usecs(tp->ack.ato);
-	info->tcpi_snd_mss = tp->mss_cache_std;
-	info->tcpi_rcv_mss = tp->ack.rcv_mss;
-
-	info->tcpi_unacked = tcp_get_pcount(&tp->packets_out);
-	info->tcpi_sacked = tcp_get_pcount(&tp->sacked_out);
-	info->tcpi_lost = tcp_get_pcount(&tp->lost_out);
-	info->tcpi_retrans = tcp_get_pcount(&tp->retrans_out);
-	info->tcpi_fackets = tcp_get_pcount(&tp->fackets_out);
-
-	info->tcpi_last_data_sent = jiffies_to_msecs(now - tp->lsndtime);
-	info->tcpi_last_data_recv = jiffies_to_msecs(now - tp->ack.lrcvtime);
-	info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp);
-
-	info->tcpi_pmtu = tp->pmtu_cookie;
-	info->tcpi_rcv_ssthresh = tp->rcv_ssthresh;
-	info->tcpi_rtt = jiffies_to_usecs(tp->srtt)>>3;
-	info->tcpi_rttvar = jiffies_to_usecs(tp->mdev)>>2;
-	info->tcpi_snd_ssthresh = tp->snd_ssthresh;
-	info->tcpi_snd_cwnd = tp->snd_cwnd;
-	info->tcpi_advmss = tp->advmss;
-	info->tcpi_reordering = tp->reordering;
-
-	info->tcpi_rcv_rtt = jiffies_to_usecs(tp->rcv_rtt_est.rtt)>>3;
-	info->tcpi_rcv_space = tp->rcvq_space.space;
-}
-
 static int tcpdiag_fill(struct sk_buff *skb, struct sock *sk,
-			int ext, u32 pid, u32 seq)
+			int ext, u32 pid, u32 seq, u16 nlmsg_flags)
 {
 	struct inet_opt *inet = inet_sk(sk);
 	struct tcp_opt *tp = tcp_sk(sk);
@@ -109,6 +65,7 @@
 	unsigned char	 *b = skb->tail;
 
 	nlh = NLMSG_PUT(skb, pid, seq, TCPDIAG_GETSOCK, sizeof(*r));
+	nlh->nlmsg_flags = nlmsg_flags;
 	r = NLMSG_DATA(nlh);
 	if (sk->sk_state != TCP_TIME_WAIT) {
 		if (ext & (1<<(TCPDIAG_MEMINFO-1)))
@@ -146,7 +103,7 @@
 		r->tcpdiag_wqueue = 0;
 		r->tcpdiag_uid = 0;
 		r->tcpdiag_inode = 0;
-#ifdef CONFIG_IPV6
+#ifdef CONFIG_IP_TCPDIAG_IPV6
 		if (r->tcpdiag_family == AF_INET6) {
 			ipv6_addr_copy((struct in6_addr *)r->id.tcpdiag_src,
 				       &tw->tw_v6_rcv_saddr);
@@ -163,7 +120,7 @@
 	r->id.tcpdiag_src[0] = inet->rcv_saddr;
 	r->id.tcpdiag_dst[0] = inet->daddr;
 
-#ifdef CONFIG_IPV6
+#ifdef CONFIG_IP_TCPDIAG_IPV6
 	if (r->tcpdiag_family == AF_INET6) {
 		struct ipv6_pinfo *np = inet6_sk(sk);
 
@@ -231,11 +188,19 @@
 	return -1;
 }
 
-extern struct sock *tcp_v4_lookup(u32 saddr, u16 sport, u32 daddr, u16 dport, int dif);
-#ifdef CONFIG_IPV6
+extern struct sock *tcp_v4_lookup(u32 saddr, u16 sport, u32 daddr, u16 dport,
+				  int dif);
+#ifdef CONFIG_IP_TCPDIAG_IPV6
 extern struct sock *tcp_v6_lookup(struct in6_addr *saddr, u16 sport,
 				  struct in6_addr *daddr, u16 dport,
 				  int dif);
+#else
+static inline struct sock *tcp_v6_lookup(struct in6_addr *saddr, u16 sport,
+					 struct in6_addr *daddr, u16 dport,
+					 int dif)
+{
+	return NULL;
+}
 #endif
 
 static int tcpdiag_get_exact(struct sk_buff *in_skb, const struct nlmsghdr *nlh)
@@ -250,7 +215,7 @@
 				   req->id.tcpdiag_src[0], req->id.tcpdiag_sport,
 				   req->id.tcpdiag_if);
 	}
-#ifdef CONFIG_IPV6
+#ifdef CONFIG_IP_TCPDIAG_IPV6
 	else if (req->tcpdiag_family == AF_INET6) {
 		sk = tcp_v6_lookup((struct in6_addr*)req->id.tcpdiag_dst, req->id.tcpdiag_dport,
 				   (struct in6_addr*)req->id.tcpdiag_src, req->id.tcpdiag_sport,
@@ -280,7 +245,7 @@
 
 	if (tcpdiag_fill(rep, sk, req->tcpdiag_ext,
 			 NETLINK_CB(in_skb).pid,
-			 nlh->nlmsg_seq) <= 0)
+			 nlh->nlmsg_seq, 0) <= 0)
 		BUG();
 
 	err = netlink_unicast(tcpnl, rep, NETLINK_CB(in_skb).pid, MSG_DONTWAIT);
@@ -324,11 +289,11 @@
 }
 
 
-static int tcpdiag_bc_run(const void *bc, int len, struct sock *sk)
+static int tcpdiag_bc_run(const void *bc, int len,
+			  const struct tcpdiag_entry *entry)
 {
 	while (len > 0) {
 		int yes = 1;
-		struct inet_opt *inet = inet_sk(sk);
 		const struct tcpdiag_bc_op *op = bc;
 
 		switch (op->code) {
@@ -338,19 +303,19 @@
 			yes = 0;
 			break;
 		case TCPDIAG_BC_S_GE:
-			yes = inet->num >= op[1].no;
+			yes = entry->sport >= op[1].no;
 			break;
 		case TCPDIAG_BC_S_LE:
-			yes = inet->num <= op[1].no;
+			yes = entry->dport <= op[1].no;
 			break;
 		case TCPDIAG_BC_D_GE:
-			yes = ntohs(inet->dport) >= op[1].no;
+			yes = entry->dport >= op[1].no;
 			break;
 		case TCPDIAG_BC_D_LE:
-			yes = ntohs(inet->dport) <= op[1].no;
+			yes = entry->dport <= op[1].no;
 			break;
 		case TCPDIAG_BC_AUTO:
-			yes = !(sk->sk_userlocks & SOCK_BINDPORT_LOCK);
+			yes = !(entry->userlocks & SOCK_BINDPORT_LOCK);
 			break;
 		case TCPDIAG_BC_S_COND:
 		case TCPDIAG_BC_D_COND:
@@ -360,7 +325,7 @@
 
 			if (cond->port != -1 &&
 			    cond->port != (op->code == TCPDIAG_BC_S_COND ?
-					     inet->num : ntohs(inet->dport))) {
+					     entry->sport : entry->dport)) {
 				yes = 0;
 				break;
 			}
@@ -368,26 +333,14 @@
 			if (cond->prefix_len == 0)
 				break;
 
-#ifdef CONFIG_IPV6
-			if (sk->sk_family == AF_INET6) {
-				struct ipv6_pinfo *np = inet6_sk(sk);
-
-				if (op->code == TCPDIAG_BC_S_COND)
-					addr = (u32*)&np->rcv_saddr;
-				else
-					addr = (u32*)&np->daddr;
-			} else
-#endif
-			{
-				if (op->code == TCPDIAG_BC_S_COND)
-					addr = &inet->rcv_saddr;
-				else
-					addr = &inet->daddr;
-			}
+			if (op->code == TCPDIAG_BC_S_COND)
+				addr = entry->saddr;
+			else
+				addr = entry->daddr;
 
 			if (bitstring_match(addr, cond->addr, cond->prefix_len))
 				break;
-			if (sk->sk_family == AF_INET6 &&
+			if (entry->family == AF_INET6 &&
 			    cond->family == AF_INET) {
 				if (addr[0] == 0 && addr[1] == 0 &&
 				    addr[2] == htonl(0xffff) &&
@@ -466,16 +419,182 @@
 	return len == 0 ? 0 : -EINVAL;
 }
 
+static int tcpdiag_dump_sock(struct sk_buff *skb, struct sock *sk,
+			     struct netlink_callback *cb)
+{
+	struct tcpdiagreq *r = NLMSG_DATA(cb->nlh);
+
+	if (cb->nlh->nlmsg_len > 4 + NLMSG_SPACE(sizeof(*r))) {
+		struct tcpdiag_entry entry;
+		struct rtattr *bc = (struct rtattr *)(r + 1);
+		struct inet_opt *inet = inet_sk(sk);
+
+		entry.family = sk->sk_family;
+#ifdef CONFIG_IP_TCPDIAG_IPV6
+		if (entry.family == AF_INET6) {
+			struct ipv6_pinfo *np = inet6_sk(sk);
+
+			entry.saddr = np->rcv_saddr.s6_addr32;
+			entry.daddr = np->daddr.s6_addr32;
+		} else
+#endif
+		{
+			entry.saddr = &inet->rcv_saddr;
+			entry.daddr = &inet->daddr;
+		}
+		entry.sport = inet->num;
+		entry.dport = ntohs(inet->dport);
+		entry.userlocks = sk->sk_userlocks;
+
+		if (!tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), &entry))
+			return 0;
+	}
+
+	return tcpdiag_fill(skb, sk, r->tcpdiag_ext, NETLINK_CB(cb->skb).pid,
+			    cb->nlh->nlmsg_seq, NLM_F_MULTI);
+}
+
+static int tcpdiag_fill_req(struct sk_buff *skb, struct sock *sk,
+			    struct open_request *req,
+			    u32 pid, u32 seq)
+{
+	struct inet_opt *inet = inet_sk(sk);
+	unsigned char *b = skb->tail;
+	struct tcpdiagmsg *r;
+	struct nlmsghdr *nlh;
+	long tmo;
+
+	nlh = NLMSG_PUT(skb, pid, seq, TCPDIAG_GETSOCK, sizeof(*r));
+	nlh->nlmsg_flags = NLM_F_MULTI;
+	r = NLMSG_DATA(nlh);
+
+	r->tcpdiag_family = sk->sk_family;
+	r->tcpdiag_state = TCP_SYN_RECV;
+	r->tcpdiag_timer = 1;
+	r->tcpdiag_retrans = req->retrans;
+
+	r->id.tcpdiag_if = sk->sk_bound_dev_if;
+	r->id.tcpdiag_cookie[0] = (u32)(unsigned long)req;
+	r->id.tcpdiag_cookie[1] = (u32)(((unsigned long)req >> 31) >> 1);
+
+	tmo = req->expires - jiffies;
+	if (tmo < 0)
+		tmo = 0;
+
+	r->id.tcpdiag_sport = inet->sport;
+	r->id.tcpdiag_dport = req->rmt_port;
+	r->id.tcpdiag_src[0] = req->af.v4_req.loc_addr;
+	r->id.tcpdiag_dst[0] = req->af.v4_req.rmt_addr;
+	r->tcpdiag_expires = jiffies_to_msecs(tmo),
+	r->tcpdiag_rqueue = 0;
+	r->tcpdiag_wqueue = 0;
+	r->tcpdiag_uid = sock_i_uid(sk);
+	r->tcpdiag_inode = 0;
+#ifdef CONFIG_IP_TCPDIAG_IPV6
+	if (r->tcpdiag_family == AF_INET6) {
+		ipv6_addr_copy((struct in6_addr *)r->id.tcpdiag_src,
+			       &req->af.v6_req.loc_addr);
+		ipv6_addr_copy((struct in6_addr *)r->id.tcpdiag_dst,
+			       &req->af.v6_req.rmt_addr);
+	}
+#endif
+	nlh->nlmsg_len = skb->tail - b;
+
+	return skb->len;
+
+nlmsg_failure:
+	skb_trim(skb, b - skb->data);
+	return -1;
+}
+
+static int tcpdiag_dump_reqs(struct sk_buff *skb, struct sock *sk,
+			     struct netlink_callback *cb)
+{
+	struct tcpdiag_entry entry;
+	struct tcpdiagreq *r = NLMSG_DATA(cb->nlh);
+	struct tcp_opt *tp = tcp_sk(sk);
+	struct tcp_listen_opt *lopt;
+	struct rtattr *bc = NULL;
+	struct inet_opt *inet = inet_sk(sk);
+	int j, s_j;
+	int reqnum, s_reqnum;
+	int err = 0;
+
+	s_j = cb->args[3];
+	s_reqnum = cb->args[4];
+
+	if (s_j > 0)
+		s_j--;
+
+	entry.family = sk->sk_family;
+
+	read_lock_bh(&tp->syn_wait_lock);
+
+	lopt = tp->listen_opt;
+	if (!lopt || !lopt->qlen)
+		goto out;
+
+	if (cb->nlh->nlmsg_len > 4 + NLMSG_SPACE(sizeof(*r))) {
+		bc = (struct rtattr *)(r + 1);
+		entry.sport = inet->num;
+		entry.userlocks = sk->sk_userlocks;
+	}
+
+	for (j = s_j; j < TCP_SYNQ_HSIZE; j++) {
+		struct open_request *req, *head = lopt->syn_table[j];
+
+		reqnum = 0;
+		for (req = head; req; reqnum++, req = req->dl_next) {
+			if (reqnum < s_reqnum)
+				continue;
+			if (r->id.tcpdiag_dport != req->rmt_port &&
+			    r->id.tcpdiag_dport)
+				continue;
+
+			if (bc) {
+				entry.saddr =
+#ifdef CONFIG_IP_TCPDIAG_IPV6
+					(entry.family == AF_INET6) ?
+					req->af.v6_req.loc_addr.s6_addr32 :
+#endif
+					&req->af.v4_req.loc_addr;
+				entry.daddr = 
+#ifdef CONFIG_IP_TCPDIAG_IPV6
+					(entry.family == AF_INET6) ?
+					req->af.v6_req.rmt_addr.s6_addr32 :
+#endif
+					&req->af.v4_req.rmt_addr;
+				entry.dport = ntohs(req->rmt_port);
+
+				if (!tcpdiag_bc_run(RTA_DATA(bc),
+						    RTA_PAYLOAD(bc), &entry))
+					continue;
+			}
+
+			err = tcpdiag_fill_req(skb, sk, req,
+					       NETLINK_CB(cb->skb).pid,
+					       cb->nlh->nlmsg_seq);
+			if (err < 0) {
+				cb->args[3] = j + 1;
+				cb->args[4] = reqnum;
+				goto out;
+			}
+		}
+
+		s_reqnum = 0;
+	}
+
+out:
+	read_unlock_bh(&tp->syn_wait_lock);
+
+	return err;
+}
 
 static int tcpdiag_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	int i, num;
 	int s_i, s_num;
 	struct tcpdiagreq *r = NLMSG_DATA(cb->nlh);
-	struct rtattr *bc = NULL;
-
-	if (cb->nlh->nlmsg_len > 4+NLMSG_SPACE(sizeof(struct tcpdiagreq)))
-		bc = (struct rtattr*)(r+1);
 
 	s_i = cb->args[1];
 	s_num = num = cb->args[2];
@@ -488,31 +607,47 @@
 			struct sock *sk;
 			struct hlist_node *node;
 
-			if (i > s_i)
-				s_num = 0;
-
 			num = 0;
 			sk_for_each(sk, node, &tcp_listening_hash[i]) {
 				struct inet_opt *inet = inet_sk(sk);
-				if (num < s_num)
-					goto next_listen;
-				if (!(r->tcpdiag_states&TCPF_LISTEN) ||
-				    r->id.tcpdiag_dport)
-					goto next_listen;
+
+				if (num < s_num) {
+					num++;
+					continue;
+				}
+
 				if (r->id.tcpdiag_sport != inet->sport &&
 				    r->id.tcpdiag_sport)
 					goto next_listen;
-				if (bc && !tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), sk))
+
+				if (!(r->tcpdiag_states&TCPF_LISTEN) ||
+				    r->id.tcpdiag_dport ||
+				    cb->args[3] > 0)
+					goto syn_recv;
+
+				if (tcpdiag_dump_sock(skb, sk, cb) < 0) {
+					tcp_listen_unlock();
+					goto done;
+				}
+
+syn_recv:
+				if (!(r->tcpdiag_states&TCPF_SYN_RECV))
 					goto next_listen;
-				if (tcpdiag_fill(skb, sk, r->tcpdiag_ext,
-						 NETLINK_CB(cb->skb).pid,
-						 cb->nlh->nlmsg_seq) <= 0) {
+
+				if (tcpdiag_dump_reqs(skb, sk, cb) < 0) {
 					tcp_listen_unlock();
 					goto done;
 				}
+
 next_listen:
+				cb->args[3] = 0;
+				cb->args[4] = 0;
 				++num;
 			}
+
+			s_num = 0;
+			cb->args[3] = 0;
+			cb->args[4] = 0;
 		}
 		tcp_listen_unlock();
 skip_listen_ht:
@@ -546,11 +681,7 @@
 				goto next_normal;
 			if (r->id.tcpdiag_dport != inet->dport && r->id.tcpdiag_dport)
 				goto next_normal;
-			if (bc && !tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), sk))
-				goto next_normal;
-			if (tcpdiag_fill(skb, sk, r->tcpdiag_ext,
-					 NETLINK_CB(cb->skb).pid,
-					 cb->nlh->nlmsg_seq) <= 0) {
+			if (tcpdiag_dump_sock(skb, sk, cb) < 0) {
 				read_unlock_bh(&head->lock);
 				goto done;
 			}
@@ -571,11 +702,7 @@
 				if (r->id.tcpdiag_dport != inet->dport &&
 				    r->id.tcpdiag_dport)
 					goto next_dying;
-				if (bc && !tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), sk))
-					goto next_dying;
-				if (tcpdiag_fill(skb, sk, r->tcpdiag_ext,
-						 NETLINK_CB(cb->skb).pid,
-						 cb->nlh->nlmsg_seq) <= 0) {
+				if (tcpdiag_dump_sock(skb, sk, cb) < 0) {
 					read_unlock_bh(&head->lock);
 					goto done;
 				}
@@ -657,9 +784,19 @@
 	}
 }
 
-void __init tcpdiag_init(void)
+static int __init tcpdiag_init(void)
 {
 	tcpnl = netlink_kernel_create(NETLINK_TCPDIAG, tcpdiag_rcv);
 	if (tcpnl == NULL)
-		panic("tcpdiag_init: Cannot create netlink socket.");
+		return -ENOMEM;
+	return 0;
 }
+
+static void __exit tcpdiag_exit(void)
+{
+	sock_release(tcpnl->sk_socket);
+}
+
+module_init(tcpdiag_init);
+module_exit(tcpdiag_exit);
+MODULE_LICENSE("GPL");
diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c	2004-12-24 13:37:04 -08:00
+++ b/net/ipv4/tcp_input.c	2004-12-24 13:37:04 -08:00
@@ -2369,25 +2369,19 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
 	struct tcp_skb_cb *scb = TCP_SKB_CB(skb); 
-	__u32 mss = tcp_skb_mss(skb);
-	__u32 snd_una = tp->snd_una;
-	__u32 orig_seq, seq;
-	__u32 packets_acked = 0;
+	__u32 seq = tp->snd_una;
+	__u32 packets_acked;
 	int acked = 0;
 
 	/* If we get here, the whole TSO packet has not been
 	 * acked.
 	 */
-	BUG_ON(!after(scb->end_seq, snd_una));
+	BUG_ON(!after(scb->end_seq, seq));
 
-	seq = orig_seq = scb->seq;
-	while (!after(seq + mss, snd_una)) {
-		packets_acked++;
-		seq += mss;
-	}
-
-	if (tcp_trim_head(sk, skb, (seq - orig_seq)))
+	packets_acked = tcp_skb_pcount(skb);
+	if (tcp_trim_head(sk, skb, seq - scb->seq))
 		return 0;
+	packets_acked -= tcp_skb_pcount(skb);
 
 	if (packets_acked) {
 		__u8 sacked = scb->sacked;
@@ -3034,8 +3028,8 @@
 							tp->snd_wscale = *(__u8 *)ptr;
 							if(tp->snd_wscale > 14) {
 								if(net_ratelimit())
-									printk("tcp_parse_options: Illegal window "
-									       "scaling value %d >14 received.",
+									printk(KERN_INFO "tcp_parse_options: Illegal window "
+									       "scaling value %d >14 received.\n",
 									       tp->snd_wscale);
 								tp->snd_wscale = 14;
 							}
@@ -4963,7 +4957,6 @@
 
 EXPORT_SYMBOL(sysctl_tcp_ecn);
 EXPORT_SYMBOL(sysctl_tcp_reordering);
-EXPORT_SYMBOL(tcp_cwnd_application_limited);
 EXPORT_SYMBOL(tcp_parse_options);
 EXPORT_SYMBOL(tcp_rcv_established);
 EXPORT_SYMBOL(tcp_rcv_state_process);
diff -Nru a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
--- a/net/ipv4/tcp_ipv4.c	2004-12-24 13:36:34 -08:00
+++ b/net/ipv4/tcp_ipv4.c	2004-12-24 13:36:34 -08:00
@@ -448,8 +448,8 @@
 }
 
 /* Optimize the common listener case. */
-inline struct sock *tcp_v4_lookup_listener(u32 daddr, unsigned short hnum,
-					   int dif)
+static inline struct sock *tcp_v4_lookup_listener(u32 daddr,
+		unsigned short hnum, int dif)
 {
 	struct sock *sk = NULL;
 	struct hlist_head *head;
@@ -535,6 +535,8 @@
 	return sk;
 }
 
+EXPORT_SYMBOL_GPL(tcp_v4_lookup);
+
 static inline __u32 tcp_v4_init_sequence(struct sock *sk, struct sk_buff *skb)
 {
 	return secure_tcp_sequence_number(skb->nh.iph->daddr,
@@ -2596,6 +2598,7 @@
 
 struct proto tcp_prot = {
 	.name			= "TCP",
+	.owner			= THIS_MODULE,
 	.close			= tcp_close,
 	.connect		= tcp_v4_connect,
 	.disconnect		= tcp_disconnect,
@@ -2653,7 +2656,6 @@
 EXPORT_SYMBOL(tcp_v4_conn_request);
 EXPORT_SYMBOL(tcp_v4_connect);
 EXPORT_SYMBOL(tcp_v4_do_rcv);
-EXPORT_SYMBOL(tcp_v4_lookup_listener);
 EXPORT_SYMBOL(tcp_v4_rebuild_header);
 EXPORT_SYMBOL(tcp_v4_remember_stamp);
 EXPORT_SYMBOL(tcp_v4_send_check);
@@ -2663,8 +2665,7 @@
 EXPORT_SYMBOL(tcp_proc_register);
 EXPORT_SYMBOL(tcp_proc_unregister);
 #endif
-#ifdef CONFIG_SYSCTL
 EXPORT_SYMBOL(sysctl_local_port_range);
 EXPORT_SYMBOL(sysctl_max_syn_backlog);
 EXPORT_SYMBOL(sysctl_tcp_low_latency);
-#endif
+
diff -Nru a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
--- a/net/ipv4/tcp_minisocks.c	2004-12-24 13:37:04 -08:00
+++ b/net/ipv4/tcp_minisocks.c	2004-12-24 13:37:04 -08:00
@@ -706,7 +706,7 @@
 		sock_lock_init(newsk);
 		bh_lock_sock(newsk);
 
-		newsk->sk_dst_lock = RW_LOCK_UNLOCKED;
+		rwlock_init(&newsk->sk_dst_lock);
 		atomic_set(&newsk->sk_rmem_alloc, 0);
 		skb_queue_head_init(&newsk->sk_receive_queue);
 		atomic_set(&newsk->sk_wmem_alloc, 0);
@@ -719,7 +719,7 @@
 		newsk->sk_userlocks = sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
 		newsk->sk_backlog.head = newsk->sk_backlog.tail = NULL;
 		newsk->sk_send_head = NULL;
-		newsk->sk_callback_lock = RW_LOCK_UNLOCKED;
+		rwlock_init(&newsk->sk_callback_lock);
 		skb_queue_head_init(&newsk->sk_error_queue);
 		newsk->sk_write_space = sk_stream_write_space;
 
@@ -1075,7 +1075,3 @@
 EXPORT_SYMBOL(tcp_create_openreq_child);
 EXPORT_SYMBOL(tcp_timewait_state_process);
 EXPORT_SYMBOL(tcp_tw_deschedule);
-
-#ifdef CONFIG_SYSCTL
-EXPORT_SYMBOL(sysctl_tcp_tw_recycle);
-#endif
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2004-12-24 13:37:01 -08:00
+++ b/net/ipv4/tcp_output.c	2004-12-24 13:37:01 -08:00
@@ -455,9 +455,13 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
 	struct sk_buff *buff;
-	int nsize = skb->len - len;
+	int nsize;
 	u16 flags;
 
+	nsize = skb_headlen(skb) - len;
+	if (nsize < 0)
+		nsize = 0;
+
 	if (skb_cloned(skb) &&
 	    skb_is_nonlinear(skb) &&
 	    pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
@@ -562,8 +566,6 @@
 
 int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 {
-	struct tcp_opt *tp = tcp_sk(sk);
-
 	if (skb_cloned(skb) &&
 	    pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
@@ -586,7 +588,8 @@
 	/* Any change of skb->len requires recalculation of tso
 	 * factor and mss.
 	 */
-	tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
+	if (tcp_skb_pcount(skb) > 1)
+		tcp_set_skb_tso_segs(skb, tcp_skb_mss(skb));
 
 	return 0;
 }
@@ -1102,6 +1105,8 @@
 		/* Update global TCP statistics. */
 		TCP_INC_STATS(TCP_MIB_RETRANSSEGS);
 
+		tp->total_retrans++;
+
 #if FASTRETRANS_DEBUG > 0
 		if (TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_RETRANS) {
 			if (net_ratelimit())
@@ -1715,12 +1720,7 @@
 	}
 }
 
-EXPORT_SYMBOL(tcp_acceptable_seq);
 EXPORT_SYMBOL(tcp_connect);
-EXPORT_SYMBOL(tcp_connect_init);
 EXPORT_SYMBOL(tcp_make_synack);
-EXPORT_SYMBOL(tcp_send_synack);
 EXPORT_SYMBOL(tcp_simple_retransmit);
 EXPORT_SYMBOL(tcp_sync_mss);
-EXPORT_SYMBOL(tcp_write_wakeup);
-EXPORT_SYMBOL(tcp_write_xmit);
diff -Nru a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
--- a/net/ipv4/tcp_timer.c	2004-12-24 13:37:19 -08:00
+++ b/net/ipv4/tcp_timer.c	2004-12-24 13:37:19 -08:00
@@ -36,7 +36,9 @@
 static void tcp_delack_timer(unsigned long);
 static void tcp_keepalive_timer (unsigned long data);
 
-const char timer_bug_msg[] = KERN_DEBUG "tcpbug: unknown timer value\n";
+#ifdef TCP_DEBUG
+const char tcp_timer_bug_msg[] = KERN_DEBUG "tcpbug: unknown timer value\n";
+#endif
 
 /*
  * Using different timers for retransmit, delayed acks and probes
@@ -651,3 +653,6 @@
 EXPORT_SYMBOL(tcp_delete_keepalive_timer);
 EXPORT_SYMBOL(tcp_init_xmit_timers);
 EXPORT_SYMBOL(tcp_reset_keepalive_timer);
+#ifdef TCP_DEBUG
+EXPORT_SYMBOL(tcp_timer_bug_msg);
+#endif
diff -Nru a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
--- a/net/ipv6/tcp_ipv6.c	2004-12-24 13:36:56 -08:00
+++ b/net/ipv6/tcp_ipv6.c	2004-12-24 13:36:56 -08:00
@@ -262,7 +262,7 @@
 			
 			score = 1;
 			if (!ipv6_addr_any(&np->rcv_saddr)) {
-				if (ipv6_addr_cmp(&np->rcv_saddr, daddr))
+				if (!ipv6_addr_equal(&np->rcv_saddr, daddr))
 					continue;
 				score++;
 			}
@@ -321,8 +321,8 @@
 
 		if(*((__u32 *)&(tw->tw_dport))	== ports	&&
 		   sk->sk_family		== PF_INET6) {
-			if(!ipv6_addr_cmp(&tw->tw_v6_daddr, saddr)	&&
-			   !ipv6_addr_cmp(&tw->tw_v6_rcv_saddr, daddr)	&&
+			if(ipv6_addr_equal(&tw->tw_v6_daddr, saddr)	&&
+			   ipv6_addr_equal(&tw->tw_v6_rcv_saddr, daddr)	&&
 			   (!sk->sk_bound_dev_if || sk->sk_bound_dev_if == dif))
 				goto hit;
 		}
@@ -364,6 +364,8 @@
 	return sk;
 }
 
+EXPORT_SYMBOL_GPL(tcp_v6_lookup);
+
 
 /*
  * Open request hash tables.
@@ -404,8 +406,8 @@
 	     prev = &req->dl_next) {
 		if (req->rmt_port == rport &&
 		    req->class->family == AF_INET6 &&
-		    !ipv6_addr_cmp(&req->af.v6_req.rmt_addr, raddr) &&
-		    !ipv6_addr_cmp(&req->af.v6_req.loc_addr, laddr) &&
+		    ipv6_addr_equal(&req->af.v6_req.rmt_addr, raddr) &&
+		    ipv6_addr_equal(&req->af.v6_req.loc_addr, laddr) &&
 		    (!req->af.v6_req.iif || req->af.v6_req.iif == iif)) {
 			BUG_TRAP(req->sk == NULL);
 			*prevp = prev;
@@ -461,8 +463,8 @@
 
 		if(*((__u32 *)&(tw->tw_dport))	== ports	&&
 		   sk2->sk_family		== PF_INET6	&&
-		   !ipv6_addr_cmp(&tw->tw_v6_daddr, saddr)	&&
-		   !ipv6_addr_cmp(&tw->tw_v6_rcv_saddr, daddr)	&&
+		   ipv6_addr_equal(&tw->tw_v6_daddr, saddr)	&&
+		   ipv6_addr_equal(&tw->tw_v6_rcv_saddr, daddr)	&&
 		   sk2->sk_bound_dev_if == sk->sk_bound_dev_if) {
 			struct tcp_opt *tp = tcp_sk(sk);
 
@@ -608,7 +610,7 @@
 	}
 
 	if (tp->ts_recent_stamp &&
-	    ipv6_addr_cmp(&np->daddr, &usin->sin6_addr)) {
+	    !ipv6_addr_equal(&np->daddr, &usin->sin6_addr)) {
 		tp->ts_recent = 0;
 		tp->ts_recent_stamp = 0;
 		tp->write_seq = 0;
@@ -1802,6 +1804,7 @@
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct flowi fl;
 	struct dst_entry *dst;
+	struct in6_addr *final_p = NULL, final;
 
 	memset(&fl, 0, sizeof(fl));
 	fl.proto = IPPROTO_TCP;
@@ -1815,7 +1818,9 @@
 
 	if (np->opt && np->opt->srcrt) {
 		struct rt0_hdr *rt0 = (struct rt0_hdr *) np->opt->srcrt;
+		ipv6_addr_copy(&final, &fl.fl6_dst);
 		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
+		final_p = &final;
 	}
 
 	dst = __sk_dst_check(sk, np->dst_cookie);
@@ -1828,6 +1833,9 @@
 			return err;
 		}
 
+		if (final_p)
+			ipv6_addr_copy(&fl.fl6_dst, final_p);
+
 		if ((err = xfrm_lookup(&dst, &fl, sk, 0)) < 0) {
 			sk->sk_route_caps = 0;
 			dst_release(dst);
@@ -2124,6 +2132,7 @@
 
 struct proto tcpv6_prot = {
 	.name			= "TCPv6",
+	.owner			= THIS_MODULE,
 	.close			= tcp_close,
 	.connect		= tcp_v6_connect,
 	.disconnect		= tcp_disconnect,
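
The tcp_tso_acked() hunk in net/ipv4/tcp_input.c above replaces the per-MSS counting loop with a before/after comparison of the segment count around tcp_trim_head(). Below is a minimal userspace sketch of that arithmetic, assuming the segment count is simply the payload length divided by the MSS and rounded up. The helper names pcount() and acked_after_trim() are made up for illustration and are not kernel functions.

```c
#include <assert.h>

/* Hypothetical helper: number of MSS-sized segments needed for len bytes. */
static unsigned int pcount(unsigned int len, unsigned int mss)
{
	assert(mss > 0);
	return (len + mss - 1) / mss;
}

/*
 * Segments counted as acked once the first acked_bytes bytes of a
 * len-byte TSO frame have been trimmed off: the segment count before
 * trimming minus the segment count of what remains.
 */
static unsigned int acked_after_trim(unsigned int len, unsigned int mss,
				     unsigned int acked_bytes)
{
	unsigned int before = pcount(len, mss);

	assert(acked_bytes < len);	/* the whole frame has not been acked */
	return before - pcount(len - acked_bytes, mss);
}
```

For a frame of four 1460-byte segments, an ACK covering 2920 bytes yields 2 acked segments in this model, while an ACK covering only 2000 bytes yields 1, because the remaining 3840 bytes still need three segments.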

* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-20 23:06 Hubert Tonneau
  0 siblings, 0 replies; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-20 23:06 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Nivedita Singhvi
  Cc: Stephen Hemminger, romieu, kuznet, niv, rick.jones2, netdev

I've noticed something very interesting:
when sending to a Mac OSX box connected at gigabit speed instead of 100 Mbps,
there is no drastic slowdown when switching from Linux 2.6.9 to 2.6.10.


> Any chance you could
> send me just the following from your boxes:
> (Before and after the transfer)
>
> - /proc/net/snmp
> - /proc/net/netstat

Here is the requested extra information:

2.6.10-ac10 before:

Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 47336 0 0 0 0 0 47197 127721 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 2 0 0 0 0 0 0 0 2 0 0 0 0 417 0 417 0 0 0 0 0 0 0 0 0 0
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 40 209 0 2 7 46158 126953 156 0 243
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 332 417 0 336

TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLoss TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnSyn TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures
TcpExt: 0 0 0 0 0 0 0 0 0 0 94 0 0 0 0 0 452 0 0 0 0 9499 215 241030 0 7583 377 16696 3330 123 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 123 0 0 7 0 0 0 0 0 0 0 0 0 90 0 0 2 0 0 0

2.6.10-ac10 after sending to the 100 Mbps connected Mac OSX:

Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 70100 0 0 0 0 0 69901 214176 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 2 0 0 0 0 0 0 0 2 0 0 0 0 421 0 421 0 0 0 0 0 0 0 0 0 0
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 49 263 0 2 9 68728 213354 284 0 315
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 382 421 0 386

TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLoss TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnSyn TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures
TcpExt: 0 0 0 0 0 0 0 0 0 0 105 0 0 0 0 0 804 0 0 0 0 12808 215 310763 0 11460 472 26236 5086 247 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 247 0 0 11 0 0 0 0 0 0 0 0 0 123 0 0 2 0 0 0
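
The two snapshots are easier to read as differences. Below is a small sketch in C with a handful of values copied by hand from the Tcp: and TcpExt: lines above. The struct and function names are illustrative only. Keep in mind that these counters are system-wide, so on a production server part of each difference may belong to other connections.

```c
#include <assert.h>

/* One counter sampled before and after the transfer. */
struct counter {
	const char *name;
	unsigned long before;
	unsigned long after;
};

/* Values copied by hand from the two snapshots above. */
static const struct counter counters[] = {
	{ "OutSegs",         126953, 213354 },
	{ "RetransSegs",        156,    284 },
	{ "TCPRenoRecovery",    123,    247 },
	{ "TCPFastRetrans",     123,    247 },
	{ "TCPTimeouts",          7,     11 },
};

/* Growth of one counter over the sampling interval. */
static unsigned long counter_delta(const struct counter *c)
{
	assert(c->after >= c->before);	/* no wraparound in this sample */
	return c->after - c->before;
}

/* Retransmitted segments per 1000 segments sent during the interval. */
static unsigned long retrans_per_mille(void)
{
	return counter_delta(&counters[1]) * 1000 / counter_delta(&counters[0]);
}
```

With these numbers, 128 of the 86401 segments sent in the interval were retransmissions, TCPRenoRecovery and TCPFastRetrans each grew by 124, and four retransmit timeouts were recorded.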

* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-16 20:00 Hubert Tonneau
  0 siblings, 0 replies; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-16 20:00 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov
  Cc: shemminger, romieu, kuznet, niv, rick.jones2, netdev

David S. Miller wrote:
>
> Hubert, do you have netfilter enabled in the 2.6.10 kernel you are running?
> 
> I'm asking because the TCP changes in 2.6.10 are pretty benign
> (attached for the curious who want to review along), whereas
> netfilter had major updates particularly in the TCP connection
> tracking code.

There is no netfilter on this server.

> I also reviewed 2.6.10-ac11 for anything interesting wrt. TCP and the
> only thing in there is the tcp_retrans_try_collapse() missing check
> to avoid collapsing TSO segments.

I'm using 2.6.10-ac11 for security reasons. I could use 2.6.10-as1 as well.
As far as I know, they all behave exactly the same from the TCP point of view.
The difference is definitely between stock 2.6.9 and stock 2.6.10.

If it helps, you can send me a patch reverting the TCP changes between 2.6.9
and 2.6.10, and I'll give it a spin, just to be sure that the problem is
truly related to the TCP code and not a side effect of other changes.

Anyway, here is the set of settings I'm using to build the kernel, and no
module is loaded while the test is running:

CONFIG_2GB:  y
CONFIG_ACPI:  y
CONFIG_ACPI_AC:  m
CONFIG_ACPI_BATTERY:  m
CONFIG_ACPI_BUTTON:  m
CONFIG_ACPI_FAN:  m
CONFIG_ACPI_PROCESSOR:  y
CONFIG_ACPI_SLEEP:  y
CONFIG_ACPI_THERMAL:  y
CONFIG_ACPI_VIDEO:  m
CONFIG_APM_RTC_IS_GMT:  y
CONFIG_ATALK:  m
CONFIG_AUTODETECT_RAID:  y
CONFIG_AUTOFS_FS:  m
CONFIG_BINFMT_ELF:  y
CONFIG_BINFMT_MISC:  y
CONFIG_BLK_DEV_CMD640:  y
CONFIG_BLK_DEV_FD:  m
CONFIG_BLK_DEV_GENERIC:  y
CONFIG_BLK_DEV_IDE:  y
CONFIG_BLK_DEV_IDECD:  m
CONFIG_BLK_DEV_IDEDISK:  y
CONFIG_BLK_DEV_IDEDMA:  y
CONFIG_BLK_DEV_IDEDMA_PCI:  y
CONFIG_BLK_DEV_IDEPCI:  y
CONFIG_BLK_DEV_IDESCSI:  m
CONFIG_BLK_DEV_LOOP:  m
CONFIG_BLK_DEV_MD:  y
CONFIG_BLK_DEV_NBD:  m
CONFIG_BLK_DEV_PIIX:  y
CONFIG_BLK_DEV_RAM:  m
CONFIG_BLK_DEV_RZ1000:  y
CONFIG_BLK_DEV_SD:  y
CONFIG_BLK_DEV_SR:  m
CONFIG_BLK_DEV_TRIRON:  y
CONFIG_BSD_PROCESS_ACCT:  y
CONFIG_CHR_DEV_SG:  m
CONFIG_CHR_DEV_ST:  m
CONFIG_CODA_FS:  m
CONFIG_E1000:  y
CONFIG_EXPERIMENTAL:  y
CONFIG_EXT2_FS:  y
CONFIG_EXT3_FS:  y
CONFIG_EXT3_FS_XATTR:  y
CONFIG_FAT_FS:  m
CONFIG_FILTER:  y
CONFIG_FUSION:  y
CONFIG_FUSION_CTL:  m
CONFIG_FUSION_ISENSE:  m
CONFIG_FUSION_LAN:  m
CONFIG_HFSPLUS_FS:  m
CONFIG_HFS_FS:  m
CONFIG_HIGHMEM:  y
CONFIG_HIGHMEM4G:  y
CONFIG_HPET_TIMER:  y
CONFIG_HPFS_FS:  m
CONFIG_IDE:  y
CONFIG_IDEDMA_AUTO:  y
CONFIG_IDEDMA_ONLYDISK:  y
CONFIG_IDEDMA_PCI_AUTO:  y
CONFIG_IDEPCI_SHARE_IRQ:  y
CONFIG_IDE_GENERIC:  y
CONFIG_INET:  y
CONFIG_INPUT:  y
CONFIG_INPUT_KEYBDEV:  m
CONFIG_INPUT_KEYBOARD:  y
CONFIG_INPUT_MOUSE:  y
CONFIG_INPUT_MOUSEDEV:  m
CONFIG_IP_ALIAS:  y
CONFIG_IP_ROUTE_VERBOSE:  y
CONFIG_IRQBALANCE:  y
CONFIG_ISO9660_FS:  m
CONFIG_KCORE_ELF:  y
CONFIG_KEYBOARD_ATKBD:  y
CONFIG_LEGACY_PTYS:  y
CONFIG_LOCKD:  m
CONFIG_M386:  n
CONFIG_M486:  n
CONFIG_M586:  n
CONFIG_M686:  n
CONFIG_MAC_PARTITION:  y
CONFIG_MD:  y
CONFIG_MD_BOOT:  y
CONFIG_MD_LINEAR:  y
CONFIG_MD_LVM:  n
CONFIG_MD_MIRRORING:  y
CONFIG_MD_RAID0:  y
CONFIG_MD_RAID1:  y
CONFIG_MD_RAID5:  y
CONFIG_MD_STRIPED:  y
CONFIG_MD_TRANSLUCENT:  n
CONFIG_MODULES:  y
CONFIG_MODULE_UNLOAD:  y
CONFIG_MOUSE:  m
CONFIG_MOUSE_PS2:  y
CONFIG_MPENTIUM4:  y
CONFIG_MSDOS_FS:  m
CONFIG_MTRR:  y
CONFIG_NET:  y
CONFIG_NETDEVICES:  y
CONFIG_NET_ETHERNET:  y
CONFIG_NFSD:  m
CONFIG_NFS_FS:  m
CONFIG_NLS:  y
CONFIG_NLS_CODEPAGE_437:  m
CONFIG_NLS_CODEPAGE_850:  m
CONFIG_NLS_ISO8859_1:  m
CONFIG_NLS_UTF8:  m
CONFIG_NTFS_FS:  m
CONFIG_OOM_KILLER:  y
CONFIG_PACKET:  y
CONFIG_PARPORT:  m
CONFIG_PARPORT_PC:  m
CONFIG_PCI:  y
CONFIG_PCI_BIOS:  y
CONFIG_PCI_GOANY:  y
CONFIG_PCI_OLD_PROC:  y
CONFIG_PCI_QUIRKS:  y
CONFIG_PIIX_TUNING:  y
CONFIG_PM:  y
CONFIG_PPP:  m
CONFIG_PPPOE:  m
CONFIG_PPP_ASYNC:  m
CONFIG_PPP_BSDCOMP:  m
CONFIG_PPP_DEFLATE:  m
CONFIG_PPP_FILTER:  y
CONFIG_PPP_SYNC_TTY:  m
CONFIG_PREEMPT:  y
CONFIG_PRINTER:  m
CONFIG_PRINTER_READBACK:  y
CONFIG_PROC_FS:  y
CONFIG_PSMOUSE:  y
CONFIG_QNX4FS_FS:  m
CONFIG_REGPARM:  y
CONFIG_RTC:  y
CONFIG_SCSI:  y
CONFIG_SCSI_PROC_FS:  y
CONFIG_SERIAL:  m
CONFIG_SERIAL_8250:  m
CONFIG_SHAPER:  m
CONFIG_SLIP:  m
CONFIG_SMB_FS:  m
CONFIG_SMP:  y
CONFIG_SOUND:  m
CONFIG_SUNRPC:  m
CONFIG_SYSCTL:  y
CONFIG_SYSVIPC:  y
CONFIG_UFS_FS:  m
CONFIG_UMSDOS_FS:  m
CONFIG_UNIX:  y
CONFIG_USB:  m
CONFIG_USB_ACM:  m
CONFIG_USB_AUDIO:  m
CONFIG_USB_CDCETHER:  m
CONFIG_USB_DEVICEFS:  y
CONFIG_USB_EHCI_HCD:  m
CONFIG_USB_HID:  m
CONFIG_USB_HIDINPUT:  y
CONFIG_USB_KBD:  m
CONFIG_USB_MOUSE:  m
CONFIG_USB_OHCI:  m
CONFIG_USB_OHCI_HCD:  m
CONFIG_USB_PRINTER:  m
CONFIG_USB_SERIAL:  m
CONFIG_USB_STORAGE:  m
CONFIG_USB_UHCI:  m
CONFIG_USB_UHCI_ALT:  m
CONFIG_USB_UHCI_HCD:  m
CONFIG_VFAT_FS:  m
CONFIG_VGA_CONSOLE:  y
CONFIG_VT:  y
CONFIG_VT_CONSOLE:  y
CONFIG_X86_MCE:  y
CONFIG_X86_UP_APIC:  y
CONFIG_X86_UP_IOAPIC:  y

While we are at it, here are the hardware components of the box:

8086 	Intel Corporation 	254C 	E7501 	0 		Host Controller
8086 	Intel Corporation 	2543 	E7500/E7501 	0 		HI_B Virtual PCI-to-PCI Bridge
8086 	Intel Corporation 	2545 	E7500/E7501 	0 		HI_C Virtual PCI-to-PCI Bridge
8086 	Intel Corporation 	2547 	E7500/E7501 	0 		HI_D Virtual PCI-to-PCI Bridge
8086 	Intel Corporation 	2482 	82801CA/CAM 	10 		USB Controller
8086 	Intel Corporation 	244E 	82801BA/CA/DB, 6300ESB 	0 		Hub Interface to PCI Bridge
8086 	Intel Corporation 	2480 	82801CA 	0 		LPC Interface Bridge
8086 	Intel Corporation 	248B 	82801CA 	0 		UltraATA/100 IDE Controller
8086 	Intel Corporation 	1461 	14611014 	0 		I/OxAPIC Interrupt Controller
8086 	Intel Corporation 	1461 	14611014 	0 		I/OxAPIC Interrupt Controller
8086 	Intel Corporation 	1461 	14611014 	0 		I/OxAPIC Interrupt Controller
8086 	Intel Corporation 	1461 	14611014 	0 		I/OxAPIC Interrupt Controller
8086 	Intel Corporation 	1461 	14611014 	0 		I/OxAPIC Interrupt Controller
8086 	Intel Corporation 	1461 	14611014 	0 		I/OxAPIC Interrupt Controller
8086 	Intel Corporation 	1460 	82870P2 	0 		Hub Interface-to-PCI Bridge
8086 	Intel Corporation 	1460 	82870P2 	0 		Hub Interface-to-PCI Bridge
8086 	Intel Corporation 	1460 	82870P2 	0 		Hub Interface-to-PCI Bridge
8086 	Intel Corporation 	1460 	82870P2 	0 		Hub Interface-to-PCI Bridge
8086 	Intel Corporation 	1460 	82870P2 	0 		Hub Interface-to-PCI Bridge
8086 	Intel Corporation 	1460 	82870P2 	0 		Hub Interface-to-PCI Bridge
8086 	Intel Corporation 	1026 	82545GM 	18 		Gigabit Ethernet Controller
8086 	Intel Corporation 	100D 	82544GC 	1C 		Gigabit Ethernet Controller (LOM)
8086 	Intel Corporation 	0309 	80303 	0 		I/O Processor PCI-to-PCI Bridge Unit
1000 	LSI Logic 	0030 	LSI53C1020/1030 	78 		PCI-X to Ultra320 SCSI Controller
1000 	LSI Logic 	0030 	LSI53C1020/1030 	79 		PCI-X to Ultra320 SCSI Controller
1002 	ATI Technologies 	4752 	Rage XL PCI 	0 	

And the interrupts (while running 2.6.9):

           CPU0       CPU1       
  0:  159132374  132686719    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  8:          0          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 14:          1          0    IO-APIC-edge  ide0
 24:   22225220          0   IO-APIC-level  eth0
 28:          4  134406507   IO-APIC-level  eth1
120:     532730     578109   IO-APIC-level  ioc0
121:    1931739    1327672   IO-APIC-level  ioc1
NMI:          0          0 
LOC:  291863458  291863528 
ERR:          0
MIS:          0

/proc/net/dev

Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0:2512143307 20914618    0    0    0     0          0         0 1951489031 52933097    0    0    0     0       0          0
  eth1:943883086 75451745    0    0    0     0          0         0 201914508 171409895    0    0    0     0       0          0
    lo:2247204588  748445    0    0    0     0          0         0 2247204588  748445    0    0    0     0       0          0

/proc/net/route

Iface Destination Gateway  Flags RefCnt Use Metric Mask MTU Window IRTT                                                       
eth0 207C29D5 00000000 0001 0 0 0 F0FFFFFF 0 0 0                                                                               
eth1 00606B0A 00000000 0001 0 0 0 00FFFFFF 0 0 0                                                                               
eth0 00000000 217C29D5 0003 0 0 0 00000000 0 0 0                                                                               
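
The Destination, Gateway and Mask columns above are 32-bit values printed in hexadecimal. Below is a minimal sketch of decoding them, assuming the file was produced on a little-endian host (as on this Xeon box), so that the lowest byte of the printed value is the first octet of the address. The names struct ipv4 and route_hex_decode() are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Four octets of an IPv4 address, in the usual left-to-right order. */
struct ipv4 {
	unsigned char octet[4];
};

/*
 * Decode one 8-digit hex field from /proc/net/route.
 * Assumption: little-endian host, so the low byte comes first.
 */
static struct ipv4 route_hex_decode(const char *hex)
{
	unsigned long v = strtoul(hex, NULL, 16);
	struct ipv4 a;
	int i;

	assert(v <= 0xfffffffful);
	for (i = 0; i < 4; i++)
		a.octet[i] = (unsigned char)((v >> (8 * i)) & 0xff);
	return a;
}
```

Under that assumption the eth1 route 00606B0A with mask 00FFFFFF reads as 10.107.96.0 with netmask 255.255.255.0, and the default gateway 217C29D5 reads as 213.41.124.33.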

* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-13 10:52 Hubert Tonneau
  2005-02-14 14:12 ` Alexey Kuznetsov
  0 siblings, 1 reply; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-13 10:52 UTC (permalink / raw)
  To: Alexey Kuznetsov, David S. Miller
  Cc: Alexey Kuznetsov, rick.jones2, shemminger, romieu, netdev

Alexey Kuznetsov wrote:
>
> Exactly. That's why the next test should be with disabled TSO in 2.6.9.
> If too rare PSHs were a problem, it will show as the same disaster there.

After,
ethtool -K eth1 tso off
the result is unchanged on 2.6.9 (14 seconds for 105 MB).

After,
ethtool -K eth1 tso off
the result is also unchanged on 2.6.10-ac11 with no extra TCP patch (325 seconds).

Settings for eth1:
Supported ports: [ TP ]
Supported link modes:   10baseT/Half 10baseT/Full 
                        100baseT/Half 100baseT/Full 
                        1000baseT/Full 
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full 
                        100baseT/Half 100baseT/Full 
                        1000baseT/Full 
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: umbg
Wake-on: g
Current message level: 0x00000007 (7)
Link detected: yes

PS:
Sorry for the long delays between tests. The reason is that
it's a production server, so I cannot run tests in the middle of the day, and
it's remote, so in order to switch the kernel I have to upload the new one
and then upload the old one again to switch back, and the best connection
I have these days is a 30 Kbps modem link. It will improve on Monday, since
I'll have a 128 Kbps ADSL connection.

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-13 10:52 Hubert Tonneau
@ 2005-02-14 14:12 ` Alexey Kuznetsov
  0 siblings, 0 replies; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-14 14:12 UTC (permalink / raw)
  To: Hubert Tonneau
  Cc: Alexey Kuznetsov, David S. Miller, rick.jones2, shemminger,
	romieu, netdev

Hello!

> ethtool -K eth1 tso off
> the result is unchanged on 2.6.9 (14 seconds for 105 MB).
> 
> After,
> ethtool -K eth1 tso off
> the result is also unchanged on 2.6.10-ac11 with no extra TCP patch (325 seconds).

Well, that means the theory was wrong: TSO is innocent. To form a new
theory we need a tcpdump of 2.6.10 with TSO disabled.


> it's a production server,

I hope we can stay in its normal configuration now. TSO may be kept disabled.

Alexey

* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-10 21:53 Hubert Tonneau
  2005-02-10 22:36 ` Rick Jones
  0 siblings, 1 reply; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-10 21:53 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Francois Romieu, Alexey Kuznetsov, netdev

It does not seem to solve the problem:
. Linux 2.6.9 takes 15 seconds to copy 105 MB to the Mac OSX
. Linux 2.6.10 with the TCP patch still takes 325 seconds.
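
For scale, the two timings can be turned into average throughput. This is only a back-of-the-envelope sketch: it assumes 1 MB means 10^6 bytes and ignores protocol overhead, and throughput_mbps() is a name invented for the example.

```c
#include <assert.h>

/* Average throughput in Mbit/s, taking 1 MB as 10^6 bytes (an assumption). */
static double throughput_mbps(double megabytes, double seconds)
{
	assert(seconds > 0.0);
	return megabytes * 8.0 / seconds;
}
```

That works out to about 56 Mbit/s for the 15 second copy and about 2.6 Mbit/s for the 325 second copy, a slowdown of roughly 22 times.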

Stephen Hemminger wrote:
>
> Please try this patch, based on Alexey's suggestion:
> 
> > That's one quick and simple idea: set PSH on each tso segment.
> > Seems, it is always good. Hardware will preserve it only on the last skb and
> > everyone will be happy.
> 
> # This is a BitKeeper generated diff -Nru style patch.
> #
> # ChangeSet
> #   2005/02/09 11:00:57-08:00 shemminger@linux.site 
> #   Always set PUSH on TSO multi-segment frames
> #   to workaround bugs in MacOSX
> # 
> # net/ipv4/tcp_output.c
> #   2005/02/09 11:00:44-08:00 shemminger@linux.site +8 -0
> #   Always set PUSH on TSO multi-segment frames
> #   to workaround bugs in MacOSX
> # 
> diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> --- a/net/ipv4/tcp_output.c	2005-02-09 11:01:12 -08:00
> +++ b/net/ipv4/tcp_output.c	2005-02-09 11:01:12 -08:00
> @@ -754,6 +754,14 @@
>  					break;
>  			}
>  
> +			/* Force push to be on for any large TSO frames
> +			 * to workaround problems with busted implementations
> +			 * like MacOSX that hold off delivery of data until
> +			 * push.
> +			 */
> +			if (tcp_skb_pcount(skb) > 1)
> +			    TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
> +
>  			TCP_SKB_CB(skb)->when = tcp_time_stamp;
>  			if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
>  				break;

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-10 21:53 Hubert Tonneau
@ 2005-02-10 22:36 ` Rick Jones
  2005-02-11  1:16   ` David S. Miller
  0 siblings, 1 reply; 40+ messages in thread
From: Rick Jones @ 2005-02-10 22:36 UTC (permalink / raw)
  To: Hubert Tonneau
  Cc: Stephen Hemminger, Francois Romieu, Alexey Kuznetsov, netdev

Hubert Tonneau wrote:
> It does not seem to solve the problem:
> . Linux 2.6.9 takes 15 seconds to copy 105 MB to the Mac OSX
> . Linux 2.6.10 with the TCP patch still takes 325 seconds.


is there a packet trace somewhere?
rick jones

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-10 22:36 ` Rick Jones
@ 2005-02-11  1:16   ` David S. Miller
  0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-11  1:16 UTC (permalink / raw)
  To: Rick Jones; +Cc: hubert.tonneau, shemminger, romieu, kuznet, netdev

On Thu, 10 Feb 2005 14:36:40 -0800
Rick Jones <rick.jones2@hp.com> wrote:

> Hubert Tonneau wrote:
> > It does not seem to solve the problem:
> > . Linux 2.6.9 takes 15 seconds to copy 105 MB to the Mac OSX
> > . Linux 2.6.10 with the TCP patch still takes 325 seconds.
> 
> 
> is there a packet trace somewhere?

I know what's wrong; no trace is needed. Stephen's patch misses
tcp_push_one() and similar call sites.

He only added the PSH bit setting to tcp_write_xmit().

Hubert, try this patch instead.

===== net/ipv4/tcp_output.c 1.77 vs edited =====
--- 1.77/net/ipv4/tcp_output.c	2005-01-18 12:23:36 -08:00
+++ edited/net/ipv4/tcp_output.c	2005-02-10 16:42:42 -08:00
@@ -408,6 +408,16 @@
 		sk->sk_send_head = skb;
 }
 
+static inline void tcp_tso_set_push(struct sk_buff *skb)
+{
+	/* Force push to be on for any TSO frames to workaround
+	 * problems with busted implementations like Mac OS-X that
+	 * hold off socket receive wakeups until push is seen.
+	 */
+	if (tcp_skb_pcount(skb) > 1)
+		TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
+}
+
 /* Send _single_ skb sitting at the send head. This function requires
  * true push pending frames to setup probe timer etc.
  */
@@ -419,6 +429,7 @@
 	if (tcp_snd_test(tp, skb, cur_mss, TCP_NAGLE_PUSH)) {
 		/* Send it out now. */
 		TCP_SKB_CB(skb)->when = tcp_time_stamp;
+		tcp_tso_set_push(skb);
 		if (!tcp_transmit_skb(sk, skb_clone(skb, sk->sk_allocation))) {
 			sk->sk_send_head = NULL;
 			tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
@@ -755,6 +766,7 @@
 			}
 
 			TCP_SKB_CB(skb)->when = tcp_time_stamp;
+			tcp_tso_set_push(skb);
 			if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
 				break;
 
@@ -1096,6 +1108,7 @@
 	 * is still in somebody's hands, else make a clone.
 	 */
 	TCP_SKB_CB(skb)->when = tcp_time_stamp;
+	tcp_tso_set_push(skb);
 
 	err = tcp_transmit_skb(sk, (skb_cloned(skb) ?
 				    pskb_copy(skb, GFP_ATOMIC):
@@ -1668,6 +1681,7 @@
 
 			TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
 			TCP_SKB_CB(skb)->when = tcp_time_stamp;
+			tcp_tso_set_push(skb);
 			err = tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC));
 			if (!err) {
 				update_send_head(sk, tp, skb);
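
Stripped of kernel types, the helper added by the patch above reduces to a single conditional. The following is only a userspace sketch of that logic: `struct model_skb`, `MODEL_CB_PSH`, and `model_tso_set_push` are hypothetical stand-ins for the kernel's `struct sk_buff`, `TCPCB_FLAG_PSH`, and `tcp_tso_set_push()`, and the `pcount` field plays the role of `tcp_skb_pcount(skb)`.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CB_PSH 0x08 /* stand-in for TCPCB_FLAG_PSH (assumed value) */

/* Hypothetical model of the fields the helper looks at. */
struct model_skb {
	unsigned int pcount;  /* segments this frame will be split into */
	unsigned char flags;  /* TCP flags the frame is sent with */
};

/* Set PSH on any frame that covers more than one segment, i.e. a TSO
 * frame; single-segment frames are left untouched.
 */
void model_tso_set_push(struct model_skb *skb)
{
	assert(skb != NULL);
	if (skb->pcount > 1)
		skb->flags |= MODEL_CB_PSH;
}
```

The point of factoring this into a helper, as the patch does, is that every path that hands a frame to `tcp_transmit_skb()` can call it, not only `tcp_write_xmit()`.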

* Re: 2.6.10 TCP troubles -- suggested patch
       [not found] <050QTJA12@server5.heliogroup.fr>
@ 2005-02-09 18:59 ` Stephen Hemminger
  2005-02-09 20:25   ` David S. Miller
  0 siblings, 1 reply; 40+ messages in thread
From: Stephen Hemminger @ 2005-02-09 18:59 UTC (permalink / raw)
  To: Hubert Tonneau; +Cc: Francois Romieu, Alexey Kuznetsov, netdev

Please try this patch, based on Alexey's suggestion:

> That's one quick and simple idea: set PSH on each tso segment.
> Seems, it is always good. Hardware will preserve it only on the last skb and
> everyone will be happy.

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2005/02/09 11:00:57-08:00 shemminger@linux.site 
#   Always set PUSH on TSO multi-segment frames
#   to workaround bugs in MacOSX
# 
# net/ipv4/tcp_output.c
#   2005/02/09 11:00:44-08:00 shemminger@linux.site +8 -0
#   Always set PUSH on TSO multi-segment frames
#   to workaround bugs in MacOSX
# 
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2005-02-09 11:01:12 -08:00
+++ b/net/ipv4/tcp_output.c	2005-02-09 11:01:12 -08:00
@@ -754,6 +754,14 @@
 					break;
 			}
 
+			/* Force push to be on for any large TSO frames
+			 * to workaround problems with busted implementations
+			 * like MacOSX that hold off delivery of data until
+			 * push.
+			 */
+			if (tcp_skb_pcount(skb) > 1)
+			    TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
+
 			TCP_SKB_CB(skb)->when = tcp_time_stamp;
 			if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
 				break;
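
Alexey's remark that the hardware keeps PSH only on the last segment can be pictured with a small userspace model. This is a sketch under stated assumptions, not driver or kernel code: `model_tso_split_flags` and `MODEL_TCP_PSH` are hypothetical names, and the function simply assumes the behaviour described in the quoted suggestion (PSH cleared on every wire segment except the final one).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_TCP_PSH 0x08 /* PSH bit in the TCP header flags byte */

/* Fill out[0..pcount-1] with the flags each wire segment would carry,
 * assuming the NIC copies the frame's flags to every segment but keeps
 * PSH only on the last one.
 */
void model_tso_split_flags(unsigned char frame_flags, size_t pcount,
			   unsigned char *out)
{
	size_t i;

	assert(pcount > 0 && out != NULL);
	for (i = 0; i < pcount; i++) {
		out[i] = frame_flags;
		if (i + 1 < pcount)
			out[i] &= (unsigned char)~MODEL_TCP_PSH;
	}
}
```

With PSH forced on for every multi-segment frame, the receiver in this model sees a PSH at the end of each TSO burst, which is what the workaround relies on.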

* Re: 2.6.10 TCP troubles -- suggested patch
  2005-02-09 18:59 ` Stephen Hemminger
@ 2005-02-09 20:25   ` David S. Miller
  0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-09 20:25 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: hubert.tonneau, romieu, kuznet, netdev

On Wed, 9 Feb 2005 10:59:09 -0800
Stephen Hemminger <shemminger@osdl.org> wrote:

> Please try this patch, based on Alexey's suggestion:

-EBADINDENTATION :-)

Thread overview: 40+ messages
2005-02-11 21:55 2.6.10 TCP troubles -- suggested patch Hubert Tonneau
2005-02-11 22:54 ` Rick Jones
2005-02-11 23:09   ` Nivedita Singhvi
2005-02-11 23:40     ` Rick Jones
2005-02-12  1:08     ` David S. Miller
2005-02-12  1:09   ` David S. Miller
2005-02-12 14:31     ` Alexey Kuznetsov
2005-02-12 19:28       ` David S. Miller
2005-02-12 19:44         ` Leonid Grossman
2005-02-12 19:52         ` Alexey Kuznetsov
2005-02-15 23:25           ` David S. Miller
2005-02-12 20:19       ` rick jones
2005-02-12 20:28         ` David S. Miller
2005-02-12 20:56         ` Alexey Kuznetsov
2005-02-12 21:27           ` Nivedita Singhvi
2005-02-12 21:43           ` rick jones
2005-02-12 22:00             ` Alexey Kuznetsov
2005-02-13  1:29               ` rick jones
2005-02-11 23:04 ` Stephen Hemminger
2005-02-12  1:07   ` David S. Miller
2005-02-12 12:11     ` Andi Kleen
2005-02-12 19:23       ` David S. Miller
2005-02-12 21:30         ` Andi Kleen
2005-02-12 14:16     ` Alexey Kuznetsov
2005-02-12 19:41       ` David S. Miller
2005-02-12 20:03         ` Alexey Kuznetsov
2005-02-15 23:26           ` David S. Miller
2005-02-15 23:42             ` Rick Jones
2005-02-15 23:23   ` David S. Miller
2005-02-16  9:13     ` Alexey Kuznetsov
2005-02-16 17:50       ` David S. Miller
  -- strict thread matches above, loose matches on Subject: below --
2005-02-20 23:06 Hubert Tonneau
2005-02-16 20:00 Hubert Tonneau
2005-02-13 10:52 Hubert Tonneau
2005-02-14 14:12 ` Alexey Kuznetsov
2005-02-10 21:53 Hubert Tonneau
2005-02-10 22:36 ` Rick Jones
2005-02-11  1:16   ` David S. Miller
     [not found] <050QTJA12@server5.heliogroup.fr>
2005-02-09 18:59 ` Stephen Hemminger
2005-02-09 20:25   ` David S. Miller