* Re: 2.6.10 TCP troubles -- suggested patch
[not found] <050QTJA12@server5.heliogroup.fr>
@ 2005-02-09 18:59 ` Stephen Hemminger
2005-02-09 20:25 ` David S. Miller
0 siblings, 1 reply; 40+ messages in thread
From: Stephen Hemminger @ 2005-02-09 18:59 UTC (permalink / raw)
To: Hubert Tonneau; +Cc: Francois Romieu, Alexey Kuznetsov, netdev
Please try this patch, based on Alexey's suggestion:
> That's one quick and simple idea: set PSH on each tso segment.
> Seems, it is always good. Hardware will preserve it only on the last skb and
> everyone will be happy.
# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
# 2005/02/09 11:00:57-08:00 shemminger@linux.site
# Always set PUSH on TSO multi-segment frames
# to workaround bugs in MacOSX
#
# net/ipv4/tcp_output.c
# 2005/02/09 11:00:44-08:00 shemminger@linux.site +8 -0
# Always set PUSH on TSO multi-segment frames
# to workaround bugs in MacOSX
#
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c 2005-02-09 11:01:12 -08:00
+++ b/net/ipv4/tcp_output.c 2005-02-09 11:01:12 -08:00
@@ -754,6 +754,14 @@
break;
}
+ /* Force push to be on for any large TSO frames
+ * to workaround problems with busted implementations
+ * like MacOSX that hold off delivery of data until
+ * push.
+ */
+ if (tcp_skb_pcount(skb) > 1)
+ TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
+
TCP_SKB_CB(skb)->when = tcp_time_stamp;
if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
break;
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-09 18:59 ` Stephen Hemminger
@ 2005-02-09 20:25 ` David S. Miller
0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-09 20:25 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: hubert.tonneau, romieu, kuznet, netdev
On Wed, 9 Feb 2005 10:59:09 -0800
Stephen Hemminger <shemminger@osdl.org> wrote:
> Please try this patch, based on Alexey's suggestion:
-EBADINDENTATION :-)
* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-10 21:53 Hubert Tonneau
2005-02-10 22:36 ` Rick Jones
0 siblings, 1 reply; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-10 21:53 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Francois Romieu, Alexey Kuznetsov, netdev
It does not seem to solve the problem:
. Linux 2.6.9 takes 15 seconds to copy 105 MB to the Mac OSX
. Linux 2.6.10 with the TCP patch still takes 325 seconds.
Stephen Hemminger wrote:
>
> Please try this patch, based on Alexey's suggestion:
>
> > That's one quick and simple idea: set PSH on each tso segment.
> > Seems, it is always good. Hardware will preserve it only on the last skb and
> > everyone will be happy.
>
> # This is a BitKeeper generated diff -Nru style patch.
> #
> # ChangeSet
> # 2005/02/09 11:00:57-08:00 shemminger@linux.site
> # Always set PUSH on TSO multi-segment frames
> # to workaround bugs in MacOSX
> #
> # net/ipv4/tcp_output.c
> # 2005/02/09 11:00:44-08:00 shemminger@linux.site +8 -0
> # Always set PUSH on TSO multi-segment frames
> # to workaround bugs in MacOSX
> #
> diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> --- a/net/ipv4/tcp_output.c 2005-02-09 11:01:12 -08:00
> +++ b/net/ipv4/tcp_output.c 2005-02-09 11:01:12 -08:00
> @@ -754,6 +754,14 @@
> break;
> }
>
> + /* Force push to be on for any large TSO frames
> + * to workaround problems with busted implementations
> + * like MacOSX that hold off delivery of data until
> + * push.
> + */
> + if (tcp_skb_pcount(skb) > 1)
> + TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
> +
> TCP_SKB_CB(skb)->when = tcp_time_stamp;
> if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
> break;
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-10 21:53 2.6.10 TCP troubles -- suggested patch Hubert Tonneau
@ 2005-02-10 22:36 ` Rick Jones
2005-02-11 1:16 ` David S. Miller
0 siblings, 1 reply; 40+ messages in thread
From: Rick Jones @ 2005-02-10 22:36 UTC (permalink / raw)
To: Hubert Tonneau
Cc: Stephen Hemminger, Francois Romieu, Alexey Kuznetsov, netdev
Hubert Tonneau wrote:
> It does not seem to solve the problem:
> . Linux 2.6.9 takes 15 seconds to copy 105 MB to the Mac OSX
> . Linux 2.6.10 with the TCP patch still takes 325 seconds.
is there a packet trace somewhere?
rick jones
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-10 22:36 ` Rick Jones
@ 2005-02-11 1:16 ` David S. Miller
0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-11 1:16 UTC (permalink / raw)
To: Rick Jones; +Cc: hubert.tonneau, shemminger, romieu, kuznet, netdev
On Thu, 10 Feb 2005 14:36:40 -0800
Rick Jones <rick.jones2@hp.com> wrote:
> Hubert Tonneau wrote:
> > It does not seem to solve the problem:
> > . Linux 2.6.9 takes 15 seconds to copy 105 MB to the Mac OSX
> > . Linux 2.6.10 with the TCP patch still takes 325 seconds.
>
>
> is there a packet trace somewhere?
I know what's wrong, no trace needed, Stephen's patch misses
tcp_push_one() and similar.
He only added the PSH bit setting to tcp_write_xmit().
Hubert, try this patch instead.
===== net/ipv4/tcp_output.c 1.77 vs edited =====
--- 1.77/net/ipv4/tcp_output.c 2005-01-18 12:23:36 -08:00
+++ edited/net/ipv4/tcp_output.c 2005-02-10 16:42:42 -08:00
@@ -408,6 +408,16 @@
sk->sk_send_head = skb;
}
+static inline void tcp_tso_set_push(struct sk_buff *skb)
+{
+ /* Force push to be on for any TSO frames to workaround
+ * problems with busted implementations like Mac OS-X that
+ * hold off socket receive wakeups until push is seen.
+ */
+ if (tcp_skb_pcount(skb) > 1)
+ TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
+}
+
/* Send _single_ skb sitting at the send head. This function requires
* true push pending frames to setup probe timer etc.
*/
@@ -419,6 +429,7 @@
if (tcp_snd_test(tp, skb, cur_mss, TCP_NAGLE_PUSH)) {
/* Send it out now. */
TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ tcp_tso_set_push(skb);
if (!tcp_transmit_skb(sk, skb_clone(skb, sk->sk_allocation))) {
sk->sk_send_head = NULL;
tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
@@ -755,6 +766,7 @@
}
TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ tcp_tso_set_push(skb);
if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
break;
@@ -1096,6 +1108,7 @@
* is still in somebody's hands, else make a clone.
*/
TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ tcp_tso_set_push(skb);
err = tcp_transmit_skb(sk, (skb_cloned(skb) ?
pskb_copy(skb, GFP_ATOMIC):
@@ -1668,6 +1681,7 @@
TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ tcp_tso_set_push(skb);
err = tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC));
if (!err) {
update_send_head(sk, tp, skb);
* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-11 21:55 Hubert Tonneau
2005-02-11 22:54 ` Rick Jones
2005-02-11 23:04 ` Stephen Hemminger
0 siblings, 2 replies; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-11 21:55 UTC (permalink / raw)
To: David S. Miller
Cc: shemminger, romieu, kuznet, Nivedita Singhvi, Rick Jones, netdev
Sorry, it still does not work, unless I made a mistake:
Linux 2.6.9 takes 15 seconds to copy 105 MB to Mac OSX
Linux 2.6.10 with the TCP patch below still takes 325 seconds to do the same.
You can pick the new tcpdump report, created through:
tcpdump -i eth1 ip host 10.107.96.230 -w /tmp/dump-2.6.10-tcp2
at http://fullpliant.org/pliant/browse/file/archive/dump-2.6.10-tcp2.gz
Here is the connection summary:
Dell PowerEdge 2600 (dual Xeon with hyper threading) running libsmbclient
on Linux 2.6.x, IP for eth1 (Intel pro 1000) is 10.107.96.7 (full
duplex, flow control is enabled)
|
|
gigabit switch
|
|
100 Mbps switch
|
|
Mac running Samba server on OSX,
IP is 10.107.96.230
David S. Miller wrote:
>
> Hubert, try this patch instead.
>
> ===== net/ipv4/tcp_output.c 1.77 vs edited =====
> --- 1.77/net/ipv4/tcp_output.c 2005-01-18 12:23:36 -08:00
> +++ edited/net/ipv4/tcp_output.c 2005-02-10 16:42:42 -08:00
> @@ -408,6 +408,16 @@
> sk->sk_send_head = skb;
> }
>
> +static inline void tcp_tso_set_push(struct sk_buff *skb)
> +{
> + /* Force push to be on for any TSO frames to workaround
> + * problems with busted implementations like Mac OS-X that
> + * hold off socket receive wakeups until push is seen.
> + */
> + if (tcp_skb_pcount(skb) > 1)
> + TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
> +}
> +
> /* Send _single_ skb sitting at the send head. This function requires
> * true push pending frames to setup probe timer etc.
> */
> @@ -419,6 +429,7 @@
> if (tcp_snd_test(tp, skb, cur_mss, TCP_NAGLE_PUSH)) {
> /* Send it out now. */
> TCP_SKB_CB(skb)->when = tcp_time_stamp;
> + tcp_tso_set_push(skb);
> if (!tcp_transmit_skb(sk, skb_clone(skb, sk->sk_allocation))) {
> sk->sk_send_head = NULL;
> tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
> @@ -755,6 +766,7 @@
> }
>
> TCP_SKB_CB(skb)->when = tcp_time_stamp;
> + tcp_tso_set_push(skb);
> if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
> break;
>
> @@ -1096,6 +1108,7 @@
> * is still in somebody's hands, else make a clone.
> */
> TCP_SKB_CB(skb)->when = tcp_time_stamp;
> + tcp_tso_set_push(skb);
>
> err = tcp_transmit_skb(sk, (skb_cloned(skb) ?
> pskb_copy(skb, GFP_ATOMIC):
> @@ -1668,6 +1681,7 @@
>
> TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
> TCP_SKB_CB(skb)->when = tcp_time_stamp;
> + tcp_tso_set_push(skb);
> err = tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC));
> if (!err) {
> update_send_head(sk, tp, skb);
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 21:55 Hubert Tonneau
@ 2005-02-11 22:54 ` Rick Jones
2005-02-11 23:09 ` Nivedita Singhvi
2005-02-12 1:09 ` David S. Miller
2005-02-11 23:04 ` Stephen Hemminger
1 sibling, 2 replies; 40+ messages in thread
From: Rick Jones @ 2005-02-11 22:54 UTC (permalink / raw)
To: Hubert Tonneau; +Cc: David S. Miller, shemminger, romieu, kuznet, netdev
Hubert Tonneau wrote:
> Sorry, it still does not work, unless I made a mistake:
> Linux 2.6.9 takes 15 seconds to copy 105 MB to Mac OSX
> Linux 2.6.10 with the TCP patch below still takes 325 seconds to do the same.
>
> You can pick the new tcpdump report, created through:
> tcpdump -i eth1 ip host 10.107.96.230 -w /tmp/dump-2.6.10-tcp2
> at http://fullpliant.org/pliant/browse/file/archive/dump-2.6.10-tcp2.gz
>
> Here is the connection summary:
>
> Dell PowerEdge 2600 (dual Xeon with hyper threading) running libsmbclient
> on Linux 2.6.x, IP for eth1 (Intel pro 1000) is 10.107.96.7 (full
> duplex, flow control is enabled)
> |
> |
> gigabit switch
> |
> |
> 100 Mbps switch
> |
> |
> Mac running Samba server on OSX,
> IP is 10.107.96.230
"Cooking" the trace with tcpdump -ttt to give the relative timestamps makes
things look like Mac OSX has an ACK avoidance heuristic in it? I figured there
was one in their OS <= 9 stack that came from a third-party, wasn't sure if they
put that into their OSX stack - IIRC that one is not from the third-party.
FWIW, there are two or three other stacks that have ACK avoidance heuristics as
well, it isn't an OSX only thing.
000780 10.107.96.230.139 > 10.107.96.7.32801: P 753:822(69) ack 1556 win 65535
<nop,nop,timestamp 1709240657 534173> NBT Packet (DF)
000579 10.107.96.7.32801 > 10.107.96.230.139: . 1556:3004(1448) ack 822 win 1460
<nop,nop,timestamp 534175 1709240657> NBT Packet (DF)
000027 10.107.96.7.32801 > 10.107.96.230.139: . 3004:4452(1448) ack 822 win 1460
<nop,nop,timestamp 534175 1709240657> NBT Packet (DF)
000005 10.107.96.7.32801 > 10.107.96.230.139: . 4452:5900(1448) ack 822 win 1460
<nop,nop,timestamp 534175 1709240657> NBT Packet (DF)
074685 10.107.96.230.139 > 10.107.96.7.32801: . ack 5900 win 62268
<nop,nop,timestamp 1709240657 534175> (DF)
delack above
000012 10.107.96.7.32801 > 10.107.96.230.139: . 5900:7348(1448) ack 822 win 1460
<nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
000003 10.107.96.7.32801 > 10.107.96.230.139: . 7348:8796(1448) ack 822 win 1460
<nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
000002 10.107.96.7.32801 > 10.107.96.230.139: . 8796:10244(1448) ack 822 win
1460 <nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
000002 10.107.96.7.32801 > 10.107.96.230.139: . 10244:11692(1448) ack 822 win
1460 <nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
200024 10.107.96.230.139 > 10.107.96.7.32801: . ack 11692 win 56476
<nop,nop,timestamp 1709240658 534249> (DF)
and again above.
000010 10.107.96.7.32801 > 10.107.96.230.139: . 11692:13140(1448) ack 822 win
1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000004 10.107.96.7.32801 > 10.107.96.230.139: . 13140:14588(1448) ack 822 win
1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000002 10.107.96.7.32801 > 10.107.96.230.139: P 14588:16036(1448) ack 822 win
1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000022 10.107.96.7.32801 > 10.107.96.230.139: . 16036:17484(1448) ack 822 win
1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000004 10.107.96.7.32801 > 10.107.96.230.139: P 17484:18192(708) ack 822 win
1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000994 10.107.96.230.139 > 10.107.96.7.32801: . ack 18192 win 65535
<nop,nop,timestamp 1709240658 534449> (DF)
0
And then other cases where the ACK seems to take a rather long time to arrive,
seems to correlate a bit with slowly increasing numbers of segments before the
ACK is sent, and something along the lines of a 200 millisecond delayed ACK timer.
In some cases at least if the sender does not completely fill cwnd the ACKs will
be delayed. And IIRC under 2.6.10 with TSO enabled, the sender does not always
fill cwnd.
hth,
rick jones
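
[Editorial sketch] The long ACK gaps pointed out above can be picked out
mechanically. This is a minimal sketch in C, assuming the trace has already
been reduced to one record per packet holding the tcpdump -ttt delta (in
microseconds) and the direction. struct trace_pkt and count_delayed_acks()
are hypothetical names used only for illustration; they are not part of
tcpdump or any other tool.

```c
#include <stddef.h>

/* One packet of a "cooked" trace (hypothetical layout for this sketch). */
struct trace_pkt {
    unsigned long delta_us;  /* time since the previous packet, as printed by tcpdump -ttt */
    int from_receiver;       /* nonzero if the packet was sent by the receiving host */
};

/* Count packets from the receiver that showed up more than threshold_us
 * after the previous packet in the trace. */
size_t count_delayed_acks(const struct trace_pkt *pkts, size_t n,
                          unsigned long threshold_us)
{
    size_t i, hits = 0;

    for (i = 0; i < n; i++) {
        if (pkts[i].from_receiver && pkts[i].delta_us > threshold_us)
            hits++;
    }
    return hits;
}
```

Fed the 16 packets of the excerpt above with a 50000 us threshold, this counts
the two ACKs that arrived after 074685 and 200024 us.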
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 21:55 Hubert Tonneau
2005-02-11 22:54 ` Rick Jones
@ 2005-02-11 23:04 ` Stephen Hemminger
2005-02-12 1:07 ` David S. Miller
2005-02-15 23:23 ` David S. Miller
1 sibling, 2 replies; 40+ messages in thread
From: Stephen Hemminger @ 2005-02-11 23:04 UTC (permalink / raw)
To: Hubert Tonneau
Cc: David S. Miller, romieu, kuznet, Nivedita Singhvi, Rick Jones,
netdev
On Fri, 11 Feb 2005 21:55:49 GMT
Hubert Tonneau <hubert.tonneau@fullpliant.org> wrote:
> Sorry, it still does not work, unless I made a mistake:
> Linux 2.6.9 takes 15 seconds to copy 105 MB to Mac OSX
> Linux 2.6.10 with the TCP patch below still takes 325 seconds to do the same.
>
> You can pick the new tcpdump report, created through:
> tcpdump -i eth1 ip host 10.107.96.230 -w /tmp/dump-2.6.10-tcp2
> at http://fullpliant.org/pliant/browse/file/archive/dump-2.6.10-tcp2.gz
Still not setting Push sufficiently to keep MacOSX happy.
13:40:35.027124 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 924:975(51) ack 67344 win 50728
13:40:35.027186 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 67344 win 65535
13:40:35.027328 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 975:1026(51) ack 67344 win 65535
13:40:35.027363 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 67344:68792(1448) ack 1026 win 1460
13:40:35.027367 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 68792:70240(1448) ack 1026 win 1460
13:40:35.027370 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 70240:71688(1448) ack 1026 win 1460
13:40:35.027373 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 71688:73136(1448) ack 1026 win 1460
13:40:35.027375 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 73136:74584(1448) ack 1026 win 1460
13:40:35.027378 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 74584:76032(1448) ack 1026 win 1460
13:40:35.027381 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 76032:77480(1448) ack 1026 win 1460
13:40:35.027384 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 77480:78928(1448) ack 1026 win 1460
13:40:35.027387 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 78928:80376(1448) ack 1026 win 1460
13:40:35.027390 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 80376:81824(1448) ack 1026 win 1460
13:40:35.027393 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 81824:83272(1448) ack 1026 win 1460
13:40:35.027397 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: P 83272:83980(708) ack 1026 win 1460
okay burst with push
13:40:35.034930 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 1179:1230(51) ack 133132 win 65535
13:40:35.035304 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 133132:134580(1448) ack 1230 win 1460
13:40:35.035312 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 134580:136028(1448) ack 1230 win 1460
Big gap... because of missing P
13:40:35.219175 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 136028 win 63716
13:40:35.219193 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 136028:137476(1448) ack 1230 win 1460
13:40:35.219197 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 137476:138924(1448) ack 1230 win 1460
13:40:35.419193 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 138924 win 60820
13:40:35.419202 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 138924:140372(1448) ack 1230 win 1460
13:40:35.419205 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 140372:141820(1448) ack 1230 win 1460
13:40:35.419207 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 141820:143268(1448) ack 1230 win 1460
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 22:54 ` Rick Jones
@ 2005-02-11 23:09 ` Nivedita Singhvi
2005-02-11 23:40 ` Rick Jones
2005-02-12 1:08 ` David S. Miller
2005-02-12 1:09 ` David S. Miller
1 sibling, 2 replies; 40+ messages in thread
From: Nivedita Singhvi @ 2005-02-11 23:09 UTC (permalink / raw)
To: Rick Jones
Cc: Hubert Tonneau, David S. Miller, shemminger, romieu, kuznet,
netdev
Rick Jones wrote:
> 000010 10.107.96.7.32801 > 10.107.96.230.139: . 11692:13140(1448) ack
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000004 10.107.96.7.32801 > 10.107.96.230.139: . 13140:14588(1448) ack
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000002 10.107.96.7.32801 > 10.107.96.230.139: P 14588:16036(1448) ack
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000022 10.107.96.7.32801 > 10.107.96.230.139: . 16036:17484(1448) ack
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000004 10.107.96.7.32801 > 10.107.96.230.139: P 17484:18192(708) ack 822
> win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000994 10.107.96.230.139 > 10.107.96.7.32801: . ack 18192 win 65535
> <nop,nop,timestamp 1709240658 534449> (DF)
> 0
>
> And then other cases where the ACK seems to take a rather long time to
> arrive, seems to correlate a bit with slowly increasing numbers of
> segments before the ACK is sent, and something along the lines of a 200
> millisecond delayed ACK timer.
>
> In some cases at least if the sender does not completely fill cwnd the
> ACKs will be delayed. And IIRC under 2.6.10 with TSO enabled, the
> sender does not always fill cwnd.
Er, how is this compliant with 2581 (yes, I know, it's only
a SHOULD, not a MUST) - an ACK should be generated for at
least every second full-sized segment received? Don't see
that happening. In many cases it's receiving quite a few
more packets. It should not be waiting for the delayed
ack timer to go off, surely?
thanks,
Nivedita
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 23:09 ` Nivedita Singhvi
@ 2005-02-11 23:40 ` Rick Jones
2005-02-12 1:08 ` David S. Miller
1 sibling, 0 replies; 40+ messages in thread
From: Rick Jones @ 2005-02-11 23:40 UTC (permalink / raw)
To: netdev; +Cc: Hubert Tonneau, shemminger, romieu, kuznet
> Er, how is this compliant with 2581 (yes, I know, it's only a SHOULD, not a
> MUST) - an ACK should be generated for at least every second full-sized
> segment received? Don't see that happening. In many cases it's receiving
> quite a few more packets. It should not be waiting for the delayed ack timer
> to go off, surely?
Certainly it would make for an interesting discussion. Indeed it is a
SHOULD which leaves open the door to compliance of other ACK policies. Those
might result in an ACK for more than two segments, or even an ACK for fewer than
two segments, and there are folks in either camp/faction/sect/pick your favorite
term.
I would say that it is still compliant with 2581. The must there is that no
matter what, an ACK must be generated within 500 milliseconds.
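
[Editorial sketch] The two RFC 2581 rules under discussion (ACK at least every
second full-sized segment is a SHOULD, ACK within 500 ms is a MUST) can be
written as one receiver-side decision. This is a simplified model in C, not
code from any real stack; struct rx_state and should_ack_now() are
hypothetical names.

```c
#include <stdbool.h>

#define ACK_MAX_DELAY_MS 500u  /* RFC 2581: an ACK MUST be generated within 500 ms */

/* Receiver bookkeeping for this sketch (hypothetical layout). */
struct rx_state {
    unsigned int unacked_full_segs;  /* full-sized segments received but not yet ACKed */
    unsigned int ms_since_oldest;    /* age of the oldest unACKed segment, in ms */
};

bool should_ack_now(const struct rx_state *rx)
{
    /* SHOULD: ACK at least every second full-sized segment. */
    if (rx->unacked_full_segs >= 2)
        return true;
    /* MUST: never hold an ACK longer than 500 ms. */
    if (rx->unacked_full_segs > 0 && rx->ms_since_oldest >= ACK_MAX_DELAY_MS)
        return true;
    return false;
}
```

A stack that drops the first test and only honours the timer is still within
the letter of the RFC, which is the point being argued here.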
I suspect that had a full cwnd's worth of data been sent there would have been
no lengthy delay in ACKs even with fewer than ACK-every-other. I suspect that
had TSO been disabled the full cwnd would have been sent and these delayed ACKs
would not have happened and the transfer speed would have been happiness and joy.
FWIW, as the industry has added features such as CKO (ChecKsum Offload),
copy-avoidance, and now TSO, the pie chart of time spent has been shifting more
and more to ACK processing. If we go back far enough, the writeups talk about
how delayed ACK to increase ACK piggybacking was added in the first place -
specifically (IIRC) for the purpose of minimizing ACK overhead.
rick jones
BTW, I'd be happy to trim emails that are already on netdev to avoid message
duplications, is netdev a "closed" list?
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 23:04 ` Stephen Hemminger
@ 2005-02-12 1:07 ` David S. Miller
2005-02-12 12:11 ` Andi Kleen
2005-02-12 14:16 ` Alexey Kuznetsov
2005-02-15 23:23 ` David S. Miller
1 sibling, 2 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12 1:07 UTC (permalink / raw)
To: Stephen Hemminger
Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev
On Fri, 11 Feb 2005 15:04:20 -0800
Stephen Hemminger <shemminger@osdl.org> wrote:
> Still not setting Push sufficiently to keep MacOSX happy.
I don't think it's the kernel's fault in this case.
This set of data frames you quoted are all full, and
are tightly interspaced. It looks exactly like a TSO
frame, which we certainly set PSH on, but the TSO
engine is dropping it apparently.
I guess this is e1000. Any e1000 internals experts reading
here who can comment on how e1000's TSO engine treats the
PSH flag?
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 23:09 ` Nivedita Singhvi
2005-02-11 23:40 ` Rick Jones
@ 2005-02-12 1:08 ` David S. Miller
1 sibling, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12 1:08 UTC (permalink / raw)
To: Nivedita Singhvi
Cc: rick.jones2, hubert.tonneau, shemminger, romieu, kuznet, netdev
On Fri, 11 Feb 2005 15:09:11 -0800
Nivedita Singhvi <niv@us.ibm.com> wrote:
> Er, how is this compliant with 2581 (yes, I know, it's only
> a SHOULD, not a MUST) - an ACK should be generated for at
> least every second full-sized segment received?
It's compliant but stupid.
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 22:54 ` Rick Jones
2005-02-11 23:09 ` Nivedita Singhvi
@ 2005-02-12 1:09 ` David S. Miller
2005-02-12 14:31 ` Alexey Kuznetsov
1 sibling, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-12 1:09 UTC (permalink / raw)
To: Rick Jones; +Cc: hubert.tonneau, shemminger, romieu, kuznet, netdev
On Fri, 11 Feb 2005 14:54:27 -0800
Rick Jones <rick.jones2@hp.com> wrote:
> In some cases at least if the sender does not completely fill cwnd the
> ACKs will be delayed. And IIRC under 2.6.10 with TSO enabled, the
> sender does not always fill cwnd.
At a maximum, "1/tcp_tso_win_divisor" of the cwnd will ever be left
empty.
By default, this is 1/8 of the cwnd.
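
[Editorial sketch] As plain arithmetic, the bound stated above looks like
this. The function name max_unfilled_cwnd() is hypothetical, the cwnd is
counted in segments, and integer division is an assumption of this sketch;
the kernel's actual rounding may differ.

```c
#include <assert.h>

/* Upper bound on the part of the congestion window (in segments) that the
 * TSO sender may leave unused: cwnd / tcp_tso_win_divisor. */
unsigned int max_unfilled_cwnd(unsigned int cwnd_segs, unsigned int tso_win_divisor)
{
    assert(tso_win_divisor > 0);  /* this sketch only handles a nonzero divisor */
    return cwnd_segs / tso_win_divisor;
}
```

With the default divisor of 8 mentioned above, a cwnd of 16 segments leaves at
most 2 segments unsent.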
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 1:07 ` David S. Miller
@ 2005-02-12 12:11 ` Andi Kleen
2005-02-12 19:23 ` David S. Miller
2005-02-12 14:16 ` Alexey Kuznetsov
1 sibling, 1 reply; 40+ messages in thread
From: Andi Kleen @ 2005-02-12 12:11 UTC (permalink / raw)
To: David S. Miller; +Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev
"David S. Miller" <davem@davemloft.net> writes:
>
> I guess this is e1000. Any e1000 internals experts reading
> here who can comment on how e1000's TSO engine treats the
> PSH flag?
If that is the problem it should be easy to test for. Just
disable TSO with ethtool -K ethX tso off
Hubert, does that make the problem go away?
-Andi
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 1:07 ` David S. Miller
2005-02-12 12:11 ` Andi Kleen
@ 2005-02-12 14:16 ` Alexey Kuznetsov
2005-02-12 19:41 ` David S. Miller
1 sibling, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 14:16 UTC (permalink / raw)
To: David S. Miller
Cc: Stephen Hemminger, hubert.tonneau, romieu, kuznet, niv,
rick.jones2, netdev
Hello!
> This set of data frames you quoted are all full, and
> are tightly interspaced. It looks exactly like a TSO
> frame, which we certainly set PSH on, but the TSO
> engine is dropping it apparently.
>
> I guess this is e1000. Any e1000 internals experts reading
> here who can comment on how e1000's TSO engine treats the
> PSH flag?
Or it was two one-segment frames.
Before blaming e1000 it would be easier to confirm that
linux never worked with MacOS X, except for those kernels which
had congestion avoidance mostly suppressed.
I.e. let's disable TSO in 2.6.9 and look.
Alexey
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 1:09 ` David S. Miller
@ 2005-02-12 14:31 ` Alexey Kuznetsov
2005-02-12 19:28 ` David S. Miller
2005-02-12 20:19 ` rick jones
0 siblings, 2 replies; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 14:31 UTC (permalink / raw)
To: David S. Miller
Cc: Rick Jones, hubert.tonneau, shemminger, romieu, kuznet, netdev
Hello!
> > In some cases at least if the sender does not completely fill cwnd the
> > ACKs will be delayed. And IIRC under 2.6.10 with TSO enabled, the
> > sender does not always fill cwnd.
>
> At a maximum, "1/tcp_tso_win_divisor" of the cwnd will ever be left
> empty.
>
> By default, this is 1/8 of the cwnd.
In any case, receiver cannot know sender cwnd, so that "fill" or "not fill"
is not a question.
What is broken in that implementation is that it does not feel slow start.
ACK avoidance during slow start is certain disaster. Current theory is that
MacOS X thinks that we do not do slow start.
Alexey
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 12:11 ` Andi Kleen
@ 2005-02-12 19:23 ` David S. Miller
2005-02-12 21:30 ` Andi Kleen
0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-12 19:23 UTC (permalink / raw)
To: Andi Kleen; +Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev
On Sat, 12 Feb 2005 13:11:43 +0100
Andi Kleen <ak@muc.de> wrote:
> "David S. Miller" <davem@davemloft.net> writes:
> >
> > I guess this is e1000. Any e1000 internals experts reading
> > here who can comment on how e1000's TSO engine treats the
> > PSH flag?
>
> If that is the problem it should be easy to test for. Just
> disable TSO with ethtool -K ethX tso off
>
> Hubert, does that make the problem go away?
We're testing the new code that sets PSH on every TSO frame.
If we disable TSO, the new code won't be exercised nor tested.
:-)
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 14:31 ` Alexey Kuznetsov
@ 2005-02-12 19:28 ` David S. Miller
2005-02-12 19:44 ` Leonid Grossman
2005-02-12 19:52 ` Alexey Kuznetsov
2005-02-12 20:19 ` rick jones
1 sibling, 2 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12 19:28 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: rick.jones2, hubert.tonneau, shemminger, romieu, kuznet, netdev
On Sat, 12 Feb 2005 17:31:05 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:
> In any case, receiver cannot know sender cwnd, so that "fill" or "not fill"
> is not a question.
>
> What is broken in that implementation is that it does not feel slow start.
> ACK avoidance during slow start is certain disaster. Current theory is that
> MacOS X thinks that we do not do slow start.
It is correct. Although, I am still believing that setting PSH
is the avenue of investigation.
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 14:16 ` Alexey Kuznetsov
@ 2005-02-12 19:41 ` David S. Miller
2005-02-12 20:03 ` Alexey Kuznetsov
0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-12 19:41 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: shemminger, hubert.tonneau, romieu, kuznet, niv, rick.jones2,
netdev
On Sat, 12 Feb 2005 17:16:41 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:
> > This set of data frames you quoted are all full, and
> > are tightly interspaced. It looks exactly like a TSO
> > frame, which we certainly set PSH on, but the TSO
> > engine is dropping it apparently.
...
> Or it was two one-segment frames.
Even ignoring my TSO changes, we should be seeing at a minimum
1/2 window PSH settings which we're not as far as I can tell.
(this is due to the forced_push() test in net/ipv4/tcp.c)
This also points out a bug in my TSO PSH patch, I should be
updating tp->pushed_seq shouldn't I? Question is, what to
set it to? I think correct value is TCP_SKB_CB(skb)->end_seq.
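
[Editorial sketch] The half-window rule referred to above can be modelled in
userspace as follows. This mirrors the shape of the forced_push() test as I
understand it; struct push_state and seq_after() are simplified stand-ins
written for this sketch, not the kernel's own definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "seq1 is after seq2" for 32-bit TCP sequence numbers. */
bool seq_after(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq2 - seq1) < 0;
}

/* Subset of sender state used by the check (hypothetical layout). */
struct push_state {
    uint32_t write_seq;   /* end of the data queued by the application */
    uint32_t pushed_seq;  /* last sequence number on which PSH was set */
    uint32_t max_window;  /* largest window the peer has advertised */
};

/* Force PSH once more than half of the peer's largest window has been
 * queued since the last pushed sequence number. */
bool forced_push(const struct push_state *ps)
{
    return seq_after(ps->write_seq, ps->pushed_seq + (ps->max_window >> 1));
}
```

Under this model, a sender that sets PSH by some other route without
advancing pushed_seq leaves the half-window check measuring from an old
position, which is the bookkeeping question raised above.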
> I.e. let's disable TSO in 2.6.9 and look.
I believe this experiment had been performed already. Stephen,
isn't that the case?
* RE: 2.6.10 TCP troubles -- suggested patch
2005-02-12 19:28 ` David S. Miller
@ 2005-02-12 19:44 ` Leonid Grossman
2005-02-12 19:52 ` Alexey Kuznetsov
1 sibling, 0 replies; 40+ messages in thread
From: Leonid Grossman @ 2005-02-12 19:44 UTC (permalink / raw)
To: 'David S. Miller', 'Alexey Kuznetsov'
Cc: rick.jones2, hubert.tonneau, shemminger, romieu, kuznet, netdev
Typically, a TSO engine sets PSH in the last packet that it builds for the
TSO+PSH request.
Leonid
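
[Editorial sketch] The behaviour described above, written as a small model in
C: the PSH flag of the large frame is carried only by the last packet the
engine builds. tso_split_psh() is a hypothetical helper for illustration and
does not describe any particular NIC.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fill seg_psh[0..nsegs-1] with the PSH flag each on-wire packet would carry
 * when a TSO frame with flag frame_psh is split into nsegs packets. */
void tso_split_psh(bool frame_psh, size_t nsegs, bool *seg_psh)
{
    size_t i;

    for (i = 0; i < nsegs; i++)
        seg_psh[i] = frame_psh && (i + 1 == nsegs);
}
```

In this model a run of full-sized segments with no P on the last one implies
the frame handed to the engine had no PSH set, or was not one TSO frame at all.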
> -----Original Message-----
> From: netdev-bounce@oss.sgi.com
> [mailto:netdev-bounce@oss.sgi.com] On Behalf Of David S. Miller
> Sent: Saturday, February 12, 2005 11:28 AM
> To: Alexey Kuznetsov
> Cc: rick.jones2@hp.com; hubert.tonneau@fullpliant.org;
> shemminger@osdl.org; romieu@fr.zoreil.com;
> kuznet@ms2.inr.ac.ru; netdev@oss.sgi.com
> Subject: Re: 2.6.10 TCP troubles -- suggested patch
>
> On Sat, 12 Feb 2005 17:31:05 +0300
> Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:
>
> > In any case, receiver cannot know sender cwnd, so that
> "fill" or "not fill"
> > is not a question.
> >
> > What is broken in that implementation is that it does not
> feel slow start.
> > ACK avoidance while slow start is certain disaster.
> Current theory is
> > that MacOS X thinks that we do not do slow start.
>
> It is correct. Although, I am still believing that setting
> PSH is the avenue of investigation.
>
>
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 19:28 ` David S. Miller
2005-02-12 19:44 ` Leonid Grossman
@ 2005-02-12 19:52 ` Alexey Kuznetsov
2005-02-15 23:25 ` David S. Miller
1 sibling, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 19:52 UTC (permalink / raw)
To: David S. Miller
Cc: Alexey Kuznetsov, rick.jones2, hubert.tonneau, shemminger, romieu,
netdev
Hello!
> It is correct. Although, I am still believing that setting PSH
> is the avenue of investigation.
Exactly. That's why the next test should be with disabled TSO in 2.6.9.
If too rare PSHs were a problem, it will show as the same disaster there.
[ And, to be honest, in this case, I daresay MacOS X may be left with its bugs
alone. Or we could help it with something like setting PSH when we are in slow
start and each half of CWND after completion of slow start. Or just set
PSH on each frame. ]
Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 19:41 ` David S. Miller
@ 2005-02-12 20:03 ` Alexey Kuznetsov
2005-02-15 23:26 ` David S. Miller
0 siblings, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 20:03 UTC (permalink / raw)
To: David S. Miller
Cc: Alexey Kuznetsov, shemminger, hubert.tonneau, romieu, niv,
rick.jones2, netdev
Hello!
> set it to? I think correct value is TCP_SKB_CB(skb)->end_seq.
Yup. But it does not matter. When it is not advanced, it does not make
PSHs more rare.
Actually, that anti-MacOS never worked well. If segment with forced PSH
was not transmitted in time, even forced PSHs could be deleted.
Your patch with setting PSH right before (or in) tcp_transmit_skb() must
work. Unless these segments are not tso.
> > I.e. let's disable TSO in 2.6.9 and look.
>
> I believe this experiment had been performed already.
I saw only tests with TSO. And 2.6.9 showed exactly the same weird
behaviour. Only 2.6.9 did not slow start and we had only a few of 200msec
gaps.
Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 14:31 ` Alexey Kuznetsov
2005-02-12 19:28 ` David S. Miller
@ 2005-02-12 20:19 ` rick jones
2005-02-12 20:28 ` David S. Miller
2005-02-12 20:56 ` Alexey Kuznetsov
1 sibling, 2 replies; 40+ messages in thread
From: rick jones @ 2005-02-12 20:19 UTC (permalink / raw)
To: Alexey Kuznetsov; +Cc: netdev, romieu, hubert.tonneau, shemminger
On Feb 12, 2005, at 6:31 AM, Alexey Kuznetsov wrote:
> Hello!
>
>>> In some cases at least if the sender does not completely fill cwnd
>>> the
>>> ACKs will be delayed. And IIRC under 2.6.10 with TSO enabled, the
>>> sender does not always fill cwnd.
>>
>> At a maximum, "1/tcp_tso_win_divisor" of the cwnd will ever be left
>> empty.
>>
>> By default, this is 1/8 of the cwnd.
>
> In any case, receiver cannot know sender cwnd, so that "fill" or "not
> fill"
> is not a question.
How is that? Isn't cwnd based on the ACKs the sender receives from the
receiver?
> What is broken in that implementation is that it does not feel slow
> start.
> ACK avoidance while slow start is certain disaster. Current theory is
> that
> MacOS X thinks that we do not do slow start.
Actually, it may think slow start is being done - there was enough
small packet back and forth on the connection before the "heavy
transfer" to get cwnd opened - I just didn't quote that in the "cooked"
output. All the stacks with ACK avoidance with which I am familiar do
not make the assumption that the sender is not doing slow-start. They
make sure to send enough ACKs at the beginning (or after packet loss)
to allow the sender's cwnd to grow.
rick jones
wisdom teeth are impacted, people are affected by the effects of events
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 20:19 ` rick jones
@ 2005-02-12 20:28 ` David S. Miller
2005-02-12 20:56 ` Alexey Kuznetsov
1 sibling, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12 20:28 UTC (permalink / raw)
To: rick jones; +Cc: kuznet, netdev, romieu, hubert.tonneau, shemminger
On Sat, 12 Feb 2005 12:19:35 -0800
rick jones <rick.jones2@hp.com> wrote:
> How is that? Isn't cwnd based on the ACKs the sender receives from the
> receiver?
ACKs go from sender to receiver, first of all.
It is based upon congestion as seen "by receiver", something which is
impossible for sender.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 20:19 ` rick jones
2005-02-12 20:28 ` David S. Miller
@ 2005-02-12 20:56 ` Alexey Kuznetsov
2005-02-12 21:27 ` Nivedita Singhvi
2005-02-12 21:43 ` rick jones
1 sibling, 2 replies; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 20:56 UTC (permalink / raw)
To: rick jones; +Cc: Alexey Kuznetsov, netdev, romieu, hubert.tonneau, shemminger
Hello!
> Actually, it may think slow start is being done - there was enough
> small packet back and forth on the connection before the "heavy
> transfer" to get cwnd opened
If receiver sent an ACK it still does not mean that sender used it
to increase its cwnd. Particularly, small packet exchange definitely
does not inflate cwnd.
> output. All the stacks with ACK avoidance with which I am familiar do
> not make the assumption that the sender is not doing slow-start. They
> make sure to send enough ACKs at the beginning (or after packet loss)
> to allow the sender's cwnd to grow.
Well, we do a similar thing with delayed ACKs. And it took a few runs
of testing to understand that we cannot detect even packet loss reliably
enough. :-)
Actually, those receivers could use the first delayed ACK event as
a sign of failure of their heuristics and block stretching acks for
this connection.
Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 20:56 ` Alexey Kuznetsov
@ 2005-02-12 21:27 ` Nivedita Singhvi
2005-02-12 21:43 ` rick jones
1 sibling, 0 replies; 40+ messages in thread
From: Nivedita Singhvi @ 2005-02-12 21:27 UTC (permalink / raw)
To: Alexey Kuznetsov; +Cc: rick jones, netdev, romieu, hubert.tonneau, shemminger
Alexey Kuznetsov wrote:
> If receiver sent an ACK it still does not mean that sender used it
> to increase its cwnd. Particularly, small packet exchange definitely
> does not inflate cwnd.
The simplest way to go about this is to compare it to the
trace of the "good/fast" connection - Hubert, could you
provide the "good" trace as well? That would show where
the differences in time are taken up.
thanks,
Nivedita
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 19:23 ` David S. Miller
@ 2005-02-12 21:30 ` Andi Kleen
0 siblings, 0 replies; 40+ messages in thread
From: Andi Kleen @ 2005-02-12 21:30 UTC (permalink / raw)
To: David S. Miller; +Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev
> We're testing the new code that sets PSH on every TSO frame.
> If we disable TSO, the new code won't be exercised nor tested.
> :-)
Sorry, I read the thread out of order (shouldn't do that). Ignore my mail.
-Andi
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 20:56 ` Alexey Kuznetsov
2005-02-12 21:27 ` Nivedita Singhvi
@ 2005-02-12 21:43 ` rick jones
2005-02-12 22:00 ` Alexey Kuznetsov
1 sibling, 1 reply; 40+ messages in thread
From: rick jones @ 2005-02-12 21:43 UTC (permalink / raw)
To: Alexey Kuznetsov; +Cc: netdev, romieu, hubert.tonneau, shemminger
> If receiver sent an ACK it still does not mean that sender used it
> to increase its cwnd. Particularly, small packet exchange definitely
> does not inflate cwnd.
Is that in general, or in Linux?
>> output. All the stacks with ACK avoidance with which I am familiar do
>> not make the assumption that the sender is not doing slow-start. They
>> make sure to send enough ACKs at the beginning (or after packet loss)
>> to allow the sender's cwnd to grow.
>
> Well, we do a similar thing with delayed ACKs. And it took a few runs
> of testing to understand that we cannot detect even packet loss
> reliably
> enough. :-)
I never claimed it was easy :)
> Actually, those receivers could use the first delayed ACK event as
> a sign of failure of their heuristics and block stretching acks for
> this connection.
The ones with which I am familiar do - after N delayed ACK events where
N is something other than one though. And they still send immediate
ACKs to the senders upon out of order data and all that.
rick jones
Wisdom teeth are impacted, people are affected by the effects of events
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 21:43 ` rick jones
@ 2005-02-12 22:00 ` Alexey Kuznetsov
2005-02-13 1:29 ` rick jones
0 siblings, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 22:00 UTC (permalink / raw)
To: rick jones; +Cc: Alexey Kuznetsov, netdev, romieu, hubert.tonneau, shemminger
Hello!
> Is that in general, or in Linux?
Any which follows some of congestion window validation recommendations.
Even canonical bsd restarts slow start after rtt.
> N is something other than one though.
Well, 1 is quite enough to be sure that something is very wrong.
You see a proof here.
Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 22:00 ` Alexey Kuznetsov
@ 2005-02-13 1:29 ` rick jones
0 siblings, 0 replies; 40+ messages in thread
From: rick jones @ 2005-02-13 1:29 UTC (permalink / raw)
To: netdev; +Cc: romieu, hubert.tonneau, shemminger
On Feb 12, 2005, at 2:00 PM, Alexey Kuznetsov wrote:
> Any which follows some of congestion window validation recommendations.
If you could point me at the chapter and verse that would be great.
> Even canonical bsd restarts slow start after rtt.
Did we have >= one RTT of idle in the packet trace?
>> N is something other than one though.
>
> Well, 1 is quite enough to be sure that something is very wrong.
> You see a proof here.
The debate of course is what :)
In and of _itself_, a delayed ACK does not guarantee something is very
wrong. For example, in a request/response situation, the response may
take longer than the delayed ACK interval to generate. Or, if it was
not request/response, the sender may simply not have had any more to
send at that point.
Going back to the quantity of cwnd which may be left unused when TSO is
enabled: if the sender does not take full advantage of the cwnd when TSO
is enabled, doesn't that mean that to deal with the same
bandwidth-delay product, one needs a larger TCP window when TSO is
enabled than when it is not? In the default case of
tcp_tso_win_divisor being 8 that would be another 12.5%, right?
rick jones
there is no rest for the wicked, yet the virtuous have no pillows
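The arithmetic behind the question above can be checked with a short sketch in C. It assumes the worst case, in which the full 1/tcp_tso_win_divisor share of cwnd is always left unused; the function names are hypothetical and exist only for this illustration. Under that assumption the unused share is 12.5% for a divisor of 8, but the window would have to grow by 1/7 (about 14.3%) to carry the same amount of data in flight, because only 7/8 of the larger window is usable.

```c
/* Hypothetical helpers, not kernel code. `divisor` stands in for
 * tcp_tso_win_divisor and must be greater than 1. */

/* Worst-case share of cwnd that may be left unused. */
static double unused_fraction(unsigned divisor)
{
    return 1.0 / divisor;
}

/* Fractional window growth needed so that the usable part of the
 * window, (1 - 1/divisor) of it, equals the original window. */
static double extra_window_fraction(unsigned divisor)
{
    double usable = 1.0 - unused_fraction(divisor);
    return 1.0 / usable - 1.0;
}
```

Calling extra_window_fraction(8) gives roughly 0.143, while unused_fraction(8) gives 0.125. Since 1/8 was described earlier in the thread as a maximum, the real overhead may well be smaller.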
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-13 10:52 Hubert Tonneau
2005-02-14 14:12 ` Alexey Kuznetsov
0 siblings, 1 reply; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-13 10:52 UTC (permalink / raw)
To: Alexey Kuznetsov, David S. Miller
Cc: Alexey Kuznetsov, rick.jones2, shemminger, romieu, netdev
Alexey Kuznetsov wrote:
>
> Exactly. That's why the next test should be with disabled TSO in 2.6.9.
> If too rare PSHs were a problem, it will show as the same disaster there.
After,
ethtool -K eth1 tso off
the result is unchanged on 2.6.9 (14 seconds for 105 MB).
After,
ethtool -K eth1 tso off
the result is also unchanged on 2.6.10-ac11 with no extra TCP patch (325 seconds).
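For scale, the two timings above can be turned into average throughput with a one-line helper. This is only a sketch: the function name is hypothetical, and it takes the 105 MB figure at face value.

```c
/* Hypothetical helper: average throughput in MB/s for a transfer of
 * `mbytes` megabytes that took `seconds` seconds. */
static double throughput_mb_per_s(double mbytes, double seconds)
{
    return mbytes / seconds;
}
```

With the reported numbers, throughput_mb_per_s(105.0, 14.0) is 7.5 MB/s on 2.6.9 and throughput_mb_per_s(105.0, 325.0) is about 0.32 MB/s on 2.6.10-ac11, a slowdown of roughly 23 times.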
Settings for eth1:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: umbg
Wake-on: g
Current message level: 0x00000007 (7)
Link detected: yes
PS:
Sorry for the long delay in running the tests. The reason is that
it's a production server, so I cannot run tests in the middle of the day.
It's also remote, so in order to switch the kernel, I have to upload the new one,
and then upload the old one again to switch back, and the best connection
I have these days is a 30 Kbps modem connection. It will improve on Monday since
I'll have a 128 Kbps ADSL connection.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-13 10:52 Hubert Tonneau
@ 2005-02-14 14:12 ` Alexey Kuznetsov
0 siblings, 0 replies; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-14 14:12 UTC (permalink / raw)
To: Hubert Tonneau
Cc: Alexey Kuznetsov, David S. Miller, rick.jones2, shemminger,
romieu, netdev
Hello!
> ethtool -K eth1 tso off
> the result is unchanged on 2.6.9 (14 seconds for 105 MB).
>
> After,
> ethtool -K eth1 tso off
> the result is also unchanged on 2.6.10-ac11 with no extra TCP patch (325 seconds).
Well, it means the theory was wrong... tso is innocent. To make a new
theory we need a tcpdump of 2.6.10 with disabled tso.
> it's a production server,
I hope we can stay in its normal configuration now. TSO may be kept disabled.
Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 23:04 ` Stephen Hemminger
2005-02-12 1:07 ` David S. Miller
@ 2005-02-15 23:23 ` David S. Miller
2005-02-16 9:13 ` Alexey Kuznetsov
1 sibling, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-15 23:23 UTC (permalink / raw)
To: Stephen Hemminger
Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev
On Fri, 11 Feb 2005 15:04:20 -0800
Stephen Hemminger <shemminger@osdl.org> wrote:
> Still not setting Push sufficiently to keep MacOSX happy.
...
> 13:40:35.034930 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 1179:1230(51) ack 133132 win 65535
> 13:40:35.035304 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 133132:134580(1448) ack 1230 win 1460
> 13:40:35.035312 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 134580:136028(1448) ack 1230 win 1460
>
> Big gap... because of missing P
>
> 13:40:35.219175 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 136028 win 63716
I am starting to understand Darwin's logic. If header prediction fast path
is hit, ACK is always delayed when delack sysctl is enabled.
One way to miss fast path is for PSH to be set.
This will make ACK not get delayed if ACK is pending already.
At least that is how it looks, and it makes sense given this trace.
How mind boggling a heuristic. I bet it works by accident rather
than intention and purposeful design.
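The receiver behaviour inferred above can be written down as a small model in C. This is a hypothetical sketch of the heuristic as described in this message, not Darwin's code: the names rx_state and on_in_order_segment are made up, and it assumes that an in-order segment stays on the fast path unless PSH is set.

```c
#include <stdbool.h>

/* Hypothetical model of the receiver heuristic described in the thread. */
enum ack_action { ACK_DELAYED, ACK_NOW };

struct rx_state {
    bool delack_enabled; /* stands in for the delack sysctl */
    bool ack_pending;    /* a delayed ACK is already outstanding */
};

/* Decide what to do for one in-order data segment. */
static enum ack_action on_in_order_segment(struct rx_state *st, bool psh)
{
    /* Assumption: PSH is what knocks the segment off the fast path. */
    bool fast_path = !psh;

    if (!st->delack_enabled)
        return ACK_NOW;

    if (fast_path) {
        /* Fast path: the ACK is always delayed, however many
         * segments are already waiting for one. */
        st->ack_pending = true;
        return ACK_DELAYED;
    }

    /* Slow path: an ACK that is already pending goes out at once. */
    if (st->ack_pending) {
        st->ack_pending = false;
        return ACK_NOW;
    }
    st->ack_pending = true;
    return ACK_DELAYED;
}
```

In this model a run of full-sized segments without PSH never triggers an ACK by itself, so the sender waits for the delack timer, which would match the gap of roughly 180 ms in the trace. A single segment carrying PSH after such a run releases the ACK immediately.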
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 19:52 ` Alexey Kuznetsov
@ 2005-02-15 23:25 ` David S. Miller
0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-15 23:25 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: kuznet, rick.jones2, hubert.tonneau, shemminger, romieu, netdev
On Sat, 12 Feb 2005 22:52:46 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:
> Exactly. That's why the next test should be with disabled TSO in 2.6.9.
> If too rare PSHs were a problem, it will show as the same disaster there.
>
> [ And, to be honest, in this case, I daresay MacOS X may be left with its bugs
> alone. Or we could help it with something like setting PSH when we are in slow
> start and each half of CWND after completion of slow start. Or just set
> PSH on each frame. ]
Setting it every other frame would fix the problem, just forcing it to
miss header prediction path is what is needed to avoid the silly delayed
ACK behavior. And PSH is one way to do that.
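A sender-side policy along these lines could look like the following sketch in C. It is an illustration only and not a proposed kernel patch: the struct tx_sketch, the function mark_psh, and the stride parameter are hypothetical, and SKETCH_FLAG_PSH simply reuses the PSH bit value from the TCP header.

```c
/* PSH bit as it appears in the TCP header flags byte. */
#define SKETCH_FLAG_PSH 0x08

/* Hypothetical per-connection state for the sketch. */
struct tx_sketch {
    unsigned frames_since_psh; /* frames sent since the last PSH */
};

/* Return `flags` for the next outgoing frame, forcing PSH on at least
 * every `stride`-th frame (stride 2 means every other frame). */
static unsigned char mark_psh(struct tx_sketch *tx, unsigned char flags,
                              unsigned stride)
{
    /* A frame that already carries PSH restarts the count. */
    if (flags & SKETCH_FLAG_PSH) {
        tx->frames_since_psh = 0;
        return flags;
    }

    if (++tx->frames_since_psh >= stride) {
        tx->frames_since_psh = 0;
        flags |= SKETCH_FLAG_PSH;
    }
    return flags;
}
```

With a stride of 2 this sets PSH on every other frame, and with a stride of 1 on every frame. Counting at transmit time, rather than when the segment is queued, follows the earlier observation in the thread that a PSH forced too early can be lost before the segment is actually sent.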
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 20:03 ` Alexey Kuznetsov
@ 2005-02-15 23:26 ` David S. Miller
2005-02-15 23:42 ` Rick Jones
0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-15 23:26 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: kuznet, shemminger, hubert.tonneau, romieu, niv, rick.jones2,
netdev
On Sat, 12 Feb 2005 23:03:18 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:
> Actually, that anti-MacOS never worked well. If segment with forced PSH
> was not transmitted in time, even forced PSHs could be deleted.
> Your patch with setting PSH right before (or in) tcp_transmit_skb() must
> work. Unless these segments are not tso.
Yes, it never did work well. But now we understand more deeply the
nature of this beast, we can probably refine it.
In short, for properly working TCP stream with no drops and no
reordering, Darwin delays ACKs until delack timer fires or PSH
is seen :-)
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-15 23:26 ` David S. Miller
@ 2005-02-15 23:42 ` Rick Jones
0 siblings, 0 replies; 40+ messages in thread
From: Rick Jones @ 2005-02-15 23:42 UTC (permalink / raw)
To: netdev
> In short, for properly working TCP stream with no drops and no
> reordering, Darwin delays ACKs until delack timer fires or PSH
> is seen :-)
As a supporter of ACK avoidance heuristics in general, I will come out and say
that the heuristic above does indeed sound quite broken. It is not the
heuristic with which I am familiar, which has a configurable maximum number of
segments for which to delay the ACK.
rick jones
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-15 23:23 ` David S. Miller
@ 2005-02-16 9:13 ` Alexey Kuznetsov
2005-02-16 17:50 ` David S. Miller
0 siblings, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-16 9:13 UTC (permalink / raw)
To: David S. Miller
Cc: Stephen Hemminger, hubert.tonneau, romieu, kuznet, niv,
rick.jones2, netdev
Hello!
> How mind boggling a heuristic. I bet it works by accident rather
> than intention and purposeful design.
Yup. It is definitely not an "ack avoidance algorithm" :-) :-)
BTW it is still a puzzle why 2.6.9 works. With disabled TSO it should
insert PSHs quite rarely, similarly to tso.
And it is still a puzzle how that bunch of PSHless segments not followed
by PSH appeared in TSO case.
Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-16 9:13 ` Alexey Kuznetsov
@ 2005-02-16 17:50 ` David S. Miller
0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-16 17:50 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: shemminger, hubert.tonneau, romieu, kuznet, niv, rick.jones2,
netdev
[-- Attachment #1: Type: text/plain, Size: 661 bytes --]
On Wed, 16 Feb 2005 12:13:23 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:
> BTW it is still a puzzle why 2.6.9 works. With disabled TSO it should
> insert PSHs quite rarely, similarly to tso.
Yes.
Hubert, do you have netfilter enabled in the 2.6.10 kernel you are running?
I'm asking because the TCP changes in 2.6.10 are pretty benign
(attached for the curious who want to review along), whereas
netfilter had major updates particularly in the TCP connection
tracking code.
I also reviewed 2.6.10-ac11 for anything interesting wrt. TCP and the
only thing in there is the tcp_retrans_try_collapse() missing check
to avoid collapsing TSO segments.
[-- Attachment #2: tcp-2.6.10 --]
[-- Type: application/octet-stream, Size: 35185 bytes --]
diff -Nru a/include/linux/tcp.h b/include/linux/tcp.h
--- a/include/linux/tcp.h 2004-12-24 13:36:49 -08:00
+++ b/include/linux/tcp.h 2004-12-24 13:36:49 -08:00
@@ -186,6 +186,8 @@
__u32 tcpi_rcv_rtt;
__u32 tcpi_rcv_space;
+
+ __u32 tcpi_total_retrans;
};
#ifdef __KERNEL__
@@ -363,6 +365,8 @@
__u8 pending; /* Scheduled timer event */
__u8 urg_mode; /* In urgent mode */
__u32 snd_up; /* Urgent pointer */
+
+ __u32 total_retrans; /* Total retransmits for entire connection */
/* The syn_wait_lock is necessary only to avoid proc interface having
* to grab the main lock sock while browsing the listening hash
diff -Nru a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h 2004-12-24 13:36:18 -08:00
+++ b/include/net/tcp.h 2004-12-24 13:36:18 -08:00
@@ -159,7 +159,6 @@
extern void tcp_bucket_destroy(struct tcp_bind_bucket *tb);
extern void tcp_bucket_unlock(struct sock *sk);
extern int tcp_port_rover;
-extern struct sock *tcp_v4_lookup_listener(u32 addr, unsigned short hnum, int dif);
/* These are AF independent. */
static __inline__ int tcp_bhashfn(__u16 lport)
@@ -362,8 +361,8 @@
#define TCP_IPV6_MATCH(__sk, __saddr, __daddr, __ports, __dif) \
(((*((__u32 *)&(inet_sk(__sk)->dport)))== (__ports)) && \
((__sk)->sk_family == AF_INET6) && \
- !ipv6_addr_cmp(&inet6_sk(__sk)->daddr, (__saddr)) && \
- !ipv6_addr_cmp(&inet6_sk(__sk)->rcv_saddr, (__daddr)) && \
+ ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr)) && \
+ ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
/* These can have wildcards, don't try too hard. */
@@ -961,12 +960,14 @@
extern void tcp_init_xmit_timers(struct sock *);
extern void tcp_clear_xmit_timers(struct sock *);
-extern void tcp_delete_keepalive_timer (struct sock *);
-extern void tcp_reset_keepalive_timer (struct sock *, unsigned long);
+extern void tcp_delete_keepalive_timer(struct sock *);
+extern void tcp_reset_keepalive_timer(struct sock *, unsigned long);
extern unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu);
extern unsigned int tcp_current_mss(struct sock *sk, int large);
-extern const char timer_bug_msg[];
+#ifdef TCP_DEBUG
+extern const char tcp_timer_bug_msg[];
+#endif
/* tcp_diag.c */
extern void tcp_get_info(struct sock *, struct tcp_info *);
@@ -999,7 +1000,9 @@
#endif
break;
default:
- printk(timer_bug_msg);
+#ifdef TCP_DEBUG
+ printk(tcp_timer_bug_msg);
+#endif
return;
};
@@ -1034,7 +1037,9 @@
break;
default:
- printk(timer_bug_msg);
+#ifdef TCP_DEBUG
+ printk(tcp_timer_bug_msg);
+#endif
};
}
@@ -1083,7 +1088,7 @@
* Rcv_nxt can be after the window if our peer push more data
* than the offered window.
*/
-static __inline__ u32 tcp_receive_window(struct tcp_opt *tp)
+static __inline__ u32 tcp_receive_window(const struct tcp_opt *tp)
{
s32 win = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt;
@@ -1161,18 +1166,19 @@
/* Due to TSO, an SKB can be composed of multiple actual
* packets. To keep these tracked properly, we use this.
*/
-static inline int tcp_skb_pcount(struct sk_buff *skb)
+static inline int tcp_skb_pcount(const struct sk_buff *skb)
{
return skb_shinfo(skb)->tso_segs;
}
/* This is valid iff tcp_skb_pcount() > 1. */
-static inline int tcp_skb_mss(struct sk_buff *skb)
+static inline int tcp_skb_mss(const struct sk_buff *skb)
{
return skb_shinfo(skb)->tso_size;
}
-static inline void tcp_inc_pcount(tcp_pcount_t *count, struct sk_buff *skb)
+static inline void tcp_inc_pcount(tcp_pcount_t *count,
+ const struct sk_buff *skb)
{
count->val += tcp_skb_pcount(skb);
}
@@ -1187,13 +1193,14 @@
count->val -= amt;
}
-static inline void tcp_dec_pcount(tcp_pcount_t *count, struct sk_buff *skb)
+static inline void tcp_dec_pcount(tcp_pcount_t *count,
+ const struct sk_buff *skb)
{
count->val -= tcp_skb_pcount(skb);
}
static inline void tcp_dec_pcount_approx(tcp_pcount_t *count,
- struct sk_buff *skb)
+ const struct sk_buff *skb)
{
if (count->val) {
count->val -= tcp_skb_pcount(skb);
@@ -1202,7 +1209,7 @@
}
}
-static inline __u32 tcp_get_pcount(tcp_pcount_t *count)
+static inline __u32 tcp_get_pcount(const tcp_pcount_t *count)
{
return count->val;
}
@@ -1212,8 +1219,9 @@
count->val = val;
}
-static inline void tcp_packets_out_inc(struct sock *sk, struct tcp_opt *tp,
- struct sk_buff *skb)
+static inline void tcp_packets_out_inc(struct sock *sk,
+ struct tcp_opt *tp,
+ const struct sk_buff *skb)
{
int orig = tcp_get_pcount(&tp->packets_out);
@@ -1222,7 +1230,8 @@
tcp_reset_xmit_timer(sk, TCP_TIME_RETRANS, tp->rto);
}
-static inline void tcp_packets_out_dec(struct tcp_opt *tp, struct sk_buff *skb)
+static inline void tcp_packets_out_dec(struct tcp_opt *tp,
+ const struct sk_buff *skb)
{
tcp_dec_pcount(&tp->packets_out, skb);
}
@@ -1241,7 +1250,7 @@
* "Packets left network, but not honestly ACKed yet" PLUS
* "Packets fast retransmitted"
*/
-static __inline__ unsigned int tcp_packets_in_flight(struct tcp_opt *tp)
+static __inline__ unsigned int tcp_packets_in_flight(const struct tcp_opt *tp)
{
return (tcp_get_pcount(&tp->packets_out) -
tcp_get_pcount(&tp->left_out) +
@@ -1408,18 +1417,19 @@
/* Slow start with delack produces 3 packets of burst, so that
* it is safe "de facto".
*/
-static __inline__ __u32 tcp_max_burst(struct tcp_opt *tp)
+static __inline__ __u32 tcp_max_burst(const struct tcp_opt *tp)
{
return 3;
}
-static __inline__ int tcp_minshall_check(struct tcp_opt *tp)
+static __inline__ int tcp_minshall_check(const struct tcp_opt *tp)
{
return after(tp->snd_sml,tp->snd_una) &&
!after(tp->snd_sml, tp->snd_nxt);
}
-static __inline__ void tcp_minshall_update(struct tcp_opt *tp, int mss, struct sk_buff *skb)
+static __inline__ void tcp_minshall_update(struct tcp_opt *tp, int mss,
+ const struct sk_buff *skb)
{
if (skb->len < mss)
tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
@@ -1434,7 +1444,8 @@
*/
static __inline__ int
-tcp_nagle_check(struct tcp_opt *tp, struct sk_buff *skb, unsigned mss_now, int nonagle)
+tcp_nagle_check(const struct tcp_opt *tp, const struct sk_buff *skb,
+ unsigned mss_now, int nonagle)
{
return (skb->len < mss_now &&
!(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) &&
@@ -1449,7 +1460,8 @@
/* This checks if the data bearing packet SKB (usually sk->sk_send_head)
* should be put on the wire right now.
*/
-static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
+static __inline__ int tcp_snd_test(const struct tcp_opt *tp,
+ struct sk_buff *skb,
unsigned cur_mss, int nonagle)
{
int pkts = tcp_skb_pcount(skb);
@@ -1496,7 +1508,8 @@
tcp_reset_xmit_timer(sk, TCP_TIME_PROBE0, tp->rto);
}
-static __inline__ int tcp_skb_is_last(struct sock *sk, struct sk_buff *skb)
+static __inline__ int tcp_skb_is_last(const struct sock *sk,
+ const struct sk_buff *skb)
{
return skb->next == (struct sk_buff *)&sk->sk_write_queue;
}
@@ -1547,7 +1560,7 @@
tp->snd_wl1 = seq;
}
-extern void tcp_destroy_sock(struct sock *sk);
+extern void tcp_destroy_sock(struct sock *sk);
/*
@@ -1621,7 +1634,7 @@
#undef STATE_TRACE
#ifdef STATE_TRACE
-static char *statename[]={
+static const char *statename[]={
"Unused","Established","Syn Sent","Syn Recv",
"Fin Wait 1","Fin Wait 2","Time Wait", "Close",
"Close Wait","Last ACK","Listen","Closing"
@@ -1892,17 +1905,17 @@
wake_up(&tcp_lhash_wait);
}
-static inline int keepalive_intvl_when(struct tcp_opt *tp)
+static inline int keepalive_intvl_when(const struct tcp_opt *tp)
{
return tp->keepalive_intvl ? : sysctl_tcp_keepalive_intvl;
}
-static inline int keepalive_time_when(struct tcp_opt *tp)
+static inline int keepalive_time_when(const struct tcp_opt *tp)
{
return tp->keepalive_time ? : sysctl_tcp_keepalive_time;
}
-static inline int tcp_fin_time(struct tcp_opt *tp)
+static inline int tcp_fin_time(const struct tcp_opt *tp)
{
int fin_timeout = tp->linger2 ? : sysctl_tcp_fin_timeout;
@@ -1912,7 +1925,7 @@
return fin_timeout;
}
-static inline int tcp_paws_check(struct tcp_opt *tp, int rst)
+static inline int tcp_paws_check(const struct tcp_opt *tp, int rst)
{
if ((s32)(tp->rcv_tsval - tp->ts_recent) >= 0)
return 0;
diff -Nru a/net/ipv4/tcp.c b/net/ipv4/tcp.c
--- a/net/ipv4/tcp.c 2004-12-24 13:36:31 -08:00
+++ b/net/ipv4/tcp.c 2004-12-24 13:36:31 -08:00
@@ -467,7 +467,7 @@
sk->sk_max_ack_backlog = 0;
sk->sk_ack_backlog = 0;
tp->accept_queue = tp->accept_queue_tail = NULL;
- tp->syn_wait_lock = RW_LOCK_UNLOCKED;
+ rwlock_init(&tp->syn_wait_lock);
tcp_delack_init(tp);
lopt = kmalloc(sizeof(struct tcp_listen_opt), GFP_KERNEL);
@@ -2095,6 +2095,65 @@
return err;
}
+/* Return information about state of tcp endpoint in API format. */
+void tcp_get_info(struct sock *sk, struct tcp_info *info)
+{
+ struct tcp_opt *tp = tcp_sk(sk);
+ u32 now = tcp_time_stamp;
+
+ memset(info, 0, sizeof(*info));
+
+ info->tcpi_state = sk->sk_state;
+ info->tcpi_ca_state = tp->ca_state;
+ info->tcpi_retransmits = tp->retransmits;
+ info->tcpi_probes = tp->probes_out;
+ info->tcpi_backoff = tp->backoff;
+
+ if (tp->tstamp_ok)
+ info->tcpi_options |= TCPI_OPT_TIMESTAMPS;
+ if (tp->sack_ok)
+ info->tcpi_options |= TCPI_OPT_SACK;
+ if (tp->wscale_ok) {
+ info->tcpi_options |= TCPI_OPT_WSCALE;
+ info->tcpi_snd_wscale = tp->snd_wscale;
+ info->tcpi_rcv_wscale = tp->rcv_wscale;
+ }
+
+ if (tp->ecn_flags&TCP_ECN_OK)
+ info->tcpi_options |= TCPI_OPT_ECN;
+
+ info->tcpi_rto = jiffies_to_usecs(tp->rto);
+ info->tcpi_ato = jiffies_to_usecs(tp->ack.ato);
+ info->tcpi_snd_mss = tp->mss_cache_std;
+ info->tcpi_rcv_mss = tp->ack.rcv_mss;
+
+ info->tcpi_unacked = tcp_get_pcount(&tp->packets_out);
+ info->tcpi_sacked = tcp_get_pcount(&tp->sacked_out);
+ info->tcpi_lost = tcp_get_pcount(&tp->lost_out);
+ info->tcpi_retrans = tcp_get_pcount(&tp->retrans_out);
+ info->tcpi_fackets = tcp_get_pcount(&tp->fackets_out);
+
+ info->tcpi_last_data_sent = jiffies_to_msecs(now - tp->lsndtime);
+ info->tcpi_last_data_recv = jiffies_to_msecs(now - tp->ack.lrcvtime);
+ info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp);
+
+ info->tcpi_pmtu = tp->pmtu_cookie;
+ info->tcpi_rcv_ssthresh = tp->rcv_ssthresh;
+ info->tcpi_rtt = jiffies_to_usecs(tp->srtt)>>3;
+ info->tcpi_rttvar = jiffies_to_usecs(tp->mdev)>>2;
+ info->tcpi_snd_ssthresh = tp->snd_ssthresh;
+ info->tcpi_snd_cwnd = tp->snd_cwnd;
+ info->tcpi_advmss = tp->advmss;
+ info->tcpi_reordering = tp->reordering;
+
+ info->tcpi_rcv_rtt = jiffies_to_usecs(tp->rcv_rtt_est.rtt)>>3;
+ info->tcpi_rcv_space = tp->rcvq_space.space;
+
+ info->tcpi_total_retrans = tp->total_retrans;
+}
+
+EXPORT_SYMBOL_GPL(tcp_get_info);
+
int tcp_getsockopt(struct sock *sk, int level, int optname, char __user *optval,
int __user *optlen)
{
@@ -2250,7 +2309,7 @@
if (!tcp_ehash)
panic("Failed to allocate TCP established hash table\n");
for (i = 0; i < (tcp_ehash_size << 1); i++) {
- tcp_ehash[i].lock = RW_LOCK_UNLOCKED;
+ rwlock_init(&tcp_ehash[i].lock);
INIT_HLIST_HEAD(&tcp_ehash[i].chain);
}
@@ -2266,7 +2325,7 @@
if (!tcp_bhash)
panic("Failed to allocate TCP bind hash table\n");
for (i = 0; i < tcp_bhash_size; i++) {
- tcp_bhash[i].lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&tcp_bhash[i].lock);
INIT_HLIST_HEAD(&tcp_bhash[i].chain);
}
@@ -2301,13 +2360,10 @@
printk(KERN_INFO "TCP: Hash tables configured "
"(established %d bind %d)\n",
tcp_ehash_size << 1, tcp_bhash_size);
-
- tcpdiag_init();
}
EXPORT_SYMBOL(tcp_accept);
EXPORT_SYMBOL(tcp_close);
-EXPORT_SYMBOL(tcp_close_state);
EXPORT_SYMBOL(tcp_destroy_sock);
EXPORT_SYMBOL(tcp_disconnect);
EXPORT_SYMBOL(tcp_getsockopt);
diff -Nru a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
--- a/net/ipv4/tcp_diag.c 2004-12-24 13:36:17 -08:00
+++ b/net/ipv4/tcp_diag.c 2004-12-24 13:36:17 -08:00
@@ -18,6 +18,7 @@
#include <linux/random.h>
#include <linux/cache.h>
#include <linux/init.h>
+#include <linux/time.h>
#include <net/icmp.h>
#include <net/tcp.h>
@@ -29,6 +30,16 @@
#include <linux/tcp_diag.h>
+struct tcpdiag_entry
+{
+ u32 *saddr;
+ u32 *daddr;
+ u16 sport;
+ u16 dport;
+ u16 family;
+ u16 userlocks;
+};
+
static struct sock *tcpnl;
@@ -41,63 +52,8 @@
rta->rta_len = rtalen; \
RTA_DATA(rta); })
-/* Return information about state of tcp endpoint in API format. */
-void tcp_get_info(struct sock *sk, struct tcp_info *info)
-{
- struct tcp_opt *tp = tcp_sk(sk);
- u32 now = tcp_time_stamp;
-
- memset(info, 0, sizeof(*info));
-
- info->tcpi_state = sk->sk_state;
- info->tcpi_ca_state = tp->ca_state;
- info->tcpi_retransmits = tp->retransmits;
- info->tcpi_probes = tp->probes_out;
- info->tcpi_backoff = tp->backoff;
-
- if (tp->tstamp_ok)
- info->tcpi_options |= TCPI_OPT_TIMESTAMPS;
- if (tp->sack_ok)
- info->tcpi_options |= TCPI_OPT_SACK;
- if (tp->wscale_ok) {
- info->tcpi_options |= TCPI_OPT_WSCALE;
- info->tcpi_snd_wscale = tp->snd_wscale;
- info->tcpi_rcv_wscale = tp->rcv_wscale;
- }
-
- if (tp->ecn_flags&TCP_ECN_OK)
- info->tcpi_options |= TCPI_OPT_ECN;
-
- info->tcpi_rto = jiffies_to_usecs(tp->rto);
- info->tcpi_ato = jiffies_to_usecs(tp->ack.ato);
- info->tcpi_snd_mss = tp->mss_cache_std;
- info->tcpi_rcv_mss = tp->ack.rcv_mss;
-
- info->tcpi_unacked = tcp_get_pcount(&tp->packets_out);
- info->tcpi_sacked = tcp_get_pcount(&tp->sacked_out);
- info->tcpi_lost = tcp_get_pcount(&tp->lost_out);
- info->tcpi_retrans = tcp_get_pcount(&tp->retrans_out);
- info->tcpi_fackets = tcp_get_pcount(&tp->fackets_out);
-
- info->tcpi_last_data_sent = jiffies_to_msecs(now - tp->lsndtime);
- info->tcpi_last_data_recv = jiffies_to_msecs(now - tp->ack.lrcvtime);
- info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp);
-
- info->tcpi_pmtu = tp->pmtu_cookie;
- info->tcpi_rcv_ssthresh = tp->rcv_ssthresh;
- info->tcpi_rtt = jiffies_to_usecs(tp->srtt)>>3;
- info->tcpi_rttvar = jiffies_to_usecs(tp->mdev)>>2;
- info->tcpi_snd_ssthresh = tp->snd_ssthresh;
- info->tcpi_snd_cwnd = tp->snd_cwnd;
- info->tcpi_advmss = tp->advmss;
- info->tcpi_reordering = tp->reordering;
-
- info->tcpi_rcv_rtt = jiffies_to_usecs(tp->rcv_rtt_est.rtt)>>3;
- info->tcpi_rcv_space = tp->rcvq_space.space;
-}
-
static int tcpdiag_fill(struct sk_buff *skb, struct sock *sk,
- int ext, u32 pid, u32 seq)
+ int ext, u32 pid, u32 seq, u16 nlmsg_flags)
{
struct inet_opt *inet = inet_sk(sk);
struct tcp_opt *tp = tcp_sk(sk);
@@ -109,6 +65,7 @@
unsigned char *b = skb->tail;
nlh = NLMSG_PUT(skb, pid, seq, TCPDIAG_GETSOCK, sizeof(*r));
+ nlh->nlmsg_flags = nlmsg_flags;
r = NLMSG_DATA(nlh);
if (sk->sk_state != TCP_TIME_WAIT) {
if (ext & (1<<(TCPDIAG_MEMINFO-1)))
@@ -146,7 +103,7 @@
r->tcpdiag_wqueue = 0;
r->tcpdiag_uid = 0;
r->tcpdiag_inode = 0;
-#ifdef CONFIG_IPV6
+#ifdef CONFIG_IP_TCPDIAG_IPV6
if (r->tcpdiag_family == AF_INET6) {
ipv6_addr_copy((struct in6_addr *)r->id.tcpdiag_src,
&tw->tw_v6_rcv_saddr);
@@ -163,7 +120,7 @@
r->id.tcpdiag_src[0] = inet->rcv_saddr;
r->id.tcpdiag_dst[0] = inet->daddr;
-#ifdef CONFIG_IPV6
+#ifdef CONFIG_IP_TCPDIAG_IPV6
if (r->tcpdiag_family == AF_INET6) {
struct ipv6_pinfo *np = inet6_sk(sk);
@@ -231,11 +188,19 @@
return -1;
}
-extern struct sock *tcp_v4_lookup(u32 saddr, u16 sport, u32 daddr, u16 dport, int dif);
-#ifdef CONFIG_IPV6
+extern struct sock *tcp_v4_lookup(u32 saddr, u16 sport, u32 daddr, u16 dport,
+ int dif);
+#ifdef CONFIG_IP_TCPDIAG_IPV6
extern struct sock *tcp_v6_lookup(struct in6_addr *saddr, u16 sport,
struct in6_addr *daddr, u16 dport,
int dif);
+#else
+static inline struct sock *tcp_v6_lookup(struct in6_addr *saddr, u16 sport,
+ struct in6_addr *daddr, u16 dport,
+ int dif)
+{
+ return NULL;
+}
#endif
static int tcpdiag_get_exact(struct sk_buff *in_skb, const struct nlmsghdr *nlh)
@@ -250,7 +215,7 @@
req->id.tcpdiag_src[0], req->id.tcpdiag_sport,
req->id.tcpdiag_if);
}
-#ifdef CONFIG_IPV6
+#ifdef CONFIG_IP_TCPDIAG_IPV6
else if (req->tcpdiag_family == AF_INET6) {
sk = tcp_v6_lookup((struct in6_addr*)req->id.tcpdiag_dst, req->id.tcpdiag_dport,
(struct in6_addr*)req->id.tcpdiag_src, req->id.tcpdiag_sport,
@@ -280,7 +245,7 @@
if (tcpdiag_fill(rep, sk, req->tcpdiag_ext,
NETLINK_CB(in_skb).pid,
- nlh->nlmsg_seq) <= 0)
+ nlh->nlmsg_seq, 0) <= 0)
BUG();
err = netlink_unicast(tcpnl, rep, NETLINK_CB(in_skb).pid, MSG_DONTWAIT);
@@ -324,11 +289,11 @@
}
-static int tcpdiag_bc_run(const void *bc, int len, struct sock *sk)
+static int tcpdiag_bc_run(const void *bc, int len,
+ const struct tcpdiag_entry *entry)
{
while (len > 0) {
int yes = 1;
- struct inet_opt *inet = inet_sk(sk);
const struct tcpdiag_bc_op *op = bc;
switch (op->code) {
@@ -338,19 +303,19 @@
yes = 0;
break;
case TCPDIAG_BC_S_GE:
- yes = inet->num >= op[1].no;
+ yes = entry->sport >= op[1].no;
break;
case TCPDIAG_BC_S_LE:
- yes = inet->num <= op[1].no;
+ yes = entry->dport <= op[1].no;
break;
case TCPDIAG_BC_D_GE:
- yes = ntohs(inet->dport) >= op[1].no;
+ yes = entry->dport >= op[1].no;
break;
case TCPDIAG_BC_D_LE:
- yes = ntohs(inet->dport) <= op[1].no;
+ yes = entry->dport <= op[1].no;
break;
case TCPDIAG_BC_AUTO:
- yes = !(sk->sk_userlocks & SOCK_BINDPORT_LOCK);
+ yes = !(entry->userlocks & SOCK_BINDPORT_LOCK);
break;
case TCPDIAG_BC_S_COND:
case TCPDIAG_BC_D_COND:
@@ -360,7 +325,7 @@
if (cond->port != -1 &&
cond->port != (op->code == TCPDIAG_BC_S_COND ?
- inet->num : ntohs(inet->dport))) {
+ entry->sport : entry->dport)) {
yes = 0;
break;
}
@@ -368,26 +333,14 @@
if (cond->prefix_len == 0)
break;
-#ifdef CONFIG_IPV6
- if (sk->sk_family == AF_INET6) {
- struct ipv6_pinfo *np = inet6_sk(sk);
-
- if (op->code == TCPDIAG_BC_S_COND)
- addr = (u32*)&np->rcv_saddr;
- else
- addr = (u32*)&np->daddr;
- } else
-#endif
- {
- if (op->code == TCPDIAG_BC_S_COND)
- addr = &inet->rcv_saddr;
- else
- addr = &inet->daddr;
- }
+ if (op->code == TCPDIAG_BC_S_COND)
+ addr = entry->saddr;
+ else
+ addr = entry->daddr;
if (bitstring_match(addr, cond->addr, cond->prefix_len))
break;
- if (sk->sk_family == AF_INET6 &&
+ if (entry->family == AF_INET6 &&
cond->family == AF_INET) {
if (addr[0] == 0 && addr[1] == 0 &&
addr[2] == htonl(0xffff) &&
@@ -466,16 +419,182 @@
return len == 0 ? 0 : -EINVAL;
}
+static int tcpdiag_dump_sock(struct sk_buff *skb, struct sock *sk,
+ struct netlink_callback *cb)
+{
+ struct tcpdiagreq *r = NLMSG_DATA(cb->nlh);
+
+ if (cb->nlh->nlmsg_len > 4 + NLMSG_SPACE(sizeof(*r))) {
+ struct tcpdiag_entry entry;
+ struct rtattr *bc = (struct rtattr *)(r + 1);
+ struct inet_opt *inet = inet_sk(sk);
+
+ entry.family = sk->sk_family;
+#ifdef CONFIG_IP_TCPDIAG_IPV6
+ if (entry.family == AF_INET6) {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+
+ entry.saddr = np->rcv_saddr.s6_addr32;
+ entry.daddr = np->daddr.s6_addr32;
+ } else
+#endif
+ {
+ entry.saddr = &inet->rcv_saddr;
+ entry.daddr = &inet->daddr;
+ }
+ entry.sport = inet->num;
+ entry.dport = ntohs(inet->dport);
+ entry.userlocks = sk->sk_userlocks;
+
+ if (!tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), &entry))
+ return 0;
+ }
+
+ return tcpdiag_fill(skb, sk, r->tcpdiag_ext, NETLINK_CB(cb->skb).pid,
+ cb->nlh->nlmsg_seq, NLM_F_MULTI);
+}
+
+static int tcpdiag_fill_req(struct sk_buff *skb, struct sock *sk,
+ struct open_request *req,
+ u32 pid, u32 seq)
+{
+ struct inet_opt *inet = inet_sk(sk);
+ unsigned char *b = skb->tail;
+ struct tcpdiagmsg *r;
+ struct nlmsghdr *nlh;
+ long tmo;
+
+ nlh = NLMSG_PUT(skb, pid, seq, TCPDIAG_GETSOCK, sizeof(*r));
+ nlh->nlmsg_flags = NLM_F_MULTI;
+ r = NLMSG_DATA(nlh);
+
+ r->tcpdiag_family = sk->sk_family;
+ r->tcpdiag_state = TCP_SYN_RECV;
+ r->tcpdiag_timer = 1;
+ r->tcpdiag_retrans = req->retrans;
+
+ r->id.tcpdiag_if = sk->sk_bound_dev_if;
+ r->id.tcpdiag_cookie[0] = (u32)(unsigned long)req;
+ r->id.tcpdiag_cookie[1] = (u32)(((unsigned long)req >> 31) >> 1);
+
+ tmo = req->expires - jiffies;
+ if (tmo < 0)
+ tmo = 0;
+
+ r->id.tcpdiag_sport = inet->sport;
+ r->id.tcpdiag_dport = req->rmt_port;
+ r->id.tcpdiag_src[0] = req->af.v4_req.loc_addr;
+ r->id.tcpdiag_dst[0] = req->af.v4_req.rmt_addr;
+ r->tcpdiag_expires = jiffies_to_msecs(tmo),
+ r->tcpdiag_rqueue = 0;
+ r->tcpdiag_wqueue = 0;
+ r->tcpdiag_uid = sock_i_uid(sk);
+ r->tcpdiag_inode = 0;
+#ifdef CONFIG_IP_TCPDIAG_IPV6
+ if (r->tcpdiag_family == AF_INET6) {
+ ipv6_addr_copy((struct in6_addr *)r->id.tcpdiag_src,
+ &req->af.v6_req.loc_addr);
+ ipv6_addr_copy((struct in6_addr *)r->id.tcpdiag_dst,
+ &req->af.v6_req.rmt_addr);
+ }
+#endif
+ nlh->nlmsg_len = skb->tail - b;
+
+ return skb->len;
+
+nlmsg_failure:
+ skb_trim(skb, b - skb->data);
+ return -1;
+}
+
+static int tcpdiag_dump_reqs(struct sk_buff *skb, struct sock *sk,
+ struct netlink_callback *cb)
+{
+ struct tcpdiag_entry entry;
+ struct tcpdiagreq *r = NLMSG_DATA(cb->nlh);
+ struct tcp_opt *tp = tcp_sk(sk);
+ struct tcp_listen_opt *lopt;
+ struct rtattr *bc = NULL;
+ struct inet_opt *inet = inet_sk(sk);
+ int j, s_j;
+ int reqnum, s_reqnum;
+ int err = 0;
+
+ s_j = cb->args[3];
+ s_reqnum = cb->args[4];
+
+ if (s_j > 0)
+ s_j--;
+
+ entry.family = sk->sk_family;
+
+ read_lock_bh(&tp->syn_wait_lock);
+
+ lopt = tp->listen_opt;
+ if (!lopt || !lopt->qlen)
+ goto out;
+
+ if (cb->nlh->nlmsg_len > 4 + NLMSG_SPACE(sizeof(*r))) {
+ bc = (struct rtattr *)(r + 1);
+ entry.sport = inet->num;
+ entry.userlocks = sk->sk_userlocks;
+ }
+
+ for (j = s_j; j < TCP_SYNQ_HSIZE; j++) {
+ struct open_request *req, *head = lopt->syn_table[j];
+
+ reqnum = 0;
+ for (req = head; req; reqnum++, req = req->dl_next) {
+ if (reqnum < s_reqnum)
+ continue;
+ if (r->id.tcpdiag_dport != req->rmt_port &&
+ r->id.tcpdiag_dport)
+ continue;
+
+ if (bc) {
+ entry.saddr =
+#ifdef CONFIG_IP_TCPDIAG_IPV6
+ (entry.family == AF_INET6) ?
+ req->af.v6_req.loc_addr.s6_addr32 :
+#endif
+ &req->af.v4_req.loc_addr;
+ entry.daddr =
+#ifdef CONFIG_IP_TCPDIAG_IPV6
+ (entry.family == AF_INET6) ?
+ req->af.v6_req.rmt_addr.s6_addr32 :
+#endif
+ &req->af.v4_req.rmt_addr;
+ entry.dport = ntohs(req->rmt_port);
+
+ if (!tcpdiag_bc_run(RTA_DATA(bc),
+ RTA_PAYLOAD(bc), &entry))
+ continue;
+ }
+
+ err = tcpdiag_fill_req(skb, sk, req,
+ NETLINK_CB(cb->skb).pid,
+ cb->nlh->nlmsg_seq);
+ if (err < 0) {
+ cb->args[3] = j + 1;
+ cb->args[4] = reqnum;
+ goto out;
+ }
+ }
+
+ s_reqnum = 0;
+ }
+
+out:
+ read_unlock_bh(&tp->syn_wait_lock);
+
+ return err;
+}
static int tcpdiag_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
int i, num;
int s_i, s_num;
struct tcpdiagreq *r = NLMSG_DATA(cb->nlh);
- struct rtattr *bc = NULL;
-
- if (cb->nlh->nlmsg_len > 4+NLMSG_SPACE(sizeof(struct tcpdiagreq)))
- bc = (struct rtattr*)(r+1);
s_i = cb->args[1];
s_num = num = cb->args[2];
@@ -488,31 +607,47 @@
struct sock *sk;
struct hlist_node *node;
- if (i > s_i)
- s_num = 0;
-
num = 0;
sk_for_each(sk, node, &tcp_listening_hash[i]) {
struct inet_opt *inet = inet_sk(sk);
- if (num < s_num)
- goto next_listen;
- if (!(r->tcpdiag_states&TCPF_LISTEN) ||
- r->id.tcpdiag_dport)
- goto next_listen;
+
+ if (num < s_num) {
+ num++;
+ continue;
+ }
+
if (r->id.tcpdiag_sport != inet->sport &&
r->id.tcpdiag_sport)
goto next_listen;
- if (bc && !tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), sk))
+
+ if (!(r->tcpdiag_states&TCPF_LISTEN) ||
+ r->id.tcpdiag_dport ||
+ cb->args[3] > 0)
+ goto syn_recv;
+
+ if (tcpdiag_dump_sock(skb, sk, cb) < 0) {
+ tcp_listen_unlock();
+ goto done;
+ }
+
+syn_recv:
+ if (!(r->tcpdiag_states&TCPF_SYN_RECV))
goto next_listen;
- if (tcpdiag_fill(skb, sk, r->tcpdiag_ext,
- NETLINK_CB(cb->skb).pid,
- cb->nlh->nlmsg_seq) <= 0) {
+
+ if (tcpdiag_dump_reqs(skb, sk, cb) < 0) {
tcp_listen_unlock();
goto done;
}
+
next_listen:
+ cb->args[3] = 0;
+ cb->args[4] = 0;
++num;
}
+
+ s_num = 0;
+ cb->args[3] = 0;
+ cb->args[4] = 0;
}
tcp_listen_unlock();
skip_listen_ht:
@@ -546,11 +681,7 @@
goto next_normal;
if (r->id.tcpdiag_dport != inet->dport && r->id.tcpdiag_dport)
goto next_normal;
- if (bc && !tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), sk))
- goto next_normal;
- if (tcpdiag_fill(skb, sk, r->tcpdiag_ext,
- NETLINK_CB(cb->skb).pid,
- cb->nlh->nlmsg_seq) <= 0) {
+ if (tcpdiag_dump_sock(skb, sk, cb) < 0) {
read_unlock_bh(&head->lock);
goto done;
}
@@ -571,11 +702,7 @@
if (r->id.tcpdiag_dport != inet->dport &&
r->id.tcpdiag_dport)
goto next_dying;
- if (bc && !tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), sk))
- goto next_dying;
- if (tcpdiag_fill(skb, sk, r->tcpdiag_ext,
- NETLINK_CB(cb->skb).pid,
- cb->nlh->nlmsg_seq) <= 0) {
+ if (tcpdiag_dump_sock(skb, sk, cb) < 0) {
read_unlock_bh(&head->lock);
goto done;
}
@@ -657,9 +784,19 @@
}
}
-void __init tcpdiag_init(void)
+static int __init tcpdiag_init(void)
{
tcpnl = netlink_kernel_create(NETLINK_TCPDIAG, tcpdiag_rcv);
if (tcpnl == NULL)
- panic("tcpdiag_init: Cannot create netlink socket.");
+ return -ENOMEM;
+ return 0;
}
+
+static void __exit tcpdiag_exit(void)
+{
+ sock_release(tcpnl->sk_socket);
+}
+
+module_init(tcpdiag_init);
+module_exit(tcpdiag_exit);
+MODULE_LICENSE("GPL");
diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c 2004-12-24 13:37:04 -08:00
+++ b/net/ipv4/tcp_input.c 2004-12-24 13:37:04 -08:00
@@ -2369,25 +2369,19 @@
{
struct tcp_opt *tp = tcp_sk(sk);
struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
- __u32 mss = tcp_skb_mss(skb);
- __u32 snd_una = tp->snd_una;
- __u32 orig_seq, seq;
- __u32 packets_acked = 0;
+ __u32 seq = tp->snd_una;
+ __u32 packets_acked;
int acked = 0;
/* If we get here, the whole TSO packet has not been
* acked.
*/
- BUG_ON(!after(scb->end_seq, snd_una));
+ BUG_ON(!after(scb->end_seq, seq));
- seq = orig_seq = scb->seq;
- while (!after(seq + mss, snd_una)) {
- packets_acked++;
- seq += mss;
- }
-
- if (tcp_trim_head(sk, skb, (seq - orig_seq)))
+ packets_acked = tcp_skb_pcount(skb);
+ if (tcp_trim_head(sk, skb, seq - scb->seq))
return 0;
+ packets_acked -= tcp_skb_pcount(skb);
if (packets_acked) {
__u8 sacked = scb->sacked;
@@ -3034,8 +3028,8 @@
tp->snd_wscale = *(__u8 *)ptr;
if(tp->snd_wscale > 14) {
if(net_ratelimit())
- printk("tcp_parse_options: Illegal window "
- "scaling value %d >14 received.",
+ printk(KERN_INFO "tcp_parse_options: Illegal window "
+ "scaling value %d >14 received.\n",
tp->snd_wscale);
tp->snd_wscale = 14;
}
@@ -4963,7 +4957,6 @@
EXPORT_SYMBOL(sysctl_tcp_ecn);
EXPORT_SYMBOL(sysctl_tcp_reordering);
-EXPORT_SYMBOL(tcp_cwnd_application_limited);
EXPORT_SYMBOL(tcp_parse_options);
EXPORT_SYMBOL(tcp_rcv_established);
EXPORT_SYMBOL(tcp_rcv_state_process);
diff -Nru a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
--- a/net/ipv4/tcp_ipv4.c 2004-12-24 13:36:34 -08:00
+++ b/net/ipv4/tcp_ipv4.c 2004-12-24 13:36:34 -08:00
@@ -448,8 +448,8 @@
}
/* Optimize the common listener case. */
-inline struct sock *tcp_v4_lookup_listener(u32 daddr, unsigned short hnum,
- int dif)
+static inline struct sock *tcp_v4_lookup_listener(u32 daddr,
+ unsigned short hnum, int dif)
{
struct sock *sk = NULL;
struct hlist_head *head;
@@ -535,6 +535,8 @@
return sk;
}
+EXPORT_SYMBOL_GPL(tcp_v4_lookup);
+
static inline __u32 tcp_v4_init_sequence(struct sock *sk, struct sk_buff *skb)
{
return secure_tcp_sequence_number(skb->nh.iph->daddr,
@@ -2596,6 +2598,7 @@
struct proto tcp_prot = {
.name = "TCP",
+ .owner = THIS_MODULE,
.close = tcp_close,
.connect = tcp_v4_connect,
.disconnect = tcp_disconnect,
@@ -2653,7 +2656,6 @@
EXPORT_SYMBOL(tcp_v4_conn_request);
EXPORT_SYMBOL(tcp_v4_connect);
EXPORT_SYMBOL(tcp_v4_do_rcv);
-EXPORT_SYMBOL(tcp_v4_lookup_listener);
EXPORT_SYMBOL(tcp_v4_rebuild_header);
EXPORT_SYMBOL(tcp_v4_remember_stamp);
EXPORT_SYMBOL(tcp_v4_send_check);
@@ -2663,8 +2665,7 @@
EXPORT_SYMBOL(tcp_proc_register);
EXPORT_SYMBOL(tcp_proc_unregister);
#endif
-#ifdef CONFIG_SYSCTL
EXPORT_SYMBOL(sysctl_local_port_range);
EXPORT_SYMBOL(sysctl_max_syn_backlog);
EXPORT_SYMBOL(sysctl_tcp_low_latency);
-#endif
+
diff -Nru a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
--- a/net/ipv4/tcp_minisocks.c 2004-12-24 13:37:04 -08:00
+++ b/net/ipv4/tcp_minisocks.c 2004-12-24 13:37:04 -08:00
@@ -706,7 +706,7 @@
sock_lock_init(newsk);
bh_lock_sock(newsk);
- newsk->sk_dst_lock = RW_LOCK_UNLOCKED;
+ rwlock_init(&newsk->sk_dst_lock);
atomic_set(&newsk->sk_rmem_alloc, 0);
skb_queue_head_init(&newsk->sk_receive_queue);
atomic_set(&newsk->sk_wmem_alloc, 0);
@@ -719,7 +719,7 @@
newsk->sk_userlocks = sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
newsk->sk_backlog.head = newsk->sk_backlog.tail = NULL;
newsk->sk_send_head = NULL;
- newsk->sk_callback_lock = RW_LOCK_UNLOCKED;
+ rwlock_init(&newsk->sk_callback_lock);
skb_queue_head_init(&newsk->sk_error_queue);
newsk->sk_write_space = sk_stream_write_space;
@@ -1075,7 +1075,3 @@
EXPORT_SYMBOL(tcp_create_openreq_child);
EXPORT_SYMBOL(tcp_timewait_state_process);
EXPORT_SYMBOL(tcp_tw_deschedule);
-
-#ifdef CONFIG_SYSCTL
-EXPORT_SYMBOL(sysctl_tcp_tw_recycle);
-#endif
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c 2004-12-24 13:37:01 -08:00
+++ b/net/ipv4/tcp_output.c 2004-12-24 13:37:01 -08:00
@@ -455,9 +455,13 @@
{
struct tcp_opt *tp = tcp_sk(sk);
struct sk_buff *buff;
- int nsize = skb->len - len;
+ int nsize;
u16 flags;
+ nsize = skb_headlen(skb) - len;
+ if (nsize < 0)
+ nsize = 0;
+
if (skb_cloned(skb) &&
skb_is_nonlinear(skb) &&
pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
@@ -562,8 +566,6 @@
int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
{
- struct tcp_opt *tp = tcp_sk(sk);
-
if (skb_cloned(skb) &&
pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
return -ENOMEM;
@@ -586,7 +588,8 @@
/* Any change of skb->len requires recalculation of tso
* factor and mss.
*/
- tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
+ if (tcp_skb_pcount(skb) > 1)
+ tcp_set_skb_tso_segs(skb, tcp_skb_mss(skb));
return 0;
}
@@ -1102,6 +1105,8 @@
/* Update global TCP statistics. */
TCP_INC_STATS(TCP_MIB_RETRANSSEGS);
+ tp->total_retrans++;
+
#if FASTRETRANS_DEBUG > 0
if (TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_RETRANS) {
if (net_ratelimit())
@@ -1715,12 +1720,7 @@
}
}
-EXPORT_SYMBOL(tcp_acceptable_seq);
EXPORT_SYMBOL(tcp_connect);
-EXPORT_SYMBOL(tcp_connect_init);
EXPORT_SYMBOL(tcp_make_synack);
-EXPORT_SYMBOL(tcp_send_synack);
EXPORT_SYMBOL(tcp_simple_retransmit);
EXPORT_SYMBOL(tcp_sync_mss);
-EXPORT_SYMBOL(tcp_write_wakeup);
-EXPORT_SYMBOL(tcp_write_xmit);
diff -Nru a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
--- a/net/ipv4/tcp_timer.c 2004-12-24 13:37:19 -08:00
+++ b/net/ipv4/tcp_timer.c 2004-12-24 13:37:19 -08:00
@@ -36,7 +36,9 @@
static void tcp_delack_timer(unsigned long);
static void tcp_keepalive_timer (unsigned long data);
-const char timer_bug_msg[] = KERN_DEBUG "tcpbug: unknown timer value\n";
+#ifdef TCP_DEBUG
+const char tcp_timer_bug_msg[] = KERN_DEBUG "tcpbug: unknown timer value\n";
+#endif
/*
* Using different timers for retransmit, delayed acks and probes
@@ -651,3 +653,6 @@
EXPORT_SYMBOL(tcp_delete_keepalive_timer);
EXPORT_SYMBOL(tcp_init_xmit_timers);
EXPORT_SYMBOL(tcp_reset_keepalive_timer);
+#ifdef TCP_DEBUG
+EXPORT_SYMBOL(tcp_timer_bug_msg);
+#endif
diff -Nru a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
--- a/net/ipv6/tcp_ipv6.c 2004-12-24 13:36:56 -08:00
+++ b/net/ipv6/tcp_ipv6.c 2004-12-24 13:36:56 -08:00
@@ -262,7 +262,7 @@
score = 1;
if (!ipv6_addr_any(&np->rcv_saddr)) {
- if (ipv6_addr_cmp(&np->rcv_saddr, daddr))
+ if (!ipv6_addr_equal(&np->rcv_saddr, daddr))
continue;
score++;
}
@@ -321,8 +321,8 @@
if(*((__u32 *)&(tw->tw_dport)) == ports &&
sk->sk_family == PF_INET6) {
- if(!ipv6_addr_cmp(&tw->tw_v6_daddr, saddr) &&
- !ipv6_addr_cmp(&tw->tw_v6_rcv_saddr, daddr) &&
+ if(ipv6_addr_equal(&tw->tw_v6_daddr, saddr) &&
+ ipv6_addr_equal(&tw->tw_v6_rcv_saddr, daddr) &&
(!sk->sk_bound_dev_if || sk->sk_bound_dev_if == dif))
goto hit;
}
@@ -364,6 +364,8 @@
return sk;
}
+EXPORT_SYMBOL_GPL(tcp_v6_lookup);
+
/*
* Open request hash tables.
@@ -404,8 +406,8 @@
prev = &req->dl_next) {
if (req->rmt_port == rport &&
req->class->family == AF_INET6 &&
- !ipv6_addr_cmp(&req->af.v6_req.rmt_addr, raddr) &&
- !ipv6_addr_cmp(&req->af.v6_req.loc_addr, laddr) &&
+ ipv6_addr_equal(&req->af.v6_req.rmt_addr, raddr) &&
+ ipv6_addr_equal(&req->af.v6_req.loc_addr, laddr) &&
(!req->af.v6_req.iif || req->af.v6_req.iif == iif)) {
BUG_TRAP(req->sk == NULL);
*prevp = prev;
@@ -461,8 +463,8 @@
if(*((__u32 *)&(tw->tw_dport)) == ports &&
sk2->sk_family == PF_INET6 &&
- !ipv6_addr_cmp(&tw->tw_v6_daddr, saddr) &&
- !ipv6_addr_cmp(&tw->tw_v6_rcv_saddr, daddr) &&
+ ipv6_addr_equal(&tw->tw_v6_daddr, saddr) &&
+ ipv6_addr_equal(&tw->tw_v6_rcv_saddr, daddr) &&
sk2->sk_bound_dev_if == sk->sk_bound_dev_if) {
struct tcp_opt *tp = tcp_sk(sk);
@@ -608,7 +610,7 @@
}
if (tp->ts_recent_stamp &&
- ipv6_addr_cmp(&np->daddr, &usin->sin6_addr)) {
+ !ipv6_addr_equal(&np->daddr, &usin->sin6_addr)) {
tp->ts_recent = 0;
tp->ts_recent_stamp = 0;
tp->write_seq = 0;
@@ -1802,6 +1804,7 @@
struct ipv6_pinfo *np = inet6_sk(sk);
struct flowi fl;
struct dst_entry *dst;
+ struct in6_addr *final_p = NULL, final;
memset(&fl, 0, sizeof(fl));
fl.proto = IPPROTO_TCP;
@@ -1815,7 +1818,9 @@
if (np->opt && np->opt->srcrt) {
struct rt0_hdr *rt0 = (struct rt0_hdr *) np->opt->srcrt;
+ ipv6_addr_copy(&final, &fl.fl6_dst);
ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
+ final_p = &final;
}
dst = __sk_dst_check(sk, np->dst_cookie);
@@ -1828,6 +1833,9 @@
return err;
}
+ if (final_p)
+ ipv6_addr_copy(&fl.fl6_dst, final_p);
+
if ((err = xfrm_lookup(&dst, &fl, sk, 0)) < 0) {
sk->sk_route_caps = 0;
dst_release(dst);
@@ -2124,6 +2132,7 @@
struct proto tcpv6_prot = {
.name = "TCPv6",
+ .owner = THIS_MODULE,
.close = tcp_close,
.connect = tcp_v6_connect,
.disconnect = tcp_disconnect,
* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-16 20:00 Hubert Tonneau
0 siblings, 0 replies; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-16 20:00 UTC (permalink / raw)
To: David S. Miller, Alexey Kuznetsov
Cc: shemminger, romieu, kuznet, niv, rick.jones2, netdev
David S. Miller wrote:
>
> Hubert, do you have netfilter enabled in the 2.6.10 kernel you are running?
>
> I'm asking because the TCP changes in 2.6.10 are pretty benign
> (attached for the curious who want to review along), whereas
> netfilter had major updates particularly in the TCP connection
> tracking code.
There is no netfilter on this server.
> I also reviewed 2.6.10-ac11 for anything interesting wrt. TCP and the
> only thing in there is the tcp_retrans_try_collapse() missing check
> to avoid collapsing TSO segments.
I'm using 2.6.10-ac11 for security reasons. I could use 2.6.10-as1 as well.
As far as I know, they all behave exactly the same from the TCP point of view.
The difference is definitely between stock 2.6.9 and stock 2.6.10.
If it helps, you can send me a patch reverting the TCP changes between 2.6.10
and 2.6.9, and I'll give it a spin, just to be sure that the problem is
truly related to the TCP code and not a side effect of other changes.
Anyway, here is the set of settings I'm using to build the kernel, and no
module is loaded while the test is running:
CONFIG_2GB: y
CONFIG_ACPI: y
CONFIG_ACPI_AC: m
CONFIG_ACPI_BATTERY: m
CONFIG_ACPI_BUTTON: m
CONFIG_ACPI_FAN: m
CONFIG_ACPI_PROCESSOR: y
CONFIG_ACPI_SLEEP: y
CONFIG_ACPI_THERMAL: y
CONFIG_ACPI_VIDEO: m
CONFIG_APM_RTC_IS_GMT: y
CONFIG_ATALK: m
CONFIG_AUTODETECT_RAID: y
CONFIG_AUTOFS_FS: m
CONFIG_BINFMT_ELF: y
CONFIG_BINFMT_MISC: y
CONFIG_BLK_DEV_CMD640: y
CONFIG_BLK_DEV_FD: m
CONFIG_BLK_DEV_GENERIC: y
CONFIG_BLK_DEV_IDE: y
CONFIG_BLK_DEV_IDECD: m
CONFIG_BLK_DEV_IDEDISK: y
CONFIG_BLK_DEV_IDEDMA: y
CONFIG_BLK_DEV_IDEDMA_PCI: y
CONFIG_BLK_DEV_IDEPCI: y
CONFIG_BLK_DEV_IDESCSI: m
CONFIG_BLK_DEV_LOOP: m
CONFIG_BLK_DEV_MD: y
CONFIG_BLK_DEV_NBD: m
CONFIG_BLK_DEV_PIIX: y
CONFIG_BLK_DEV_RAM: m
CONFIG_BLK_DEV_RZ1000: y
CONFIG_BLK_DEV_SD: y
CONFIG_BLK_DEV_SR: m
CONFIG_BLK_DEV_TRIRON: y
CONFIG_BSD_PROCESS_ACCT: y
CONFIG_CHR_DEV_SG: m
CONFIG_CHR_DEV_ST: m
CONFIG_CODA_FS: m
CONFIG_E1000: y
CONFIG_EXPERIMENTAL: y
CONFIG_EXT2_FS: y
CONFIG_EXT3_FS: y
CONFIG_EXT3_FS_XATTR: y
CONFIG_FAT_FS: m
CONFIG_FILTER: y
CONFIG_FUSION: y
CONFIG_FUSION_CTL: m
CONFIG_FUSION_ISENSE: m
CONFIG_FUSION_LAN: m
CONFIG_HFSPLUS_FS: m
CONFIG_HFS_FS: m
CONFIG_HIGHMEM: y
CONFIG_HIGHMEM4G: y
CONFIG_HPET_TIMER: y
CONFIG_HPFS_FS: m
CONFIG_IDE: y
CONFIG_IDEDMA_AUTO: y
CONFIG_IDEDMA_ONLYDISK: y
CONFIG_IDEDMA_PCI_AUTO: y
CONFIG_IDEPCI_SHARE_IRQ: y
CONFIG_IDE_GENERIC: y
CONFIG_INET: y
CONFIG_INPUT: y
CONFIG_INPUT_KEYBDEV: m
CONFIG_INPUT_KEYBOARD: y
CONFIG_INPUT_MOUSE: y
CONFIG_INPUT_MOUSEDEV: m
CONFIG_IP_ALIAS: y
CONFIG_IP_ROUTE_VERBOSE: y
CONFIG_IRQBALANCE: y
CONFIG_ISO9660_FS: m
CONFIG_KCORE_ELF: y
CONFIG_KEYBOARD_ATKBD: y
CONFIG_LEGACY_PTYS: y
CONFIG_LOCKD: m
CONFIG_M386: n
CONFIG_M486: n
CONFIG_M586: n
CONFIG_M686: n
CONFIG_MAC_PARTITION: y
CONFIG_MD: y
CONFIG_MD_BOOT: y
CONFIG_MD_LINEAR: y
CONFIG_MD_LVM: n
CONFIG_MD_MIRRORING: y
CONFIG_MD_RAID0: y
CONFIG_MD_RAID1: y
CONFIG_MD_RAID5: y
CONFIG_MD_STRIPED: y
CONFIG_MD_TRANSLUCENT: n
CONFIG_MODULES: y
CONFIG_MODULE_UNLOAD: y
CONFIG_MOUSE: m
CONFIG_MOUSE_PS2: y
CONFIG_MPENTIUM4: y
CONFIG_MSDOS_FS: m
CONFIG_MTRR: y
CONFIG_NET: y
CONFIG_NETDEVICES: y
CONFIG_NET_ETHERNET: y
CONFIG_NFSD: m
CONFIG_NFS_FS: m
CONFIG_NLS: y
CONFIG_NLS_CODEPAGE_437: m
CONFIG_NLS_CODEPAGE_850: m
CONFIG_NLS_ISO8859_1: m
CONFIG_NLS_UTF8: m
CONFIG_NTFS_FS: m
CONFIG_OOM_KILLER: y
CONFIG_PACKET: y
CONFIG_PARPORT: m
CONFIG_PARPORT_PC: m
CONFIG_PCI: y
CONFIG_PCI_BIOS: y
CONFIG_PCI_GOANY: y
CONFIG_PCI_OLD_PROC: y
CONFIG_PCI_QUIRKS: y
CONFIG_PIIX_TUNING: y
CONFIG_PM: y
CONFIG_PPP: m
CONFIG_PPPOE: m
CONFIG_PPP_ASYNC: m
CONFIG_PPP_BSDCOMP: m
CONFIG_PPP_DEFLATE: m
CONFIG_PPP_FILTER: y
CONFIG_PPP_SYNC_TTY: m
CONFIG_PREEMPT: y
CONFIG_PRINTER: m
CONFIG_PRINTER_READBACK: y
CONFIG_PROC_FS: y
CONFIG_PSMOUSE: y
CONFIG_QNX4FS_FS: m
CONFIG_REGPARM: y
CONFIG_RTC: y
CONFIG_SCSI: y
CONFIG_SCSI_PROC_FS: y
CONFIG_SERIAL: m
CONFIG_SERIAL_8250: m
CONFIG_SHAPER: m
CONFIG_SLIP: m
CONFIG_SMB_FS: m
CONFIG_SMP: y
CONFIG_SOUND: m
CONFIG_SUNRPC: m
CONFIG_SYSCTL: y
CONFIG_SYSVIPC: y
CONFIG_UFS_FS: m
CONFIG_UMSDOS_FS: m
CONFIG_UNIX: y
CONFIG_USB: m
CONFIG_USB_ACM: m
CONFIG_USB_AUDIO: m
CONFIG_USB_CDCETHER: m
CONFIG_USB_DEVICEFS: y
CONFIG_USB_EHCI_HCD: m
CONFIG_USB_HID: m
CONFIG_USB_HIDINPUT: y
CONFIG_USB_KBD: m
CONFIG_USB_MOUSE: m
CONFIG_USB_OHCI: m
CONFIG_USB_OHCI_HCD: m
CONFIG_USB_PRINTER: m
CONFIG_USB_SERIAL: m
CONFIG_USB_STORAGE: m
CONFIG_USB_UHCI: m
CONFIG_USB_UHCI_ALT: m
CONFIG_USB_UHCI_HCD: m
CONFIG_VFAT_FS: m
CONFIG_VGA_CONSOLE: y
CONFIG_VT: y
CONFIG_VT_CONSOLE: y
CONFIG_X86_MCE: y
CONFIG_X86_UP_APIC: y
CONFIG_X86_UP_IOAPIC: y
While we are at it, here are the hardware components of the box:
8086 Intel Corporation 254C E7501 0 Host Controller
8086 Intel Corporation 2543 E7500/E7501 0 HI_B Virtual PCI-to-PCI Bridge
8086 Intel Corporation 2545 E7500/E7501 0 HI_C Virtual PCI-to-PCI Bridge
8086 Intel Corporation 2547 E7500/E7501 0 HI_D Virtual PCI-to-PCI Bridge
8086 Intel Corporation 2482 82801CA/CAM 10 USB Controller
8086 Intel Corporation 244E 82801BA/CA/DB, 6300ESB 0 Hub Interface to PCI Bridge
8086 Intel Corporation 2480 82801CA 0 LPC Interface Bridge
8086 Intel Corporation 248B 82801CA 0 UltraATA/100 IDE Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1026 82545GM 18 Gigabit Ethernet Controller
8086 Intel Corporation 100D 82544GC 1C Gigabit Ethernet Controller (LOM)
8086 Intel Corporation 0309 80303 0 I/O Processor PCI-to-PCI Bridge Unit
1000 LSI Logic 0030 LSI53C1020/1030 78 PCI-X to Ultra320 SCSI Controller
1000 LSI Logic 0030 LSI53C1020/1030 79 PCI-X to Ultra320 SCSI Controller
1002 ATI Technologies 4752 Rage XL PCI 0
And the interrupts (while running 2.6.9):
CPU0 CPU1
0: 159132374 132686719 IO-APIC-edge timer
1: 9 0 IO-APIC-edge i8042
8: 0 0 IO-APIC-edge rtc
9: 0 0 IO-APIC-level acpi
14: 1 0 IO-APIC-edge ide0
24: 22225220 0 IO-APIC-level eth0
28: 4 134406507 IO-APIC-level eth1
120: 532730 578109 IO-APIC-level ioc0
121: 1931739 1327672 IO-APIC-level ioc1
NMI: 0 0
LOC: 291863458 291863528
ERR: 0
MIS: 0
/proc/net/dev
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eth0:2512143307 20914618 0 0 0 0 0 0 1951489031 52933097 0 0 0 0 0 0
eth1:943883086 75451745 0 0 0 0 0 0 201914508 171409895 0 0 0 0 0 0
lo:2247204588 748445 0 0 0 0 0 0 2247204588 748445 0 0 0 0 0 0
/proc/net/route
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 207C29D5 00000000 0001 0 0 0 F0FFFFFF 0 0 0
eth1 00606B0A 00000000 0001 0 0 0 00FFFFFF 0 0 0
eth0 00000000 217C29D5 0003 0 0 0 00000000 0 0 0
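The Destination, Gateway and Mask columns above are raw hexadecimal dumps of 32-bit addresses. The following is a minimal sketch of how they can be decoded into dotted-quad form. It assumes the dump was taken on a little-endian host (this box is x86), so the least significant byte of the printed value is the first octet; the function name route_hex_to_dotted is made up for this illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Decode one 8-digit hex field from /proc/net/route into dotted-quad text.
 * Assumption: little-endian host, so the low byte of the printed value is
 * the first octet of the address. out must hold at least 16 bytes. */
void route_hex_to_dotted(const char *hex, char *out, size_t outlen)
{
    uint32_t v = (uint32_t)strtoul(hex, NULL, 16);

    snprintf(out, outlen, "%u.%u.%u.%u",
             (unsigned)(v & 0xff),
             (unsigned)((v >> 8) & 0xff),
             (unsigned)((v >> 16) & 0xff),
             (unsigned)((v >> 24) & 0xff));
}
```

Under that assumption, the first eth0 entry (207C29D5 with mask F0FFFFFF) decodes to 213.41.124.32 with netmask 255.255.255.240, the eth1 entry (00606B0A with mask 00FFFFFF) to 10.107.96.0 with netmask 255.255.255.0, and the default gateway 217C29D5 to 213.41.124.33.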
* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-20 23:06 Hubert Tonneau
0 siblings, 0 replies; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-20 23:06 UTC (permalink / raw)
To: David S. Miller, Alexey Kuznetsov, Nivedita Singhvi
Cc: Stephen Hemminger, romieu, kuznet, niv, rick.jones2, netdev
I've noticed something very interesting:
when sending to a gigabit-connected Mac OSX machine instead of a 100 Mbps-connected one,
there is no drastic slowdown when switching from Linux 2.6.9 to 2.6.10.
> Any chance you could
> send me just the following from your boxes:
> (Before and after the transfer)
>
> - /proc/net/snmp
> - /proc/net/netstat
Here is the requested extra information:
2.6.10-ac10 before:
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 47336 0 0 0 0 0 47197 127721 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 2 0 0 0 0 0 0 0 2 0 0 0 0 417 0 417 0 0 0 0 0 0 0 0 0 0
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 40 209 0 2 7 46158 126953 156 0 243
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 332 417 0 336
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLoss TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnSyn TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures
TcpExt: 0 0 0 0 0 0 0 0 0 0 94 0 0 0 0 0 452 0 0 0 0 9499 215 241030 0 7583 377 16696 3330 123 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 123 0 0 7 0 0 0 0 0 0 0 0 0 90 0 0 2 0 0 0
2.6.10-ac10 after sending to the 100 Mbps connected Mac OSX:
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 70100 0 0 0 0 0 69901 214176 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 2 0 0 0 0 0 0 0 2 0 0 0 0 421 0 421 0 0 0 0 0 0 0 0 0 0
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 49 263 0 2 9 68728 213354 284 0 315
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 382 421 0 386
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLoss TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnSyn TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures
TcpExt: 0 0 0 0 0 0 0 0 0 0 105 0 0 0 0 0 804 0 0 0 0 12808 215 310763 0 11460 472 26236 5086 247 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 247 0 0 11 0 0 0 0 0 0 0 0 0 123 0 0 2 0 0 0
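The counter dumps above follow the layout of /proc/net/snmp and /proc/net/netstat: each table is a line of counter names followed by a line of values under the same prefix. Reading the values by position is error-prone, so here is a minimal sketch that pairs names with values. The function name parse_counters is hypothetical (not from any tool in this thread), and the sample is copied from the Tcp and Udp lines above.

```python
def parse_counters(text):
    """Pair each 'Proto: name ...' line with the 'Proto: value ...' line that follows it.

    Hypothetical helper for illustration; assumes the name/value line pairing
    used by /proc/net/snmp and /proc/net/netstat.
    """
    tables = {}
    pending = {}  # prefix -> counter names still waiting for their value line
    for line in text.splitlines():
        prefix, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue  # not a counter line (e.g. a caption ending in ':')
        prefix = prefix.strip()
        try:
            values = [int(f) for f in fields]  # values may be negative (MaxConn is -1)
        except ValueError:
            pending[prefix] = fields  # non-numeric fields: this is a header line
            continue
        names = pending.pop(prefix, None)
        if names is not None and len(names) == len(values):
            tables[prefix] = dict(zip(names, values))
    return tables


# Sample copied from the 2.6.10-ac10 dump above.
SAMPLE = """\
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 49 263 0 2 9 68728 213354 284 0 315
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 382 421 0 386
"""

tables = parse_counters(SAMPLE)
tcp = tables["Tcp"]
# Share of sent segments that were retransmissions since boot.
print("retransmitted: %.3f%% of OutSegs" % (100.0 * tcp["RetransSegs"] / tcp["OutSegs"]))
```

These counters are cumulative since boot, so comparing two dumps taken before and after a transfer (subtracting per counter) says more about that transfer than either dump alone.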
Thread overview: 40+ messages
2005-02-10 21:53 2.6.10 TCP troubles -- suggested patch Hubert Tonneau
2005-02-10 22:36 ` Rick Jones
2005-02-11 1:16 ` David S. Miller
-- strict thread matches above, loose matches on Subject: below --
2005-02-20 23:06 Hubert Tonneau
2005-02-16 20:00 Hubert Tonneau
2005-02-13 10:52 Hubert Tonneau
2005-02-14 14:12 ` Alexey Kuznetsov
2005-02-11 21:55 Hubert Tonneau
2005-02-11 22:54 ` Rick Jones
2005-02-11 23:09 ` Nivedita Singhvi
2005-02-11 23:40 ` Rick Jones
2005-02-12 1:08 ` David S. Miller
2005-02-12 1:09 ` David S. Miller
2005-02-12 14:31 ` Alexey Kuznetsov
2005-02-12 19:28 ` David S. Miller
2005-02-12 19:44 ` Leonid Grossman
2005-02-12 19:52 ` Alexey Kuznetsov
2005-02-15 23:25 ` David S. Miller
2005-02-12 20:19 ` rick jones
2005-02-12 20:28 ` David S. Miller
2005-02-12 20:56 ` Alexey Kuznetsov
2005-02-12 21:27 ` Nivedita Singhvi
2005-02-12 21:43 ` rick jones
2005-02-12 22:00 ` Alexey Kuznetsov
2005-02-13 1:29 ` rick jones
2005-02-11 23:04 ` Stephen Hemminger
2005-02-12 1:07 ` David S. Miller
2005-02-12 12:11 ` Andi Kleen
2005-02-12 19:23 ` David S. Miller
2005-02-12 21:30 ` Andi Kleen
2005-02-12 14:16 ` Alexey Kuznetsov
2005-02-12 19:41 ` David S. Miller
2005-02-12 20:03 ` Alexey Kuznetsov
2005-02-15 23:26 ` David S. Miller
2005-02-15 23:42 ` Rick Jones
2005-02-15 23:23 ` David S. Miller
2005-02-16 9:13 ` Alexey Kuznetsov
2005-02-16 17:50 ` David S. Miller
[not found] <050QTJA12@server5.heliogroup.fr>
2005-02-09 18:59 ` Stephen Hemminger
2005-02-09 20:25 ` David S. Miller