* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-11 21:55 Hubert Tonneau
2005-02-11 22:54 ` Rick Jones
2005-02-11 23:04 ` Stephen Hemminger
0 siblings, 2 replies; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-11 21:55 UTC (permalink / raw)
To: David S. Miller
Cc: shemminger, romieu, kuznet, Nivedita Singhvi, Rick Jones, netdev
Sorry, it still does not work, unless I made a mistake:
Linux 2.6.9 takes 15 seconds to copy 105 MB to Mac OSX
Linux 2.6.10 with the TCP patch below still takes 325 seconds to do the same.
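For scale, the two timings above can be turned into average throughput. This is a minimal sketch in C, assuming 1 MB means 10^6 bytes (the message does not say which convention is meant); `throughput_mbps` and `print_throughput_report` are hypothetical helper names, not anything from the thread.

```c
#include <stdio.h>

/* Average throughput in Mbit/s for `megabytes` MB moved in `seconds`.
 * Assumption: 1 MB = 10^6 bytes. */
double throughput_mbps(double megabytes, double seconds)
{
	return megabytes * 8.0 / seconds;
}

/* Print the figures for the two timings reported in the message. */
void print_throughput_report(void)
{
	printf("2.6.9:  %.1f Mbit/s\n", throughput_mbps(105.0, 15.0));
	printf("2.6.10: %.1f Mbit/s\n", throughput_mbps(105.0, 325.0));
}
```

Under that assumption the 2.6.9 run averages about 56 Mbit/s and the 2.6.10 run about 2.6 Mbit/s, on a path whose slowest hop is the 100 Mbps switch described below.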
You can pick the new tcpdump report, created through:
tcpdump -i eth1 ip host 10.107.96.230 -w /tmp/dump-2.6.10-tcp2
at http://fullpliant.org/pliant/browse/file/archive/dump-2.6.10-tcp2.gz
Here is the connection summary:
Dell PowerEdge 2600 (dual Xeon with hyper threading) running libsmbclient
on Linux 2.6.x, IP for eth1 (Intel pro 1000) is 10.107.96.7 (full
duplex, flow control is enabled)
|
|
gigabit switch
|
|
100 Mbps switch
|
|
Mac running Samba server on OSX,
IP is 10.107.96.230
David S. Miller wrote:
>
> Hubert, try this patch instead.
>
> ===== net/ipv4/tcp_output.c 1.77 vs edited =====
> --- 1.77/net/ipv4/tcp_output.c 2005-01-18 12:23:36 -08:00
> +++ edited/net/ipv4/tcp_output.c 2005-02-10 16:42:42 -08:00
> @@ -408,6 +408,16 @@
> sk->sk_send_head = skb;
> }
>
> +static inline void tcp_tso_set_push(struct sk_buff *skb)
> +{
> + /* Force push to be on for any TSO frames to workaround
> + * problems with busted implementations like Mac OS-X that
> + * hold off socket receive wakeups until push is seen.
> + */
> + if (tcp_skb_pcount(skb) > 1)
> + TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
> +}
> +
> /* Send _single_ skb sitting at the send head. This function requires
> * true push pending frames to setup probe timer etc.
> */
> @@ -419,6 +429,7 @@
> if (tcp_snd_test(tp, skb, cur_mss, TCP_NAGLE_PUSH)) {
> /* Send it out now. */
> TCP_SKB_CB(skb)->when = tcp_time_stamp;
> + tcp_tso_set_push(skb);
> if (!tcp_transmit_skb(sk, skb_clone(skb, sk->sk_allocation))) {
> sk->sk_send_head = NULL;
> tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
> @@ -755,6 +766,7 @@
> }
>
> TCP_SKB_CB(skb)->when = tcp_time_stamp;
> + tcp_tso_set_push(skb);
> if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
> break;
>
> @@ -1096,6 +1108,7 @@
> * is still in somebody's hands, else make a clone.
> */
> TCP_SKB_CB(skb)->when = tcp_time_stamp;
> + tcp_tso_set_push(skb);
>
> err = tcp_transmit_skb(sk, (skb_cloned(skb) ?
> pskb_copy(skb, GFP_ATOMIC):
> @@ -1668,6 +1681,7 @@
>
> TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
> TCP_SKB_CB(skb)->when = tcp_time_stamp;
> + tcp_tso_set_push(skb);
> err = tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC));
> if (!err) {
> update_send_head(sk, tp, skb);
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 21:55 2.6.10 TCP troubles -- suggested patch Hubert Tonneau
@ 2005-02-11 22:54 ` Rick Jones
2005-02-11 23:09 ` Nivedita Singhvi
2005-02-12 1:09 ` David S. Miller
2005-02-11 23:04 ` Stephen Hemminger
1 sibling, 2 replies; 40+ messages in thread
From: Rick Jones @ 2005-02-11 22:54 UTC (permalink / raw)
To: Hubert Tonneau; +Cc: David S. Miller, shemminger, romieu, kuznet, netdev

Hubert Tonneau wrote:
> Sorry, it still does not work, unless I made a mistake:
> Linux 2.6.9 takes 15 seconds to copy 105 MB to Mac OSX
> Linux 2.6.10 with the TCP patch below still takes 325 seconds to do the same.
>
> You can pick the new tcpdump report, created through:
> tcpdump -i eth1 ip host 10.107.96.230 -w /tmp/dump-2.6.10-tcp2
> at http://fullpliant.org/pliant/browse/file/archive/dump-2.6.10-tcp2.gz
>
> Here is the connection summary:
>
> Dell PowerEdge 2600 (dual Xeon with hyper threading) running libsmbclient
> on Linux 2.6.x, IP for eth1 (Intel pro 1000) is 10.107.96.7 (full
> duplex, flow control is enabled)
> |
> |
> gigabit switch
> |
> |
> 100 Mbps switch
> |
> |
> Mac running Samba server on OSX,
> IP is 10.107.96.230

"Cooking" the trace with tcpdump -ttt to give the relative timestamps makes
things look like Mac OSX has an ACK avoidance heuristic in it? I figured there
was one in their OS <= 9 stack that came from a third-party, wasn't sure if
they put that into their OSX stack - IIRC that one is not from the
third-party. FWIW, there are two or three other stacks that have ACK avoidance
heuristics as well, it isn't an OSX only thing.

000780 10.107.96.230.139 > 10.107.96.7.32801: P 753:822(69) ack 1556 win 65535 <nop,nop,timestamp 1709240657 534173> NBT Packet (DF)
000579 10.107.96.7.32801 > 10.107.96.230.139: . 1556:3004(1448) ack 822 win 1460 <nop,nop,timestamp 534175 1709240657> NBT Packet (DF)
000027 10.107.96.7.32801 > 10.107.96.230.139: . 3004:4452(1448) ack 822 win 1460 <nop,nop,timestamp 534175 1709240657> NBT Packet (DF)
000005 10.107.96.7.32801 > 10.107.96.230.139: . 4452:5900(1448) ack 822 win 1460 <nop,nop,timestamp 534175 1709240657> NBT Packet (DF)
074685 10.107.96.230.139 > 10.107.96.7.32801: . ack 5900 win 62268 <nop,nop,timestamp 1709240657 534175> (DF)

delack above

000012 10.107.96.7.32801 > 10.107.96.230.139: . 5900:7348(1448) ack 822 win 1460 <nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
000003 10.107.96.7.32801 > 10.107.96.230.139: . 7348:8796(1448) ack 822 win 1460 <nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
000002 10.107.96.7.32801 > 10.107.96.230.139: . 8796:10244(1448) ack 822 win 1460 <nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
000002 10.107.96.7.32801 > 10.107.96.230.139: . 10244:11692(1448) ack 822 win 1460 <nop,nop,timestamp 534249 1709240657> NBT Packet (DF)
200024 10.107.96.230.139 > 10.107.96.7.32801: . ack 11692 win 56476 <nop,nop,timestamp 1709240658 534249> (DF)

and again above.

000010 10.107.96.7.32801 > 10.107.96.230.139: . 11692:13140(1448) ack 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000004 10.107.96.7.32801 > 10.107.96.230.139: . 13140:14588(1448) ack 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000002 10.107.96.7.32801 > 10.107.96.230.139: P 14588:16036(1448) ack 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000022 10.107.96.7.32801 > 10.107.96.230.139: . 16036:17484(1448) ack 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000004 10.107.96.7.32801 > 10.107.96.230.139: P 17484:18192(708) ack 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
000994 10.107.96.230.139 > 10.107.96.7.32801: . ack 18192 win 65535 <nop,nop,timestamp 1709240658 534449> (DF)
0

And then other cases where the ACK seems to take a rather long time to
arrive, seems to correlate a bit with slowly increasing numbers of
segments before the ACK is sent, and something along the lines of a 200
millisecond delayed ACK timer.

In some cases at least if the sender does not completely fill cwnd the
ACKs will be delayed. And IIRC under 2.6.10 with TSO enabled, the
sender does not always fill cwnd.

hth,

rick jones
^ permalink raw reply [flat|nested] 40+ messages in thread
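The stalls in a "cooked" trace like the one above can be picked out mechanically. The following is a minimal sketch in C, assuming the leading six-digit field on each trace line is the gap since the previous packet in microseconds (which matches the roughly 200 ms delayed-ACK reading given in the message); the function names and any threshold passed in are illustrative choices, not part of tcpdump or of the thread.

```c
#include <stdlib.h>

/* Parse the leading inter-packet gap (assumed to be microseconds) from
 * one trace line, e.g. "200024 10.107.96.230.139 > ...".
 * Returns -1 if the line does not start with a non-negative number. */
long trace_delta_usec(const char *line)
{
	char *end;
	long usec = strtol(line, &end, 10);

	if (end == line || usec < 0)
		return -1;
	return usec;
}

/* Nonzero if the gap before this packet is at least threshold_usec. */
int is_long_gap(const char *line, long threshold_usec)
{
	long usec = trace_delta_usec(line);

	return usec >= 0 && usec >= threshold_usec;
}
```

With a threshold of, say, 50000 microseconds, only the two bare ACKs annotated in the message (gaps of 074685 and 200024) would be flagged; the annotation lines themselves parse as -1 and are skipped.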
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 22:54 ` Rick Jones
@ 2005-02-11 23:09 ` Nivedita Singhvi
2005-02-11 23:40 ` Rick Jones
2005-02-12 1:08 ` David S. Miller
2005-02-12 1:09 ` David S. Miller
1 sibling, 2 replies; 40+ messages in thread
From: Nivedita Singhvi @ 2005-02-11 23:09 UTC (permalink / raw)
To: Rick Jones
Cc: Hubert Tonneau, David S. Miller, shemminger, romieu, kuznet, netdev

Rick Jones wrote:
> 000010 10.107.96.7.32801 > 10.107.96.230.139: . 11692:13140(1448) ack
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000004 10.107.96.7.32801 > 10.107.96.230.139: . 13140:14588(1448) ack
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000002 10.107.96.7.32801 > 10.107.96.230.139: P 14588:16036(1448) ack
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000022 10.107.96.7.32801 > 10.107.96.230.139: . 16036:17484(1448) ack
> 822 win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000004 10.107.96.7.32801 > 10.107.96.230.139: P 17484:18192(708) ack 822
> win 1460 <nop,nop,timestamp 534449 1709240658> NBT Packet (DF)
> 000994 10.107.96.230.139 > 10.107.96.7.32801: . ack 18192 win 65535
> <nop,nop,timestamp 1709240658 534449> (DF)
> 0
>
> And then other cases where the ACK seems to take a rather long time to
> arrive, seems to correlate a bit with slowly increasing numbers of
> segments before the ACK is sent, and something along the lines of a 200
> millisecond delayed ACK timer.
>
> In some cases at least if the sender does not completely fill cwnd the
> ACKs will be delayed. And IIRC under 2.6.10 with TSO enabled, the
> sender does not always fill cwnd.

Er, how is this compliant with RFC 2581 (yes, I know, it's only
a SHOULD, not a MUST) - an ACK should be generated for at
least every second full-sized segment received?

Don't see that happening. In many cases it's receiving
quite a few more packets. It should not be waiting for
the delayed ack timer to go off, surely?

thanks,
Nivedita
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 23:09 ` Nivedita Singhvi
@ 2005-02-11 23:40 ` Rick Jones
2005-02-12 1:08 ` David S. Miller
1 sibling, 0 replies; 40+ messages in thread
From: Rick Jones @ 2005-02-11 23:40 UTC (permalink / raw)
To: netdev; +Cc: Hubert Tonneau, shemminger, romieu, kuznet

> Er, how is this compliant with 2581 (yes, I know, it's only a SHOULD, not a
> MUST) - an ACK should be generated for at least every second full-sized
> segment received? Don't see that happening. In many cases it's receiving
> quite a few more packets. It should not be waiting for the delayed ack timer
> to go off, surely?

Certainly it would make for an interesting discussion. Indeed it is a SHOULD,
which leaves open the door to compliance of other ACK policies. Those might
result in an ACK for more than two segments, or even an ACK for fewer than two
segments, and there are folks in either camp/faction/sect/pick your favorite
term.

I would say that it is still compliant with 2581. The must there is that no
matter what, an ACK must be generated within 500 milliseconds.

I suspect that had a full cwnd's worth of data been sent there would have been
no lengthy delay in ACKs even with fewer than ACK-every-other. I suspect that
had TSO been disabled the full cwnd would have been sent and these delayed
ACKs would not have happened and the transfer speed would have been happiness
and joy.

FWIW, as the industry has added features such as CKO (ChecKsum Offload),
copy-avoidance, and now TSO, the pie chart of time spent has been shifting
more and more to ACK processing. If we go back far enough, the writeups talk
about how delayed ACK to increase ACK piggybacking was added in the first
place - specifically (IIRC) for the purpose of minimizing ACK overhead.

rick jones

BTW, I'd be happy to trim emails that are already on netdev to avoid message
duplications, is netdev a "closed" list?
^ permalink raw reply [flat|nested] 40+ messages in thread
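The SHOULD/MUST distinction being argued here can be written down as two separate predicates. This is a minimal sketch in C of the RFC 2581 delayed-ACK wording (an ACK SHOULD be generated for at least every second full-sized segment, and MUST be generated within 500 ms of the first unacknowledged packet); the struct and function names are hypothetical, and the model only counts full-sized segments, which is a simplification.

```c
/* Receiver-side bookkeeping for one connection (hypothetical model). */
struct ack_state {
	unsigned int full_segs_unacked; /* full-sized segments not yet ACKed */
	unsigned int msec_since_first;  /* ms since the first unACKed segment */
};

/* SHOULD: ACK at least every second full-sized segment. */
int ack_should_send(const struct ack_state *s)
{
	return s->full_segs_unacked >= 2;
}

/* MUST: do not hold an ACK for 500 ms or longer.
 * Simplification: only full-sized segments are tracked here. */
int ack_must_send(const struct ack_state *s)
{
	return s->full_segs_unacked > 0 && s->msec_since_first >= 500;
}
```

For the stall in the trace, four full-sized segments waiting about 200 ms, `ack_should_send` is true while `ack_must_send` is false, which is the sense in which the behavior can be called compliant while still ignoring the SHOULD.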
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 23:09 ` Nivedita Singhvi
2005-02-11 23:40 ` Rick Jones
@ 2005-02-12 1:08 ` David S. Miller
1 sibling, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12 1:08 UTC (permalink / raw)
To: Nivedita Singhvi
Cc: rick.jones2, hubert.tonneau, shemminger, romieu, kuznet, netdev

On Fri, 11 Feb 2005 15:09:11 -0800
Nivedita Singhvi <niv@us.ibm.com> wrote:

> Er, how is this compliant with 2581 (yes, I know, it's only
> a SHOULD, not a MUST) - an ACK should be generated for at
> least every second full-sized segment received?

It's compliant but stupid.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 22:54 ` Rick Jones
2005-02-11 23:09 ` Nivedita Singhvi
@ 2005-02-12 1:09 ` David S. Miller
2005-02-12 14:31 ` Alexey Kuznetsov
1 sibling, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-12 1:09 UTC (permalink / raw)
To: Rick Jones; +Cc: hubert.tonneau, shemminger, romieu, kuznet, netdev

On Fri, 11 Feb 2005 14:54:27 -0800
Rick Jones <rick.jones2@hp.com> wrote:

> In some cases at least if the sender does not completely fill cwnd the
> ACKs will be delayed. And IIRC under 2.6.10 with TSO enabled, the
> sender does not always fill cwnd.

At a maximum, "1/tcp_tso_win_divisor" of the cwnd will ever be left
empty.

By default, this is 1/8 of the cwnd.
^ permalink raw reply [flat|nested] 40+ messages in thread
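The bound stated above is easy to put in numbers. The following is a minimal sketch in C that simply evaluates cwnd / tcp_tso_win_divisor in whole segments; it is an arithmetic illustration of the sentence, not the kernel's code, and both the function name and the handling of a zero divisor are assumptions made for this example.

```c
/* Upper bound, in segments, on the part of cwnd that may stay unused,
 * per the statement that at most 1/tcp_tso_win_divisor is left empty.
 * Assumption: a divisor of 0 is treated as "nothing left unused" purely
 * to avoid dividing by zero in this model. */
unsigned int tso_max_unused_segs(unsigned int cwnd_segs,
				 unsigned int win_divisor)
{
	if (win_divisor == 0)
		return 0;
	return cwnd_segs / win_divisor;
}
```

With the default divisor of 8 mentioned in the message, a 40-segment cwnd could leave up to 5 segments unsent, and a cwnd below 8 segments leaves none in this integer model.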
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 1:09 ` David S. Miller
@ 2005-02-12 14:31 ` Alexey Kuznetsov
2005-02-12 19:28 ` David S. Miller
2005-02-12 20:19 ` rick jones
0 siblings, 2 replies; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 14:31 UTC (permalink / raw)
To: David S. Miller
Cc: Rick Jones, hubert.tonneau, shemminger, romieu, kuznet, netdev

Hello!

> > In some cases at least if the sender does not completely fill cwnd the
> > ACKs will be delayed. And IIRC under 2.6.10 with TSO enabled, the
> > sender does not always fill cwnd.
>
> At a maximum, "1/tcp_tso_win_divisor" of the cwnd will ever be left
> empty.
>
> By default, this is 1/8 of the cwnd.

In any case, receiver cannot know sender cwnd, so that "fill" or "not fill"
is not a question.

What is broken in that implementation is that it does not feel slow start.
ACK avoidance during slow start is certain disaster. Current theory is that
MacOS X thinks that we do not do slow start.

Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 14:31 ` Alexey Kuznetsov
@ 2005-02-12 19:28 ` David S. Miller
2005-02-12 19:44 ` Leonid Grossman
2005-02-12 19:52 ` Alexey Kuznetsov
2005-02-12 20:19 ` rick jones
1 sibling, 2 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12 19:28 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: rick.jones2, hubert.tonneau, shemminger, romieu, kuznet, netdev

On Sat, 12 Feb 2005 17:31:05 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:

> In any case, receiver cannot know sender cwnd, so that "fill" or "not fill"
> is not a question.
>
> What is broken in that implementation is that it does not feel slow start.
> ACK avoidance while slow start is certain disaster. Current theory is that
> MacOS X thinks that we do not do slow start.

It is correct. Although, I am still believing that setting PSH
is the avenue of investigation.
^ permalink raw reply [flat|nested] 40+ messages in thread
* RE: 2.6.10 TCP troubles -- suggested patch
2005-02-12 19:28 ` David S. Miller
@ 2005-02-12 19:44 ` Leonid Grossman
2005-02-12 19:52 ` Alexey Kuznetsov
1 sibling, 0 replies; 40+ messages in thread
From: Leonid Grossman @ 2005-02-12 19:44 UTC (permalink / raw)
To: 'David S. Miller', 'Alexey Kuznetsov'
Cc: rick.jones2, hubert.tonneau, shemminger, romieu, kuznet, netdev

Typically, a TSO engine sets PSH in the last packet that it builds for the
TSO+PSH request.

Leonid

> -----Original Message-----
> From: netdev-bounce@oss.sgi.com
> [mailto:netdev-bounce@oss.sgi.com] On Behalf Of David S. Miller
> Sent: Saturday, February 12, 2005 11:28 AM
> To: Alexey Kuznetsov
> Cc: rick.jones2@hp.com; hubert.tonneau@fullpliant.org;
> shemminger@osdl.org; romieu@fr.zoreil.com;
> kuznet@ms2.inr.ac.ru; netdev@oss.sgi.com
> Subject: Re: 2.6.10 TCP troubles -- suggested patch
>
> On Sat, 12 Feb 2005 17:31:05 +0300
> Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:
>
> > In any case, receiver cannot know sender cwnd, so that "fill" or "not fill"
> > is not a question.
> >
> > What is broken in that implementation is that it does not feel slow start.
> > ACK avoidance while slow start is certain disaster. Current theory is
> > that MacOS X thinks that we do not do slow start.
>
> It is correct. Although, I am still believing that setting
> PSH is the avenue of investigation.
>
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 19:28 ` David S. Miller
2005-02-12 19:44 ` Leonid Grossman
@ 2005-02-12 19:52 ` Alexey Kuznetsov
2005-02-15 23:25 ` David S. Miller
1 sibling, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 19:52 UTC (permalink / raw)
To: David S. Miller
Cc: Alexey Kuznetsov, rick.jones2, hubert.tonneau, shemminger, romieu, netdev

Hello!

> It is correct. Although, I am still believing that setting PSH
> is the avenue of investigation.

Exactly. That's why the next test should be with disabled TSO in 2.6.9.
If too rare PSHs were a problem, it will show as the same disaster there.

[ And, to be honest, in this case, I daresay MacOS X may be left with its bugs
alone. Or we could help it with something like setting PSH when we are in slow
start and each half of CWND after completion of slow start. Or just set
PSH on each frame. ]

Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 19:52 ` Alexey Kuznetsov
@ 2005-02-15 23:25 ` David S. Miller
0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-15 23:25 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: kuznet, rick.jones2, hubert.tonneau, shemminger, romieu, netdev

On Sat, 12 Feb 2005 22:52:46 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:

> Exactly. That's why the next test should be with disabled TSO in 2.6.9.
> If too rare PSHs were a problem, it will show as the same disaster there.
>
> [ And, to be honest, in this case, I daresay MacOS X may be left with its bugs
> alone. Or we could help it with something like setting PSH when we are in slow
> start and each half of CWND after completion of slow start. Or just set
> PSH on each frame. ]

Setting it every other frame would fix the problem, just forcing
it to miss header prediction path is what is needed to avoid the
silly delayed ACK behavior. And PSH is one way to do that.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 14:31 ` Alexey Kuznetsov
2005-02-12 19:28 ` David S. Miller
@ 2005-02-12 20:19 ` rick jones
2005-02-12 20:28 ` David S. Miller
2005-02-12 20:56 ` Alexey Kuznetsov
1 sibling, 2 replies; 40+ messages in thread
From: rick jones @ 2005-02-12 20:19 UTC (permalink / raw)
To: Alexey Kuznetsov; +Cc: netdev, romieu, hubert.tonneau, shemminger

On Feb 12, 2005, at 6:31 AM, Alexey Kuznetsov wrote:

> Hello!
>
>>> In some cases at least if the sender does not completely fill cwnd the
>>> ACKs will be delayed. And IIRC under 2.6.10 with TSO enabled, the
>>> sender does not always fill cwnd.
>>
>> At a maximum, "1/tcp_tso_win_divisor" of the cwnd will ever be left
>> empty.
>>
>> By default, this is 1/8 of the cwnd.
>
> In any case, receiver cannot know sender cwnd, so that "fill" or "not fill"
> is not a question.

How is that? Isn't cwnd based on the ACKs the sender receives from the
receiver?

> What is broken in that implementation is that it does not feel slow start.
> ACK avoidance while slow start is certain disaster. Current theory is that
> MacOS X thinks that we do not do slow start.

Actually, it may think slow start is being done - there was enough
small packet back and forth on the connection before the "heavy
transfer" to get cwnd opened - I just didn't quote that in the "cooked"
output. All the stacks with ACK avoidance with which I am familiar do
not make the assumption that the sender is not doing slow-start. They
make sure to send enough ACKs at the beginning (or after packet loss)
to allow the sender's cwnd to grow.

rick jones
wisdom teeth are impacted, people are affected by the effects of events
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 20:19 ` rick jones
@ 2005-02-12 20:28 ` David S. Miller
2005-02-12 20:56 ` Alexey Kuznetsov
1 sibling, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12 20:28 UTC (permalink / raw)
To: rick jones; +Cc: kuznet, netdev, romieu, hubert.tonneau, shemminger

On Sat, 12 Feb 2005 12:19:35 -0800
rick jones <rick.jones2@hp.com> wrote:

> How is that? Isn't cwnd based on the ACKs the sender receives from the
> receiver?

ACKs go from sender to receiver, first of all.

It is based upon congestion as seen "by receiver", something which
is impossible for sender.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 20:19 ` rick jones
2005-02-12 20:28 ` David S. Miller
@ 2005-02-12 20:56 ` Alexey Kuznetsov
2005-02-12 21:27 ` Nivedita Singhvi
2005-02-12 21:43 ` rick jones
1 sibling, 2 replies; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 20:56 UTC (permalink / raw)
To: rick jones; +Cc: Alexey Kuznetsov, netdev, romieu, hubert.tonneau, shemminger

Hello!

> Actually, it may think slow start is being done - there was enough
> small packet back and forth on the connection before the "heavy
> transfer" to get cwnd opened

If receiver sent an ACK it still does not mean that sender used it
to increase its cwnd. Particularly, small packet exchange definitely
does not inflate cwnd.

> output. All the stacks with ACK avoidance with which I am familiar do
> not make the assumption that the sender is not doing slow-start. They
> make sure to send enough ACKs at the beginning (or after packet loss)
> to allow the sender's cwnd to grow.

Well, we do similar thing with delayed ACKs. And it took a few of runs
of testing to understand that we cannot detect even packet loss reliably
enough. :-)

Actually, those receivers could use the first delayed ACK event as
a sign of failure of their heuristics and block stretching acks for
this connection.

Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 20:56 ` Alexey Kuznetsov
@ 2005-02-12 21:27 ` Nivedita Singhvi
2005-02-12 21:43 ` rick jones
1 sibling, 0 replies; 40+ messages in thread
From: Nivedita Singhvi @ 2005-02-12 21:27 UTC (permalink / raw)
To: Alexey Kuznetsov; +Cc: rick jones, netdev, romieu, hubert.tonneau, shemminger

Alexey Kuznetsov wrote:

> If receiver sent an ACK it still does not mean that sender used it
> to increase its cwnd. Particularly, small packet exchange definitely
> does not inflate cwnd.

Simplest way to go about this is simply compare it to
the trace of the "good/fast" connection - Hubert, could
you provide the "good" trace as well? That would show
where the differences in time are taken up..

thanks,
Nivedita
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 20:56 ` Alexey Kuznetsov
2005-02-12 21:27 ` Nivedita Singhvi
@ 2005-02-12 21:43 ` rick jones
2005-02-12 22:00 ` Alexey Kuznetsov
1 sibling, 1 reply; 40+ messages in thread
From: rick jones @ 2005-02-12 21:43 UTC (permalink / raw)
To: Alexey Kuznetsov; +Cc: netdev, romieu, hubert.tonneau, shemminger

> If receiver sent an ACK it still does not mean that sender used it
> to increase its cwnd. Particularly, small packet exchange definitely
> does not inflate cwnd.

Is that in general, or in Linux?

>> output. All the stacks with ACK avoidance with which I am familiar do
>> not make the assumption that the sender is not doing slow-start. They
>> make sure to send enough ACKs at the beginning (or after packet loss)
>> to allow the sender's cwnd to grow.
>
> Well, we do similar thing with delayed ACKs. And it took a few of runs
> of testing to understand that we cannot detect even packet loss
> reliably enough. :-)

I never claimed it was easy :)

> Actually, those receivers could use the first delayed ACK event as
> a sign of failure of their heuristics and block stretching acks for
> this connection.

The ones with which I am familiar do - after N delayed ACK events where
N is something other than one though. And they still send immediate
ACKs to the senders upon out of order data and all that.

rick jones
Wisdom teeth are impacted, people are affected by the effects of events
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 21:43 ` rick jones
@ 2005-02-12 22:00 ` Alexey Kuznetsov
2005-02-13 1:29 ` rick jones
0 siblings, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 22:00 UTC (permalink / raw)
To: rick jones; +Cc: Alexey Kuznetsov, netdev, romieu, hubert.tonneau, shemminger

Hello!

> Is that in general, or in Linux?

Any which follows some of congestion window validation recommendations.
Even canonical bsd restarts slow start after rtt.

> N is something other than one though.

Well, 1 is quite enough to be sure that something is very wrong.
You see a proof here.

Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 22:00 ` Alexey Kuznetsov
@ 2005-02-13 1:29 ` rick jones
0 siblings, 0 replies; 40+ messages in thread
From: rick jones @ 2005-02-13 1:29 UTC (permalink / raw)
To: netdev; +Cc: romieu, hubert.tonneau, shemminger

On Feb 12, 2005, at 2:00 PM, Alexey Kuznetsov wrote:

> Any which follows some of congestion window validation recommendations.

If you could point me at the chapter and verse that would be great.

> Even canonical bsd restarts slow start after rtt.

Did we have >= one RTT of idle in the packet trace?

>> N is something other than one though.
>
> Well, 1 is quite enough to be sure that something is very wrong.
> You see a proof here.

The debate of course is what :) In and of _itself_, a delayed ACK does
not guarantee something is very wrong. For example, in a
request/response situation when the response takes longer than the
delayed ACK interval to generate. And if it was not request/response,
and the sender simply didn't have any more to send at that point.

Going back to the quantity of cwnd which may be left unused when TSO is
enabled. If when TSO is enabled, the sender does not take full
advantage of the cwnd doesn't that then mean that to deal with the same
bandwidth delay product, one needs a larger TCP window when TSO is
enabled than when it is not? In the default case of
tcp_tso_win_divisor being 8 that would be another 12.5% right?

rick jones
there is no rest for the wicked, yet the virtuous have no pillows
^ permalink raw reply [flat|nested] 40+ messages in thread
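One way to check the closing arithmetic is to work it through under a worst-case reading: if exactly 1/divisor of the window always sits idle, only (divisor - 1)/divisor of it carries data, so covering the same bandwidth-delay product takes a window larger by a factor of divisor/(divisor - 1). The sketch below, in C, computes that extra share; the constant-idle-fraction model and the function name are assumptions made for illustration, not something established in the thread.

```c
/* Extra window, in percent, needed to cover the same bandwidth-delay
 * product when 1/divisor of the window is assumed to be always idle:
 * divisor / (divisor - 1) - 1 = 1 / (divisor - 1).
 * Returns -1.0 for divisor < 2, where this model breaks down. */
double extra_window_percent(unsigned int divisor)
{
	if (divisor < 2)
		return -1.0;
	return 100.0 / (double)(divisor - 1);
}
```

For a divisor of 8 this gives roughly 14.3 percent rather than 12.5 percent; 12.5 percent is the idle share of the enlarged window, not the amount by which the original window has to grow. Since 1/8 is described in the thread as a maximum, the real figure would be at or below this worst case.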
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 21:55 2.6.10 TCP troubles -- suggested patch Hubert Tonneau
2005-02-11 22:54 ` Rick Jones
@ 2005-02-11 23:04 ` Stephen Hemminger
2005-02-12 1:07 ` David S. Miller
2005-02-15 23:23 ` David S. Miller
1 sibling, 2 replies; 40+ messages in thread
From: Stephen Hemminger @ 2005-02-11 23:04 UTC (permalink / raw)
To: Hubert Tonneau
Cc: David S. Miller, romieu, kuznet, Nivedita Singhvi, Rick Jones, netdev

On Fri, 11 Feb 2005 21:55:49 GMT
Hubert Tonneau <hubert.tonneau@fullpliant.org> wrote:

> Sorry, it still does not work, unless I made a mistake:
> Linux 2.6.9 takes 15 seconds to copy 105 MB to Mac OSX
> Linux 2.6.10 with the TCP patch below still takes 325 seconds to do the same.
>
> You can pick the new tcpdump report, created through:
> tcpdump -i eth1 ip host 10.107.96.230 -w /tmp/dump-2.6.10-tcp2
> at http://fullpliant.org/pliant/browse/file/archive/dump-2.6.10-tcp2.gz

Still not setting Push sufficiently to keep MacOSX happy.

13:40:35.027124 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 924:975(51) ack 67344 win 50728
13:40:35.027186 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 67344 win 65535
13:40:35.027328 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 975:1026(51) ack 67344 win 65535
13:40:35.027363 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 67344:68792(1448) ack 1026 win 1460
13:40:35.027367 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 68792:70240(1448) ack 1026 win 1460
13:40:35.027370 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 70240:71688(1448) ack 1026 win 1460
13:40:35.027373 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 71688:73136(1448) ack 1026 win 1460
13:40:35.027375 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 73136:74584(1448) ack 1026 win 1460
13:40:35.027378 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 74584:76032(1448) ack 1026 win 1460
13:40:35.027381 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 76032:77480(1448) ack 1026 win 1460
13:40:35.027384 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 77480:78928(1448) ack 1026 win 1460
13:40:35.027387 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 78928:80376(1448) ack 1026 win 1460
13:40:35.027390 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 80376:81824(1448) ack 1026 win 1460
13:40:35.027393 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 81824:83272(1448) ack 1026 win 1460
13:40:35.027397 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: P 83272:83980(708) ack 1026 win 1460

okay burst with push

13:40:35.034930 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 1179:1230(51) ack 133132 win 65535
13:40:35.035304 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 133132:134580(1448) ack 1230 win 1460
13:40:35.035312 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 134580:136028(1448) ack 1230 win 1460

Big gap... because of missing P

13:40:35.219175 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 136028 win 63716
13:40:35.219193 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 136028:137476(1448) ack 1230 win 1460
13:40:35.219197 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 137476:138924(1448) ack 1230 win 1460
13:40:35.419193 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 138924 win 60820
13:40:35.419202 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 138924:140372(1448) ack 1230 win 1460
13:40:35.419205 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 140372:141820(1448) ack 1230 win 1460
13:40:35.419207 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 141820:143268(1448) ack 1230 win 1460
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 23:04 ` Stephen Hemminger
@ 2005-02-12 1:07 ` David S. Miller
2005-02-12 12:11 ` Andi Kleen
2005-02-12 14:16 ` Alexey Kuznetsov
2005-02-15 23:23 ` David S. Miller
1 sibling, 2 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-12 1:07 UTC (permalink / raw)
To: Stephen Hemminger
Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

On Fri, 11 Feb 2005 15:04:20 -0800
Stephen Hemminger <shemminger@osdl.org> wrote:

> Still not setting Push sufficiently to keep MacOSX happy.

I don't think it's the kernel's fault in this case.

This set of data frames you quoted are all full, and
are tightly interspaced. It looks exactly like a TSO
frame, which we certainly set PSH on, but the TSO
engine is dropping it apparently.

I guess this is e1000. Any e1000 internals experts reading
here who can comment on how e1000's TSO engine treats the
PSH flag?
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 1:07 ` David S. Miller
@ 2005-02-12 12:11 ` Andi Kleen
2005-02-12 19:23 ` David S. Miller
2005-02-12 14:16 ` Alexey Kuznetsov
1 sibling, 1 reply; 40+ messages in thread
From: Andi Kleen @ 2005-02-12 12:11 UTC (permalink / raw)
To: David S. Miller; +Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

"David S. Miller" <davem@davemloft.net> writes:
>
> I guess this is e1000. Any e1000 internals experts reading
> here who can comment on how e1000's TSO engine treats the
> PSH flag?

If that is the problem it should be easy to test for. Just
disable TSO with ethtool -K ethX tso off

Hubert, does that make the problem go away?

-Andi
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 12:11 ` Andi Kleen
@ 2005-02-12 19:23 ` David S. Miller
2005-02-12 21:30 ` Andi Kleen
0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-12 19:23 UTC (permalink / raw)
To: Andi Kleen; +Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

On Sat, 12 Feb 2005 13:11:43 +0100
Andi Kleen <ak@muc.de> wrote:

> "David S. Miller" <davem@davemloft.net> writes:
> >
> > I guess this is e1000. Any e1000 internals experts reading
> > here who can comment on how e1000's TSO engine treats the
> > PSH flag?
>
> If that is the problem it should be easy to test for. Just
> disable TSO with ethtool -K ethX tso off
>
> Hubert, does that make the problem go away?

We're testing the new code that sets PSH on every TSO frame.
If we disable TSO, the new code won't be exercised nor tested.
:-)
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 19:23 ` David S. Miller
@ 2005-02-12 21:30 ` Andi Kleen
0 siblings, 0 replies; 40+ messages in thread
From: Andi Kleen @ 2005-02-12 21:30 UTC (permalink / raw)
To: David S. Miller; +Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

> We're testing the new code that sets PSH on every TSO frame.
> If we disable TSO, the new code won't be exercised nor tested.
> :-)

Sorry, I read the thread out of order (shouldn't do that).
Ignore my mail.

-Andi

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 1:07 ` David S. Miller
2005-02-12 12:11 ` Andi Kleen
@ 2005-02-12 14:16 ` Alexey Kuznetsov
2005-02-12 19:41 ` David S. Miller
1 sibling, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 14:16 UTC (permalink / raw)
To: David S. Miller
Cc: Stephen Hemminger, hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

Hello!

> This set of data frames you quoted are all full, and
> are tightly interspaced. It looks exactly like a TSO
> frame, which we certainly set PSH on, but the TSO
> engine is dropping it apparently.
>
> I guess this is e1000. Any e1000 internals experts reading
> here who can comment on how e1000's TSO engine treats the
> PSH flag?

Or it was two one-segment frames.

Before blaming e1000 it would be easier to confirm that Linux
never worked with MacOS X, except for those kernels which had congestion
avoidance mostly suppressed.

I.e. let's disable TSO in 2.6.9 and look.

Alexey

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 14:16 ` Alexey Kuznetsov
@ 2005-02-12 19:41 ` David S. Miller
2005-02-12 20:03 ` Alexey Kuznetsov
0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-12 19:41 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: shemminger, hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

On Sat, 12 Feb 2005 17:16:41 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:

> > This set of data frames you quoted are all full, and
> > are tightly interspaced. It looks exactly like a TSO
> > frame, which we certainly set PSH on, but the TSO
> > engine is dropping it apparently.
...
> Or it was two one-segment frames.

Even ignoring my TSO changes, we should be seeing at a minimum
1/2 window PSH settings which we're not as far as I can tell.
(this is due to the forced_push() test in net/ipv4/tcp.c)

This also points out a bug in my TSO PSH patch, I should be
updating tp->pushed_seq, shouldn't I? Question is, what to
set it to? I think the correct value is TCP_SKB_CB(skb)->end_seq.

> I.e. let's disable TSO in 2.6.9 and look.

I believe this experiment had been performed already.
Stephen, isn't that the case?

^ permalink raw reply [flat|nested] 40+ messages in thread
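[Annotation] The half-window PSH rule and the proposed pushed_seq update discussed above can be sketched in userspace C. This is a minimal sketch under stated assumptions, not the kernel source: the struct names, field layout and the sketch_* helpers are hypothetical stand-ins for struct tcp_opt, struct sk_buff and the helpers in net/ipv4/tcp.c and tcp_output.c. Only the idea (force PSH once more than half the peer's largest window has been queued since the last push, and record end_seq as the new pushed_seq when PSH is forced on a TSO frame) comes from the thread.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FLAG_PSH 0x08	/* PSH bit in the TCP flags byte */

/* Hypothetical, simplified sender state (stand-in for struct tcp_opt). */
struct sketch_tp {
	uint32_t write_seq;	/* one past the last byte queued for sending */
	uint32_t pushed_seq;	/* sequence number at which PSH was last set */
	uint32_t max_window;	/* largest window the peer has advertised */
};

/* Hypothetical, simplified buffer (stand-in for struct sk_buff + its CB). */
struct sketch_skb {
	uint32_t end_seq;	/* one past the last byte carried by this skb */
	uint8_t flags;		/* TCP flags to put on the wire */
	int pcount;		/* wire segments in this skb; > 1 means TSO */
};

/* Wraparound-safe test: is sequence number a strictly after b? */
static int sketch_seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* True once more than half a window has been queued since the last PSH. */
static int sketch_forced_push(const struct sketch_tp *tp)
{
	return sketch_seq_after(tp->write_seq,
				tp->pushed_seq + (tp->max_window >> 1));
}

/* Force PSH on multi-segment (TSO) frames and remember where we pushed,
 * using end_seq as the value suggested in the message above.
 */
static void sketch_tso_set_push(struct sketch_tp *tp, struct sketch_skb *skb)
{
	if (skb->pcount > 1) {
		skb->flags |= SKETCH_FLAG_PSH;
		tp->pushed_seq = skb->end_seq;
	}
}
```

Single-segment frames are left untouched by sketch_tso_set_push(), which matches Alexey's caveat that the workaround only helps if the segments in question really are TSO frames.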
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 19:41 ` David S. Miller
@ 2005-02-12 20:03 ` Alexey Kuznetsov
2005-02-15 23:26 ` David S. Miller
0 siblings, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-12 20:03 UTC (permalink / raw)
To: David S. Miller
Cc: Alexey Kuznetsov, shemminger, hubert.tonneau, romieu, niv, rick.jones2, netdev

Hello!

> set it to? I think correct value is TCP_SKB_CB(skb)->end_seq.

Yup. But it does not matter. When it is not advanced, it does not
make PSHs more rare.

Actually, that anti-MacOS never worked well. If segment with forced PSH
was not transmitted in time, even forced PSHs could be deleted.
Your patch with setting PSH right before (or in) tcp_transmit_skb() must
work. Unless these segments are not tso.

> > I.e. let's disable TSO in 2.6.9 and look.
>
> I believe this experiment had been performed already.

I saw only tests with TSO. And 2.6.9 showed exactly the same
weird behaviour. Only 2.6.9 did not slow start and we had only
a few of 200msec gaps.

Alexey

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-12 20:03 ` Alexey Kuznetsov
@ 2005-02-15 23:26 ` David S. Miller
2005-02-15 23:42 ` Rick Jones
0 siblings, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-15 23:26 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: kuznet, shemminger, hubert.tonneau, romieu, niv, rick.jones2, netdev

On Sat, 12 Feb 2005 23:03:18 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:

> Actually, that anti-MacOS never worked well. If segment with forced PSH
> was not transmitted in time, even forced PSHs could be deleted.
> Your patch with setting PSH right before (or in) tcp_transmit_skb() must
> work. Unless these segments are not tso.

Yes, it never did work well. But now we understand more deeply
the nature of this beast, we can probably refine it.

In short, for properly working TCP stream with no drops and no
reordering, Darwin delays ACKs until delack timer fires or PSH
is seen :-)

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-15 23:26 ` David S. Miller
@ 2005-02-15 23:42 ` Rick Jones
0 siblings, 0 replies; 40+ messages in thread
From: Rick Jones @ 2005-02-15 23:42 UTC (permalink / raw)
To: netdev

> In short, for properly working TCP stream with no drops and no
> reordering, Darwin delays ACKs until delack timer fires or PSH
> is seen :-)

As a supporter of ACK avoidance heuristics in general, I will come out and say
that the heuristic above does indeed sound quite broken. It is not the
heuristic with which I am familiar, which has a configurable maximum number of
segments for which to delay the ACK.

rick jones

^ permalink raw reply [flat|nested] 40+ messages in thread
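[Annotation] The kind of ACK avoidance Rick contrasts with Darwin's behaviour, a cap on how many segments may go unacknowledged, could look roughly like the sketch below. Only the "configurable maximum number of segments" comes from his message; the struct, the function names and the timer fallback are assumptions added for illustration and do not describe any particular stack.

```c
#include <assert.h>

/* Hypothetical receiver-side state for a segment-count ACK heuristic. */
struct ack_avoid {
	unsigned int max_delayed;	/* configurable cap on unacked segments */
	unsigned int unacked;		/* in-order segments received since last ACK */
};

/* Called for each in-order data segment. Returns 1 if an ACK should be
 * sent now because the cap has been reached, 0 if it may still be delayed.
 */
static int ack_avoid_on_segment(struct ack_avoid *s)
{
	s->unacked++;
	if (s->unacked >= s->max_delayed) {
		s->unacked = 0;
		return 1;
	}
	return 0;
}

/* Called when the delayed-ACK timer fires. Returns 1 if anything is
 * still waiting to be acknowledged.
 */
static int ack_avoid_on_timer(struct ack_avoid *s)
{
	if (s->unacked) {
		s->unacked = 0;
		return 1;
	}
	return 0;
}
```

With such a cap, a steady stream of full-sized segments without PSH still draws an ACK every max_delayed segments, so the sender is not left waiting on the delayed-ACK timer.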
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-11 23:04 ` Stephen Hemminger
2005-02-12 1:07 ` David S. Miller
@ 2005-02-15 23:23 ` David S. Miller
2005-02-16 9:13 ` Alexey Kuznetsov
1 sibling, 1 reply; 40+ messages in thread
From: David S. Miller @ 2005-02-15 23:23 UTC (permalink / raw)
To: Stephen Hemminger
Cc: hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

On Fri, 11 Feb 2005 15:04:20 -0800
Stephen Hemminger <shemminger@osdl.org> wrote:

> Still not setting Push sufficiently to keep MacOSX happy.
...
> 13:40:35.034930 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: P 1179:1230(51) ack 133132 win 65535
> 13:40:35.035304 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 133132:134580(1448) ack 1230 win 1460
> 13:40:35.035312 IP 10.107.96.7.32801 > 10.107.96.230.netbios-ssn: . 134580:136028(1448) ack 1230 win 1460
>
> Big gap... because of missing P
>
> 13:40:35.219175 IP 10.107.96.230.netbios-ssn > 10.107.96.7.32801: . ack 136028 win 63716

I am starting to understand Darwin's logic.

If header prediction fast path is hit, ACK is always delayed
when delack sysctl is enabled.

One way to miss fast path is for PSH to be set. This will
make ACK not get delayed if ACK is pending already.

At least that is how it looks, and it makes sense given this
trace.

How mind boggling a heuristic. I bet it works by accident rather
than intention and purposeful design.

^ permalink raw reply [flat|nested] 40+ messages in thread
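[Annotation] David's reading of the trace can be turned into a small toy model. It encodes only what his message says (fast path always delays the ACK when delack is enabled; a PSH segment misses the fast path and flushes an ACK that is already pending) and is in no way derived from Darwin's source. All names are hypothetical, and the handling of a PSH segment with no ACK pending is an assumption.

```c
#include <assert.h>

enum darwin_ack { DARWIN_ACK_DELAY, DARWIN_ACK_NOW };

/* Hypothetical receiver state for the toy model. */
struct darwin_rx {
	int delack_enabled;	/* models the delack sysctl */
	int ack_pending;	/* a delayed ACK is already outstanding */
};

/* Decide what happens to the ACK for one in-order data segment. */
static enum darwin_ack darwin_on_segment(struct darwin_rx *rx, int psh)
{
	if (!rx->delack_enabled)
		return DARWIN_ACK_NOW;

	if (!psh) {
		/* Header prediction fast path: always delay. */
		rx->ack_pending = 1;
		return DARWIN_ACK_DELAY;
	}

	/* PSH misses the fast path: flush an already pending ACK. */
	if (rx->ack_pending) {
		rx->ack_pending = 0;
		return DARWIN_ACK_NOW;
	}

	/* Assumption: with nothing pending, the ACK is delayed as usual. */
	rx->ack_pending = 1;
	return DARWIN_ACK_DELAY;
}

/* Delayed-ACK timer expiry: returns 1 if an ACK goes out now. */
static int darwin_on_delack_timer(struct darwin_rx *rx)
{
	if (rx->ack_pending) {
		rx->ack_pending = 0;
		return 1;
	}
	return 0;
}
```

Fed the two PSH-less 1448-byte segments from the quoted trace, the model delays both ACKs and only the timer releases them, which is consistent with the gap of roughly 184 ms between 13:40:35.035312 and 13:40:35.219175.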
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-15 23:23 ` David S. Miller
@ 2005-02-16 9:13 ` Alexey Kuznetsov
2005-02-16 17:50 ` David S. Miller
0 siblings, 1 reply; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-16 9:13 UTC (permalink / raw)
To: David S. Miller
Cc: Stephen Hemminger, hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

Hello!

> How mind boggling a heuristic. I bet it works by accident rather
> than intention and purposeful design.

Yup. It is definitely not an "ack avoidance algorithm" :-) :-)

BTW it is still a puzzle why 2.6.9 works. With disabled TSO it should
insert PSHs quite rarely, similarly to tso.

And it is still a puzzle how that bunch of PSHless segments
not followed by PSH appeared in TSO case.

Alexey

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-16 9:13 ` Alexey Kuznetsov
@ 2005-02-16 17:50 ` David S. Miller
0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-16 17:50 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: shemminger, hubert.tonneau, romieu, kuznet, niv, rick.jones2, netdev

[-- Attachment #1: Type: text/plain, Size: 661 bytes --]

On Wed, 16 Feb 2005 12:13:23 +0300
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> wrote:

> BTW it is still a puzzle why 2.6.9 works. With disabled TSO it should
> insert PSHs quite rarely, similarly to tso.

Yes.

Hubert, do you have netfilter enabled in the 2.6.10 kernel you
are running?

I'm asking because the TCP changes in 2.6.10 are pretty benign
(attached for the curious who want to review along), whereas
netfilter had major updates particularly in the TCP connection
tracking code.

I also reviewed 2.6.10-ac11 for anything interesting wrt. TCP
and the only thing in there is the tcp_retrans_try_collapse()
missing check to avoid collapsing TSO segments.
[-- Attachment #2: tcp-2.6.10 --] [-- Type: application/octet-stream, Size: 35185 bytes --] diff -Nru a/include/linux/tcp.h b/include/linux/tcp.h --- a/include/linux/tcp.h 2004-12-24 13:36:49 -08:00 +++ b/include/linux/tcp.h 2004-12-24 13:36:49 -08:00 @@ -186,6 +186,8 @@ __u32 tcpi_rcv_rtt; __u32 tcpi_rcv_space; + + __u32 tcpi_total_retrans; }; #ifdef __KERNEL__ @@ -363,6 +365,8 @@ __u8 pending; /* Scheduled timer event */ __u8 urg_mode; /* In urgent mode */ __u32 snd_up; /* Urgent pointer */ + + __u32 total_retrans; /* Total retransmits for entire connection */ /* The syn_wait_lock is necessary only to avoid proc interface having * to grab the main lock sock while browsing the listening hash diff -Nru a/include/net/tcp.h b/include/net/tcp.h --- a/include/net/tcp.h 2004-12-24 13:36:18 -08:00 +++ b/include/net/tcp.h 2004-12-24 13:36:18 -08:00 @@ -159,7 +159,6 @@ extern void tcp_bucket_destroy(struct tcp_bind_bucket *tb); extern void tcp_bucket_unlock(struct sock *sk); extern int tcp_port_rover; -extern struct sock *tcp_v4_lookup_listener(u32 addr, unsigned short hnum, int dif); /* These are AF independent. */ static __inline__ int tcp_bhashfn(__u16 lport) @@ -362,8 +361,8 @@ #define TCP_IPV6_MATCH(__sk, __saddr, __daddr, __ports, __dif) \ (((*((__u32 *)&(inet_sk(__sk)->dport)))== (__ports)) && \ ((__sk)->sk_family == AF_INET6) && \ - !ipv6_addr_cmp(&inet6_sk(__sk)->daddr, (__saddr)) && \ - !ipv6_addr_cmp(&inet6_sk(__sk)->rcv_saddr, (__daddr)) && \ + ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr)) && \ + ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr)) && \ (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) /* These can have wildcards, don't try too hard. 
*/ @@ -961,12 +960,14 @@ extern void tcp_init_xmit_timers(struct sock *); extern void tcp_clear_xmit_timers(struct sock *); -extern void tcp_delete_keepalive_timer (struct sock *); -extern void tcp_reset_keepalive_timer (struct sock *, unsigned long); +extern void tcp_delete_keepalive_timer(struct sock *); +extern void tcp_reset_keepalive_timer(struct sock *, unsigned long); extern unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu); extern unsigned int tcp_current_mss(struct sock *sk, int large); -extern const char timer_bug_msg[]; +#ifdef TCP_DEBUG +extern const char tcp_timer_bug_msg[]; +#endif /* tcp_diag.c */ extern void tcp_get_info(struct sock *, struct tcp_info *); @@ -999,7 +1000,9 @@ #endif break; default: - printk(timer_bug_msg); +#ifdef TCP_DEBUG + printk(tcp_timer_bug_msg); +#endif return; }; @@ -1034,7 +1037,9 @@ break; default: - printk(timer_bug_msg); +#ifdef TCP_DEBUG + printk(tcp_timer_bug_msg); +#endif }; } @@ -1083,7 +1088,7 @@ * Rcv_nxt can be after the window if our peer push more data * than the offered window. */ -static __inline__ u32 tcp_receive_window(struct tcp_opt *tp) +static __inline__ u32 tcp_receive_window(const struct tcp_opt *tp) { s32 win = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt; @@ -1161,18 +1166,19 @@ /* Due to TSO, an SKB can be composed of multiple actual * packets. To keep these tracked properly, we use this. */ -static inline int tcp_skb_pcount(struct sk_buff *skb) +static inline int tcp_skb_pcount(const struct sk_buff *skb) { return skb_shinfo(skb)->tso_segs; } /* This is valid iff tcp_skb_pcount() > 1. 
*/ -static inline int tcp_skb_mss(struct sk_buff *skb) +static inline int tcp_skb_mss(const struct sk_buff *skb) { return skb_shinfo(skb)->tso_size; } -static inline void tcp_inc_pcount(tcp_pcount_t *count, struct sk_buff *skb) +static inline void tcp_inc_pcount(tcp_pcount_t *count, + const struct sk_buff *skb) { count->val += tcp_skb_pcount(skb); } @@ -1187,13 +1193,14 @@ count->val -= amt; } -static inline void tcp_dec_pcount(tcp_pcount_t *count, struct sk_buff *skb) +static inline void tcp_dec_pcount(tcp_pcount_t *count, + const struct sk_buff *skb) { count->val -= tcp_skb_pcount(skb); } static inline void tcp_dec_pcount_approx(tcp_pcount_t *count, - struct sk_buff *skb) + const struct sk_buff *skb) { if (count->val) { count->val -= tcp_skb_pcount(skb); @@ -1202,7 +1209,7 @@ } } -static inline __u32 tcp_get_pcount(tcp_pcount_t *count) +static inline __u32 tcp_get_pcount(const tcp_pcount_t *count) { return count->val; } @@ -1212,8 +1219,9 @@ count->val = val; } -static inline void tcp_packets_out_inc(struct sock *sk, struct tcp_opt *tp, - struct sk_buff *skb) +static inline void tcp_packets_out_inc(struct sock *sk, + struct tcp_opt *tp, + const struct sk_buff *skb) { int orig = tcp_get_pcount(&tp->packets_out); @@ -1222,7 +1230,8 @@ tcp_reset_xmit_timer(sk, TCP_TIME_RETRANS, tp->rto); } -static inline void tcp_packets_out_dec(struct tcp_opt *tp, struct sk_buff *skb) +static inline void tcp_packets_out_dec(struct tcp_opt *tp, + const struct sk_buff *skb) { tcp_dec_pcount(&tp->packets_out, skb); } @@ -1241,7 +1250,7 @@ * "Packets left network, but not honestly ACKed yet" PLUS * "Packets fast retransmitted" */ -static __inline__ unsigned int tcp_packets_in_flight(struct tcp_opt *tp) +static __inline__ unsigned int tcp_packets_in_flight(const struct tcp_opt *tp) { return (tcp_get_pcount(&tp->packets_out) - tcp_get_pcount(&tp->left_out) + @@ -1408,18 +1417,19 @@ /* Slow start with delack produces 3 packets of burst, so that * it is safe "de facto". 
*/ -static __inline__ __u32 tcp_max_burst(struct tcp_opt *tp) +static __inline__ __u32 tcp_max_burst(const struct tcp_opt *tp) { return 3; } -static __inline__ int tcp_minshall_check(struct tcp_opt *tp) +static __inline__ int tcp_minshall_check(const struct tcp_opt *tp) { return after(tp->snd_sml,tp->snd_una) && !after(tp->snd_sml, tp->snd_nxt); } -static __inline__ void tcp_minshall_update(struct tcp_opt *tp, int mss, struct sk_buff *skb) +static __inline__ void tcp_minshall_update(struct tcp_opt *tp, int mss, + const struct sk_buff *skb) { if (skb->len < mss) tp->snd_sml = TCP_SKB_CB(skb)->end_seq; @@ -1434,7 +1444,8 @@ */ static __inline__ int -tcp_nagle_check(struct tcp_opt *tp, struct sk_buff *skb, unsigned mss_now, int nonagle) +tcp_nagle_check(const struct tcp_opt *tp, const struct sk_buff *skb, + unsigned mss_now, int nonagle) { return (skb->len < mss_now && !(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) && @@ -1449,7 +1460,8 @@ /* This checks if the data bearing packet SKB (usually sk->sk_send_head) * should be put on the wire right now. 
*/ -static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb, +static __inline__ int tcp_snd_test(const struct tcp_opt *tp, + struct sk_buff *skb, unsigned cur_mss, int nonagle) { int pkts = tcp_skb_pcount(skb); @@ -1496,7 +1508,8 @@ tcp_reset_xmit_timer(sk, TCP_TIME_PROBE0, tp->rto); } -static __inline__ int tcp_skb_is_last(struct sock *sk, struct sk_buff *skb) +static __inline__ int tcp_skb_is_last(const struct sock *sk, + const struct sk_buff *skb) { return skb->next == (struct sk_buff *)&sk->sk_write_queue; } @@ -1547,7 +1560,7 @@ tp->snd_wl1 = seq; } -extern void tcp_destroy_sock(struct sock *sk); +extern void tcp_destroy_sock(struct sock *sk); /* @@ -1621,7 +1634,7 @@ #undef STATE_TRACE #ifdef STATE_TRACE -static char *statename[]={ +static const char *statename[]={ "Unused","Established","Syn Sent","Syn Recv", "Fin Wait 1","Fin Wait 2","Time Wait", "Close", "Close Wait","Last ACK","Listen","Closing" @@ -1892,17 +1905,17 @@ wake_up(&tcp_lhash_wait); } -static inline int keepalive_intvl_when(struct tcp_opt *tp) +static inline int keepalive_intvl_when(const struct tcp_opt *tp) { return tp->keepalive_intvl ? : sysctl_tcp_keepalive_intvl; } -static inline int keepalive_time_when(struct tcp_opt *tp) +static inline int keepalive_time_when(const struct tcp_opt *tp) { return tp->keepalive_time ? : sysctl_tcp_keepalive_time; } -static inline int tcp_fin_time(struct tcp_opt *tp) +static inline int tcp_fin_time(const struct tcp_opt *tp) { int fin_timeout = tp->linger2 ? 
: sysctl_tcp_fin_timeout; @@ -1912,7 +1925,7 @@ return fin_timeout; } -static inline int tcp_paws_check(struct tcp_opt *tp, int rst) +static inline int tcp_paws_check(const struct tcp_opt *tp, int rst) { if ((s32)(tp->rcv_tsval - tp->ts_recent) >= 0) return 0; diff -Nru a/net/ipv4/tcp.c b/net/ipv4/tcp.c --- a/net/ipv4/tcp.c 2004-12-24 13:36:31 -08:00 +++ b/net/ipv4/tcp.c 2004-12-24 13:36:31 -08:00 @@ -467,7 +467,7 @@ sk->sk_max_ack_backlog = 0; sk->sk_ack_backlog = 0; tp->accept_queue = tp->accept_queue_tail = NULL; - tp->syn_wait_lock = RW_LOCK_UNLOCKED; + rwlock_init(&tp->syn_wait_lock); tcp_delack_init(tp); lopt = kmalloc(sizeof(struct tcp_listen_opt), GFP_KERNEL); @@ -2095,6 +2095,65 @@ return err; } +/* Return information about state of tcp endpoint in API format. */ +void tcp_get_info(struct sock *sk, struct tcp_info *info) +{ + struct tcp_opt *tp = tcp_sk(sk); + u32 now = tcp_time_stamp; + + memset(info, 0, sizeof(*info)); + + info->tcpi_state = sk->sk_state; + info->tcpi_ca_state = tp->ca_state; + info->tcpi_retransmits = tp->retransmits; + info->tcpi_probes = tp->probes_out; + info->tcpi_backoff = tp->backoff; + + if (tp->tstamp_ok) + info->tcpi_options |= TCPI_OPT_TIMESTAMPS; + if (tp->sack_ok) + info->tcpi_options |= TCPI_OPT_SACK; + if (tp->wscale_ok) { + info->tcpi_options |= TCPI_OPT_WSCALE; + info->tcpi_snd_wscale = tp->snd_wscale; + info->tcpi_rcv_wscale = tp->rcv_wscale; + } + + if (tp->ecn_flags&TCP_ECN_OK) + info->tcpi_options |= TCPI_OPT_ECN; + + info->tcpi_rto = jiffies_to_usecs(tp->rto); + info->tcpi_ato = jiffies_to_usecs(tp->ack.ato); + info->tcpi_snd_mss = tp->mss_cache_std; + info->tcpi_rcv_mss = tp->ack.rcv_mss; + + info->tcpi_unacked = tcp_get_pcount(&tp->packets_out); + info->tcpi_sacked = tcp_get_pcount(&tp->sacked_out); + info->tcpi_lost = tcp_get_pcount(&tp->lost_out); + info->tcpi_retrans = tcp_get_pcount(&tp->retrans_out); + info->tcpi_fackets = tcp_get_pcount(&tp->fackets_out); + + info->tcpi_last_data_sent = jiffies_to_msecs(now 
- tp->lsndtime); + info->tcpi_last_data_recv = jiffies_to_msecs(now - tp->ack.lrcvtime); + info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp); + + info->tcpi_pmtu = tp->pmtu_cookie; + info->tcpi_rcv_ssthresh = tp->rcv_ssthresh; + info->tcpi_rtt = jiffies_to_usecs(tp->srtt)>>3; + info->tcpi_rttvar = jiffies_to_usecs(tp->mdev)>>2; + info->tcpi_snd_ssthresh = tp->snd_ssthresh; + info->tcpi_snd_cwnd = tp->snd_cwnd; + info->tcpi_advmss = tp->advmss; + info->tcpi_reordering = tp->reordering; + + info->tcpi_rcv_rtt = jiffies_to_usecs(tp->rcv_rtt_est.rtt)>>3; + info->tcpi_rcv_space = tp->rcvq_space.space; + + info->tcpi_total_retrans = tp->total_retrans; +} + +EXPORT_SYMBOL_GPL(tcp_get_info); + int tcp_getsockopt(struct sock *sk, int level, int optname, char __user *optval, int __user *optlen) { @@ -2250,7 +2309,7 @@ if (!tcp_ehash) panic("Failed to allocate TCP established hash table\n"); for (i = 0; i < (tcp_ehash_size << 1); i++) { - tcp_ehash[i].lock = RW_LOCK_UNLOCKED; + rwlock_init(&tcp_ehash[i].lock); INIT_HLIST_HEAD(&tcp_ehash[i].chain); } @@ -2266,7 +2325,7 @@ if (!tcp_bhash) panic("Failed to allocate TCP bind hash table\n"); for (i = 0; i < tcp_bhash_size; i++) { - tcp_bhash[i].lock = SPIN_LOCK_UNLOCKED; + spin_lock_init(&tcp_bhash[i].lock); INIT_HLIST_HEAD(&tcp_bhash[i].chain); } @@ -2301,13 +2360,10 @@ printk(KERN_INFO "TCP: Hash tables configured " "(established %d bind %d)\n", tcp_ehash_size << 1, tcp_bhash_size); - - tcpdiag_init(); } EXPORT_SYMBOL(tcp_accept); EXPORT_SYMBOL(tcp_close); -EXPORT_SYMBOL(tcp_close_state); EXPORT_SYMBOL(tcp_destroy_sock); EXPORT_SYMBOL(tcp_disconnect); EXPORT_SYMBOL(tcp_getsockopt); diff -Nru a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c --- a/net/ipv4/tcp_diag.c 2004-12-24 13:36:17 -08:00 +++ b/net/ipv4/tcp_diag.c 2004-12-24 13:36:17 -08:00 @@ -18,6 +18,7 @@ #include <linux/random.h> #include <linux/cache.h> #include <linux/init.h> +#include <linux/time.h> #include <net/icmp.h> #include <net/tcp.h> @@ -29,6 +30,16 
@@ #include <linux/tcp_diag.h> +struct tcpdiag_entry +{ + u32 *saddr; + u32 *daddr; + u16 sport; + u16 dport; + u16 family; + u16 userlocks; +}; + static struct sock *tcpnl; @@ -41,63 +52,8 @@ rta->rta_len = rtalen; \ RTA_DATA(rta); }) -/* Return information about state of tcp endpoint in API format. */ -void tcp_get_info(struct sock *sk, struct tcp_info *info) -{ - struct tcp_opt *tp = tcp_sk(sk); - u32 now = tcp_time_stamp; - - memset(info, 0, sizeof(*info)); - - info->tcpi_state = sk->sk_state; - info->tcpi_ca_state = tp->ca_state; - info->tcpi_retransmits = tp->retransmits; - info->tcpi_probes = tp->probes_out; - info->tcpi_backoff = tp->backoff; - - if (tp->tstamp_ok) - info->tcpi_options |= TCPI_OPT_TIMESTAMPS; - if (tp->sack_ok) - info->tcpi_options |= TCPI_OPT_SACK; - if (tp->wscale_ok) { - info->tcpi_options |= TCPI_OPT_WSCALE; - info->tcpi_snd_wscale = tp->snd_wscale; - info->tcpi_rcv_wscale = tp->rcv_wscale; - } - - if (tp->ecn_flags&TCP_ECN_OK) - info->tcpi_options |= TCPI_OPT_ECN; - - info->tcpi_rto = jiffies_to_usecs(tp->rto); - info->tcpi_ato = jiffies_to_usecs(tp->ack.ato); - info->tcpi_snd_mss = tp->mss_cache_std; - info->tcpi_rcv_mss = tp->ack.rcv_mss; - - info->tcpi_unacked = tcp_get_pcount(&tp->packets_out); - info->tcpi_sacked = tcp_get_pcount(&tp->sacked_out); - info->tcpi_lost = tcp_get_pcount(&tp->lost_out); - info->tcpi_retrans = tcp_get_pcount(&tp->retrans_out); - info->tcpi_fackets = tcp_get_pcount(&tp->fackets_out); - - info->tcpi_last_data_sent = jiffies_to_msecs(now - tp->lsndtime); - info->tcpi_last_data_recv = jiffies_to_msecs(now - tp->ack.lrcvtime); - info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp); - - info->tcpi_pmtu = tp->pmtu_cookie; - info->tcpi_rcv_ssthresh = tp->rcv_ssthresh; - info->tcpi_rtt = jiffies_to_usecs(tp->srtt)>>3; - info->tcpi_rttvar = jiffies_to_usecs(tp->mdev)>>2; - info->tcpi_snd_ssthresh = tp->snd_ssthresh; - info->tcpi_snd_cwnd = tp->snd_cwnd; - info->tcpi_advmss = tp->advmss; - 
info->tcpi_reordering = tp->reordering; - - info->tcpi_rcv_rtt = jiffies_to_usecs(tp->rcv_rtt_est.rtt)>>3; - info->tcpi_rcv_space = tp->rcvq_space.space; -} - static int tcpdiag_fill(struct sk_buff *skb, struct sock *sk, - int ext, u32 pid, u32 seq) + int ext, u32 pid, u32 seq, u16 nlmsg_flags) { struct inet_opt *inet = inet_sk(sk); struct tcp_opt *tp = tcp_sk(sk); @@ -109,6 +65,7 @@ unsigned char *b = skb->tail; nlh = NLMSG_PUT(skb, pid, seq, TCPDIAG_GETSOCK, sizeof(*r)); + nlh->nlmsg_flags = nlmsg_flags; r = NLMSG_DATA(nlh); if (sk->sk_state != TCP_TIME_WAIT) { if (ext & (1<<(TCPDIAG_MEMINFO-1))) @@ -146,7 +103,7 @@ r->tcpdiag_wqueue = 0; r->tcpdiag_uid = 0; r->tcpdiag_inode = 0; -#ifdef CONFIG_IPV6 +#ifdef CONFIG_IP_TCPDIAG_IPV6 if (r->tcpdiag_family == AF_INET6) { ipv6_addr_copy((struct in6_addr *)r->id.tcpdiag_src, &tw->tw_v6_rcv_saddr); @@ -163,7 +120,7 @@ r->id.tcpdiag_src[0] = inet->rcv_saddr; r->id.tcpdiag_dst[0] = inet->daddr; -#ifdef CONFIG_IPV6 +#ifdef CONFIG_IP_TCPDIAG_IPV6 if (r->tcpdiag_family == AF_INET6) { struct ipv6_pinfo *np = inet6_sk(sk); @@ -231,11 +188,19 @@ return -1; } -extern struct sock *tcp_v4_lookup(u32 saddr, u16 sport, u32 daddr, u16 dport, int dif); -#ifdef CONFIG_IPV6 +extern struct sock *tcp_v4_lookup(u32 saddr, u16 sport, u32 daddr, u16 dport, + int dif); +#ifdef CONFIG_IP_TCPDIAG_IPV6 extern struct sock *tcp_v6_lookup(struct in6_addr *saddr, u16 sport, struct in6_addr *daddr, u16 dport, int dif); +#else +static inline struct sock *tcp_v6_lookup(struct in6_addr *saddr, u16 sport, + struct in6_addr *daddr, u16 dport, + int dif) +{ + return NULL; +} #endif static int tcpdiag_get_exact(struct sk_buff *in_skb, const struct nlmsghdr *nlh) @@ -250,7 +215,7 @@ req->id.tcpdiag_src[0], req->id.tcpdiag_sport, req->id.tcpdiag_if); } -#ifdef CONFIG_IPV6 +#ifdef CONFIG_IP_TCPDIAG_IPV6 else if (req->tcpdiag_family == AF_INET6) { sk = tcp_v6_lookup((struct in6_addr*)req->id.tcpdiag_dst, req->id.tcpdiag_dport, (struct 
in6_addr*)req->id.tcpdiag_src, req->id.tcpdiag_sport, @@ -280,7 +245,7 @@ if (tcpdiag_fill(rep, sk, req->tcpdiag_ext, NETLINK_CB(in_skb).pid, - nlh->nlmsg_seq) <= 0) + nlh->nlmsg_seq, 0) <= 0) BUG(); err = netlink_unicast(tcpnl, rep, NETLINK_CB(in_skb).pid, MSG_DONTWAIT); @@ -324,11 +289,11 @@ } -static int tcpdiag_bc_run(const void *bc, int len, struct sock *sk) +static int tcpdiag_bc_run(const void *bc, int len, + const struct tcpdiag_entry *entry) { while (len > 0) { int yes = 1; - struct inet_opt *inet = inet_sk(sk); const struct tcpdiag_bc_op *op = bc; switch (op->code) { @@ -338,19 +303,19 @@ yes = 0; break; case TCPDIAG_BC_S_GE: - yes = inet->num >= op[1].no; + yes = entry->sport >= op[1].no; break; case TCPDIAG_BC_S_LE: - yes = inet->num <= op[1].no; + yes = entry->dport <= op[1].no; break; case TCPDIAG_BC_D_GE: - yes = ntohs(inet->dport) >= op[1].no; + yes = entry->dport >= op[1].no; break; case TCPDIAG_BC_D_LE: - yes = ntohs(inet->dport) <= op[1].no; + yes = entry->dport <= op[1].no; break; case TCPDIAG_BC_AUTO: - yes = !(sk->sk_userlocks & SOCK_BINDPORT_LOCK); + yes = !(entry->userlocks & SOCK_BINDPORT_LOCK); break; case TCPDIAG_BC_S_COND: case TCPDIAG_BC_D_COND: @@ -360,7 +325,7 @@ if (cond->port != -1 && cond->port != (op->code == TCPDIAG_BC_S_COND ? 
- inet->num : ntohs(inet->dport))) { + entry->sport : entry->dport)) { yes = 0; break; } @@ -368,26 +333,14 @@ if (cond->prefix_len == 0) break; -#ifdef CONFIG_IPV6 - if (sk->sk_family == AF_INET6) { - struct ipv6_pinfo *np = inet6_sk(sk); - - if (op->code == TCPDIAG_BC_S_COND) - addr = (u32*)&np->rcv_saddr; - else - addr = (u32*)&np->daddr; - } else -#endif - { - if (op->code == TCPDIAG_BC_S_COND) - addr = &inet->rcv_saddr; - else - addr = &inet->daddr; - } + if (op->code == TCPDIAG_BC_S_COND) + addr = entry->saddr; + else + addr = entry->daddr; if (bitstring_match(addr, cond->addr, cond->prefix_len)) break; - if (sk->sk_family == AF_INET6 && + if (entry->family == AF_INET6 && cond->family == AF_INET) { if (addr[0] == 0 && addr[1] == 0 && addr[2] == htonl(0xffff) && @@ -466,16 +419,182 @@ return len == 0 ? 0 : -EINVAL; } +static int tcpdiag_dump_sock(struct sk_buff *skb, struct sock *sk, + struct netlink_callback *cb) +{ + struct tcpdiagreq *r = NLMSG_DATA(cb->nlh); + + if (cb->nlh->nlmsg_len > 4 + NLMSG_SPACE(sizeof(*r))) { + struct tcpdiag_entry entry; + struct rtattr *bc = (struct rtattr *)(r + 1); + struct inet_opt *inet = inet_sk(sk); + + entry.family = sk->sk_family; +#ifdef CONFIG_IP_TCPDIAG_IPV6 + if (entry.family == AF_INET6) { + struct ipv6_pinfo *np = inet6_sk(sk); + + entry.saddr = np->rcv_saddr.s6_addr32; + entry.daddr = np->daddr.s6_addr32; + } else +#endif + { + entry.saddr = &inet->rcv_saddr; + entry.daddr = &inet->daddr; + } + entry.sport = inet->num; + entry.dport = ntohs(inet->dport); + entry.userlocks = sk->sk_userlocks; + + if (!tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), &entry)) + return 0; + } + + return tcpdiag_fill(skb, sk, r->tcpdiag_ext, NETLINK_CB(cb->skb).pid, + cb->nlh->nlmsg_seq, NLM_F_MULTI); +} + +static int tcpdiag_fill_req(struct sk_buff *skb, struct sock *sk, + struct open_request *req, + u32 pid, u32 seq) +{ + struct inet_opt *inet = inet_sk(sk); + unsigned char *b = skb->tail; + struct tcpdiagmsg *r; + struct nlmsghdr *nlh; 
+ long tmo; + + nlh = NLMSG_PUT(skb, pid, seq, TCPDIAG_GETSOCK, sizeof(*r)); + nlh->nlmsg_flags = NLM_F_MULTI; + r = NLMSG_DATA(nlh); + + r->tcpdiag_family = sk->sk_family; + r->tcpdiag_state = TCP_SYN_RECV; + r->tcpdiag_timer = 1; + r->tcpdiag_retrans = req->retrans; + + r->id.tcpdiag_if = sk->sk_bound_dev_if; + r->id.tcpdiag_cookie[0] = (u32)(unsigned long)req; + r->id.tcpdiag_cookie[1] = (u32)(((unsigned long)req >> 31) >> 1); + + tmo = req->expires - jiffies; + if (tmo < 0) + tmo = 0; + + r->id.tcpdiag_sport = inet->sport; + r->id.tcpdiag_dport = req->rmt_port; + r->id.tcpdiag_src[0] = req->af.v4_req.loc_addr; + r->id.tcpdiag_dst[0] = req->af.v4_req.rmt_addr; + r->tcpdiag_expires = jiffies_to_msecs(tmo), + r->tcpdiag_rqueue = 0; + r->tcpdiag_wqueue = 0; + r->tcpdiag_uid = sock_i_uid(sk); + r->tcpdiag_inode = 0; +#ifdef CONFIG_IP_TCPDIAG_IPV6 + if (r->tcpdiag_family == AF_INET6) { + ipv6_addr_copy((struct in6_addr *)r->id.tcpdiag_src, + &req->af.v6_req.loc_addr); + ipv6_addr_copy((struct in6_addr *)r->id.tcpdiag_dst, + &req->af.v6_req.rmt_addr); + } +#endif + nlh->nlmsg_len = skb->tail - b; + + return skb->len; + +nlmsg_failure: + skb_trim(skb, b - skb->data); + return -1; +} + +static int tcpdiag_dump_reqs(struct sk_buff *skb, struct sock *sk, + struct netlink_callback *cb) +{ + struct tcpdiag_entry entry; + struct tcpdiagreq *r = NLMSG_DATA(cb->nlh); + struct tcp_opt *tp = tcp_sk(sk); + struct tcp_listen_opt *lopt; + struct rtattr *bc = NULL; + struct inet_opt *inet = inet_sk(sk); + int j, s_j; + int reqnum, s_reqnum; + int err = 0; + + s_j = cb->args[3]; + s_reqnum = cb->args[4]; + + if (s_j > 0) + s_j--; + + entry.family = sk->sk_family; + + read_lock_bh(&tp->syn_wait_lock); + + lopt = tp->listen_opt; + if (!lopt || !lopt->qlen) + goto out; + + if (cb->nlh->nlmsg_len > 4 + NLMSG_SPACE(sizeof(*r))) { + bc = (struct rtattr *)(r + 1); + entry.sport = inet->num; + entry.userlocks = sk->sk_userlocks; + } + + for (j = s_j; j < TCP_SYNQ_HSIZE; j++) { + struct 
open_request *req, *head = lopt->syn_table[j];
+
+		reqnum = 0;
+		for (req = head; req; reqnum++, req = req->dl_next) {
+			if (reqnum < s_reqnum)
+				continue;
+			if (r->id.tcpdiag_dport != req->rmt_port &&
+			    r->id.tcpdiag_dport)
+				continue;
+
+			if (bc) {
+				entry.saddr =
+#ifdef CONFIG_IP_TCPDIAG_IPV6
+					(entry.family == AF_INET6) ?
+					req->af.v6_req.loc_addr.s6_addr32 :
+#endif
+					&req->af.v4_req.loc_addr;
+				entry.daddr =
+#ifdef CONFIG_IP_TCPDIAG_IPV6
+					(entry.family == AF_INET6) ?
+					req->af.v6_req.rmt_addr.s6_addr32 :
+#endif
+					&req->af.v4_req.rmt_addr;
+				entry.dport = ntohs(req->rmt_port);
+
+				if (!tcpdiag_bc_run(RTA_DATA(bc),
+						    RTA_PAYLOAD(bc), &entry))
+					continue;
+			}
+
+			err = tcpdiag_fill_req(skb, sk, req,
+					       NETLINK_CB(cb->skb).pid,
+					       cb->nlh->nlmsg_seq);
+			if (err < 0) {
+				cb->args[3] = j + 1;
+				cb->args[4] = reqnum;
+				goto out;
+			}
+		}
+
+		s_reqnum = 0;
+	}
+
+out:
+	read_unlock_bh(&tp->syn_wait_lock);
+
+	return err;
+}
 
 static int tcpdiag_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	int i, num;
 	int s_i, s_num;
 	struct tcpdiagreq *r = NLMSG_DATA(cb->nlh);
-	struct rtattr *bc = NULL;
-
-	if (cb->nlh->nlmsg_len > 4+NLMSG_SPACE(sizeof(struct tcpdiagreq)))
-		bc = (struct rtattr*)(r+1);
 
 	s_i = cb->args[1];
 	s_num = num = cb->args[2];
@@ -488,31 +607,47 @@
 			struct sock *sk;
 			struct hlist_node *node;
 
-			if (i > s_i)
-				s_num = 0;
-
 			num = 0;
 			sk_for_each(sk, node, &tcp_listening_hash[i]) {
 				struct inet_opt *inet = inet_sk(sk);
-				if (num < s_num)
-					goto next_listen;
-				if (!(r->tcpdiag_states&TCPF_LISTEN) ||
-				    r->id.tcpdiag_dport)
-					goto next_listen;
+
+				if (num < s_num) {
+					num++;
+					continue;
+				}
+
 				if (r->id.tcpdiag_sport != inet->sport &&
 				    r->id.tcpdiag_sport)
 					goto next_listen;
-				if (bc && !tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), sk))
+
+				if (!(r->tcpdiag_states&TCPF_LISTEN) ||
+				    r->id.tcpdiag_dport ||
+				    cb->args[3] > 0)
+					goto syn_recv;
+
+				if (tcpdiag_dump_sock(skb, sk, cb) < 0) {
+					tcp_listen_unlock();
+					goto done;
+				}
+
+syn_recv:
+				if (!(r->tcpdiag_states&TCPF_SYN_RECV))
 					goto next_listen;
-				if (tcpdiag_fill(skb, sk, r->tcpdiag_ext,
-						 NETLINK_CB(cb->skb).pid,
-						 cb->nlh->nlmsg_seq) <= 0) {
+
+				if (tcpdiag_dump_reqs(skb, sk, cb) < 0) {
 					tcp_listen_unlock();
 					goto done;
 				}
+
 next_listen:
+				cb->args[3] = 0;
+				cb->args[4] = 0;
 				++num;
 			}
+
+			s_num = 0;
+			cb->args[3] = 0;
+			cb->args[4] = 0;
 		}
 		tcp_listen_unlock();
 skip_listen_ht:
@@ -546,11 +681,7 @@
 				goto next_normal;
 			if (r->id.tcpdiag_dport != inet->dport && r->id.tcpdiag_dport)
 				goto next_normal;
-			if (bc && !tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), sk))
-				goto next_normal;
-			if (tcpdiag_fill(skb, sk, r->tcpdiag_ext,
-					 NETLINK_CB(cb->skb).pid,
-					 cb->nlh->nlmsg_seq) <= 0) {
+			if (tcpdiag_dump_sock(skb, sk, cb) < 0) {
 				read_unlock_bh(&head->lock);
 				goto done;
 			}
@@ -571,11 +702,7 @@
 				if (r->id.tcpdiag_dport != inet->dport &&
 				    r->id.tcpdiag_dport)
 					goto next_dying;
-				if (bc && !tcpdiag_bc_run(RTA_DATA(bc), RTA_PAYLOAD(bc), sk))
-					goto next_dying;
-				if (tcpdiag_fill(skb, sk, r->tcpdiag_ext,
-						 NETLINK_CB(cb->skb).pid,
-						 cb->nlh->nlmsg_seq) <= 0) {
+				if (tcpdiag_dump_sock(skb, sk, cb) < 0) {
 					read_unlock_bh(&head->lock);
 					goto done;
 				}
@@ -657,9 +784,19 @@
 	}
 }
 
-void __init tcpdiag_init(void)
+static int __init tcpdiag_init(void)
 {
 	tcpnl = netlink_kernel_create(NETLINK_TCPDIAG, tcpdiag_rcv);
 	if (tcpnl == NULL)
-		panic("tcpdiag_init: Cannot create netlink socket.");
+		return -ENOMEM;
+	return 0;
 }
+
+static void __exit tcpdiag_exit(void)
+{
+	sock_release(tcpnl->sk_socket);
+}
+
+module_init(tcpdiag_init);
+module_exit(tcpdiag_exit);
+MODULE_LICENSE("GPL");
diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c	2004-12-24 13:37:04 -08:00
+++ b/net/ipv4/tcp_input.c	2004-12-24 13:37:04 -08:00
@@ -2369,25 +2369,19 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
 	struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
-	__u32 mss = tcp_skb_mss(skb);
-	__u32 snd_una = tp->snd_una;
-	__u32 orig_seq, seq;
-	__u32 packets_acked = 0;
+	__u32 seq = tp->snd_una;
+	__u32 packets_acked;
 	int acked = 0;
 
 	/* If we get here, the whole TSO packet has not been
 	 * acked.
 	 */
-	BUG_ON(!after(scb->end_seq, snd_una));
+	BUG_ON(!after(scb->end_seq, seq));
 
-	seq = orig_seq = scb->seq;
-	while (!after(seq + mss, snd_una)) {
-		packets_acked++;
-		seq += mss;
-	}
-
-	if (tcp_trim_head(sk, skb, (seq - orig_seq)))
+	packets_acked = tcp_skb_pcount(skb);
+	if (tcp_trim_head(sk, skb, seq - scb->seq))
 		return 0;
+	packets_acked -= tcp_skb_pcount(skb);
 
 	if (packets_acked) {
 		__u8 sacked = scb->sacked;
@@ -3034,8 +3028,8 @@
 					tp->snd_wscale = *(__u8 *)ptr;
 					if(tp->snd_wscale > 14) {
 						if(net_ratelimit())
-							printk("tcp_parse_options: Illegal window "
-							       "scaling value %d >14 received.",
+							printk(KERN_INFO "tcp_parse_options: Illegal window "
+							       "scaling value %d >14 received.\n",
 							       tp->snd_wscale);
 						tp->snd_wscale = 14;
 					}
@@ -4963,7 +4957,6 @@
 
 EXPORT_SYMBOL(sysctl_tcp_ecn);
 EXPORT_SYMBOL(sysctl_tcp_reordering);
-EXPORT_SYMBOL(tcp_cwnd_application_limited);
 EXPORT_SYMBOL(tcp_parse_options);
 EXPORT_SYMBOL(tcp_rcv_established);
 EXPORT_SYMBOL(tcp_rcv_state_process);
diff -Nru a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
--- a/net/ipv4/tcp_ipv4.c	2004-12-24 13:36:34 -08:00
+++ b/net/ipv4/tcp_ipv4.c	2004-12-24 13:36:34 -08:00
@@ -448,8 +448,8 @@
 }
 
 /* Optimize the common listener case. */
-inline struct sock *tcp_v4_lookup_listener(u32 daddr, unsigned short hnum,
-					   int dif)
+static inline struct sock *tcp_v4_lookup_listener(u32 daddr,
+		unsigned short hnum, int dif)
 {
 	struct sock *sk = NULL;
 	struct hlist_head *head;
@@ -535,6 +535,8 @@
 	return sk;
 }
 
+EXPORT_SYMBOL_GPL(tcp_v4_lookup);
+
 static inline __u32 tcp_v4_init_sequence(struct sock *sk, struct sk_buff *skb)
 {
 	return secure_tcp_sequence_number(skb->nh.iph->daddr,
@@ -2596,6 +2598,7 @@
 
 struct proto tcp_prot = {
 	.name			= "TCP",
+	.owner			= THIS_MODULE,
 	.close			= tcp_close,
 	.connect		= tcp_v4_connect,
 	.disconnect		= tcp_disconnect,
@@ -2653,7 +2656,6 @@
 EXPORT_SYMBOL(tcp_v4_conn_request);
 EXPORT_SYMBOL(tcp_v4_connect);
 EXPORT_SYMBOL(tcp_v4_do_rcv);
-EXPORT_SYMBOL(tcp_v4_lookup_listener);
 EXPORT_SYMBOL(tcp_v4_rebuild_header);
 EXPORT_SYMBOL(tcp_v4_remember_stamp);
 EXPORT_SYMBOL(tcp_v4_send_check);
@@ -2663,8 +2665,7 @@
 EXPORT_SYMBOL(tcp_proc_register);
 EXPORT_SYMBOL(tcp_proc_unregister);
 #endif
-#ifdef CONFIG_SYSCTL
 EXPORT_SYMBOL(sysctl_local_port_range);
 EXPORT_SYMBOL(sysctl_max_syn_backlog);
 EXPORT_SYMBOL(sysctl_tcp_low_latency);
-#endif
+
diff -Nru a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
--- a/net/ipv4/tcp_minisocks.c	2004-12-24 13:37:04 -08:00
+++ b/net/ipv4/tcp_minisocks.c	2004-12-24 13:37:04 -08:00
@@ -706,7 +706,7 @@
 		sock_lock_init(newsk);
 		bh_lock_sock(newsk);
 
-		newsk->sk_dst_lock = RW_LOCK_UNLOCKED;
+		rwlock_init(&newsk->sk_dst_lock);
 		atomic_set(&newsk->sk_rmem_alloc, 0);
 		skb_queue_head_init(&newsk->sk_receive_queue);
 		atomic_set(&newsk->sk_wmem_alloc, 0);
@@ -719,7 +719,7 @@
 		newsk->sk_userlocks = sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
 		newsk->sk_backlog.head = newsk->sk_backlog.tail = NULL;
 		newsk->sk_send_head = NULL;
-		newsk->sk_callback_lock = RW_LOCK_UNLOCKED;
+		rwlock_init(&newsk->sk_callback_lock);
 		skb_queue_head_init(&newsk->sk_error_queue);
 		newsk->sk_write_space = sk_stream_write_space;
 
@@ -1075,7 +1075,3 @@
 EXPORT_SYMBOL(tcp_create_openreq_child);
 EXPORT_SYMBOL(tcp_timewait_state_process);
 EXPORT_SYMBOL(tcp_tw_deschedule);
-
-#ifdef CONFIG_SYSCTL
-EXPORT_SYMBOL(sysctl_tcp_tw_recycle);
-#endif
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2004-12-24 13:37:01 -08:00
+++ b/net/ipv4/tcp_output.c	2004-12-24 13:37:01 -08:00
@@ -455,9 +455,13 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
 	struct sk_buff *buff;
-	int nsize = skb->len - len;
+	int nsize;
 	u16 flags;
 
+	nsize = skb_headlen(skb) - len;
+	if (nsize < 0)
+		nsize = 0;
+
 	if (skb_cloned(skb) &&
 	    skb_is_nonlinear(skb) &&
 	    pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
@@ -562,8 +566,6 @@
 
 int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 {
-	struct tcp_opt *tp = tcp_sk(sk);
-
 	if (skb_cloned(skb) &&
 	    pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
@@ -586,7 +588,8 @@
 	/* Any change of skb->len requires recalculation of tso
 	 * factor and mss.
 	 */
-	tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
+	if (tcp_skb_pcount(skb) > 1)
+		tcp_set_skb_tso_segs(skb, tcp_skb_mss(skb));
 
 	return 0;
 }
@@ -1102,6 +1105,8 @@
 		/* Update global TCP statistics. */
 		TCP_INC_STATS(TCP_MIB_RETRANSSEGS);
 
+		tp->total_retrans++;
+
 #if FASTRETRANS_DEBUG > 0
 		if (TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_RETRANS) {
 			if (net_ratelimit())
@@ -1715,12 +1720,7 @@
 	}
 }
 
-EXPORT_SYMBOL(tcp_acceptable_seq);
 EXPORT_SYMBOL(tcp_connect);
-EXPORT_SYMBOL(tcp_connect_init);
 EXPORT_SYMBOL(tcp_make_synack);
-EXPORT_SYMBOL(tcp_send_synack);
 EXPORT_SYMBOL(tcp_simple_retransmit);
 EXPORT_SYMBOL(tcp_sync_mss);
-EXPORT_SYMBOL(tcp_write_wakeup);
-EXPORT_SYMBOL(tcp_write_xmit);
diff -Nru a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
--- a/net/ipv4/tcp_timer.c	2004-12-24 13:37:19 -08:00
+++ b/net/ipv4/tcp_timer.c	2004-12-24 13:37:19 -08:00
@@ -36,7 +36,9 @@
 static void tcp_delack_timer(unsigned long);
 static void tcp_keepalive_timer (unsigned long data);
 
-const char timer_bug_msg[] = KERN_DEBUG "tcpbug: unknown timer value\n";
+#ifdef TCP_DEBUG
+const char tcp_timer_bug_msg[] = KERN_DEBUG "tcpbug: unknown timer value\n";
+#endif
 
 /*
  * Using different timers for retransmit, delayed acks and probes
@@ -651,3 +653,6 @@
 EXPORT_SYMBOL(tcp_delete_keepalive_timer);
 EXPORT_SYMBOL(tcp_init_xmit_timers);
 EXPORT_SYMBOL(tcp_reset_keepalive_timer);
+#ifdef TCP_DEBUG
+EXPORT_SYMBOL(tcp_timer_bug_msg);
+#endif
diff -Nru a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
--- a/net/ipv6/tcp_ipv6.c	2004-12-24 13:36:56 -08:00
+++ b/net/ipv6/tcp_ipv6.c	2004-12-24 13:36:56 -08:00
@@ -262,7 +262,7 @@
 
 			score = 1;
 			if (!ipv6_addr_any(&np->rcv_saddr)) {
-				if (ipv6_addr_cmp(&np->rcv_saddr, daddr))
+				if (!ipv6_addr_equal(&np->rcv_saddr, daddr))
 					continue;
 				score++;
 			}
@@ -321,8 +321,8 @@
 
 		if(*((__u32 *)&(tw->tw_dport)) == ports &&
 		   sk->sk_family == PF_INET6) {
-			if(!ipv6_addr_cmp(&tw->tw_v6_daddr, saddr) &&
-			   !ipv6_addr_cmp(&tw->tw_v6_rcv_saddr, daddr) &&
+			if(ipv6_addr_equal(&tw->tw_v6_daddr, saddr) &&
+			   ipv6_addr_equal(&tw->tw_v6_rcv_saddr, daddr) &&
 			   (!sk->sk_bound_dev_if || sk->sk_bound_dev_if == dif))
 				goto hit;
 		}
@@ -364,6 +364,8 @@
 
 	return sk;
 }
 
+EXPORT_SYMBOL_GPL(tcp_v6_lookup);
+
 /*
  * Open request hash tables.
@@ -404,8 +406,8 @@
 	     prev = &req->dl_next) {
 		if (req->rmt_port == rport &&
 		    req->class->family == AF_INET6 &&
-		    !ipv6_addr_cmp(&req->af.v6_req.rmt_addr, raddr) &&
-		    !ipv6_addr_cmp(&req->af.v6_req.loc_addr, laddr) &&
+		    ipv6_addr_equal(&req->af.v6_req.rmt_addr, raddr) &&
+		    ipv6_addr_equal(&req->af.v6_req.loc_addr, laddr) &&
 		    (!req->af.v6_req.iif || req->af.v6_req.iif == iif)) {
 			BUG_TRAP(req->sk == NULL);
 			*prevp = prev;
@@ -461,8 +463,8 @@
 
 		if(*((__u32 *)&(tw->tw_dport)) == ports &&
 		   sk2->sk_family == PF_INET6 &&
-		   !ipv6_addr_cmp(&tw->tw_v6_daddr, saddr) &&
-		   !ipv6_addr_cmp(&tw->tw_v6_rcv_saddr, daddr) &&
+		   ipv6_addr_equal(&tw->tw_v6_daddr, saddr) &&
+		   ipv6_addr_equal(&tw->tw_v6_rcv_saddr, daddr) &&
 		   sk2->sk_bound_dev_if == sk->sk_bound_dev_if) {
 			struct tcp_opt *tp = tcp_sk(sk);
 
@@ -608,7 +610,7 @@
 	}
 
 	if (tp->ts_recent_stamp &&
-	    ipv6_addr_cmp(&np->daddr, &usin->sin6_addr)) {
+	    !ipv6_addr_equal(&np->daddr, &usin->sin6_addr)) {
 		tp->ts_recent = 0;
 		tp->ts_recent_stamp = 0;
 		tp->write_seq = 0;
@@ -1802,6 +1804,7 @@
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct flowi fl;
 	struct dst_entry *dst;
+	struct in6_addr *final_p = NULL, final;
 
 	memset(&fl, 0, sizeof(fl));
 	fl.proto = IPPROTO_TCP;
@@ -1815,7 +1818,9 @@
 
 	if (np->opt && np->opt->srcrt) {
 		struct rt0_hdr *rt0 = (struct rt0_hdr *) np->opt->srcrt;
+		ipv6_addr_copy(&final, &fl.fl6_dst);
 		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
+		final_p = &final;
 	}
 
 	dst = __sk_dst_check(sk, np->dst_cookie);
@@ -1828,6 +1833,9 @@
 			return err;
 		}
 
+		if (final_p)
+			ipv6_addr_copy(&fl.fl6_dst, final_p);
+
 		if ((err = xfrm_lookup(&dst, &fl, sk, 0)) < 0) {
 			sk->sk_route_caps = 0;
 			dst_release(dst);
@@ -2124,6 +2132,7 @@
 
 struct proto tcpv6_prot = {
 	.name			= "TCPv6",
+	.owner			= THIS_MODULE,
 	.close			= tcp_close,
 	.connect		= tcp_v6_connect,
 	.disconnect		= tcp_disconnect,
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-20 23:06 Hubert Tonneau
0 siblings, 0 replies; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-20 23:06 UTC (permalink / raw)
To: David S. Miller, Alexey Kuznetsov, Nivedita Singhvi
Cc: Stephen Hemminger, romieu, kuznet, niv, rick.jones2, netdev
I've noticed something very interesting:
when sending to a Mac OSX machine connected at gigabit speed instead of one
connected at 100 Mbps, there is no drastic slowdown when switching from
Linux 2.6.9 to 2.6.10.
> Any chance you could
> send me just the following from your boxes:
> (Before and after the transfer)
>
> - /proc/net/snmp
> - /proc/net/netstat
Here is the requested extra information:
2.6.10-ac10 before:
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 47336 0 0 0 0 0 47197 127721 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 2 0 0 0 0 0 0 0 2 0 0 0 0 417 0 417 0 0 0 0 0 0 0 0 0 0
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 40 209 0 2 7 46158 126953 156 0 243
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 332 417 0 336
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLoss TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnSyn TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures
TcpExt: 0 0 0 0 0 0 0 0 0 0 94 0 0 0 0 0 452 0 0 0 0 9499 215 241030 0 7583 377 16696 3330 123 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 123 0 0 7 0 0 0 0 0 0 0 0 0 90 0 0 2 0 0 0
2.6.10-ac10 after sending to the 100 Mbps connected Mac OSX:
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 70100 0 0 0 0 0 69901 214176 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 2 0 0 0 0 0 0 0 2 0 0 0 0 421 0 421 0 0 0 0 0 0 0 0 0 0
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 49 263 0 2 9 68728 213354 284 0 315
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 382 421 0 386
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLoss TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnSyn TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures
TcpExt: 0 0 0 0 0 0 0 0 0 0 105 0 0 0 0 0 804 0 0 0 0 12808 215 310763 0 11460 472 26236 5086 247 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 247 0 0 11 0 0 0 0 0 0 0 0 0 123 0 0 2 0 0 0
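One way to read the two snapshots above is to subtract the "before" Tcp: values from the "after" ones. The following is a minimal sketch in C and is not part of the original message: `snmp_field` and `tcp_counter_delta` are illustrative names, and the three strings are copied from the Tcp: lines above. The counters are system-wide, so the deltas include any other traffic the server handled during the interval, not only the test transfer.

```c
#include <stdlib.h>
#include <string.h>

/* Copied from the "Tcp:" lines of the two /proc/net/snmp snapshots above. */
static const char tcp_header[] =
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
    "InErrs OutRsts";
static const char tcp_before[] =
    "Tcp: 1 200 120000 -1 40 209 0 2 7 46158 126953 156 0 243";
static const char tcp_after[] =
    "Tcp: 1 200 120000 -1 49 263 0 2 9 68728 213354 284 0 315";

/*
 * Walk the header line and the value line in parallel, one
 * space-separated token at a time. When the header token equals
 * `name`, store the matching value in *out and return 0.
 * Return -1 if the name is not present.
 */
int snmp_field(const char *header, const char *values,
               const char *name, long *out)
{
    size_t name_len = strlen(name);

    while (*header != '\0' && *values != '\0') {
        size_t hlen = strcspn(header, " ");
        size_t vlen = strcspn(values, " ");

        if (hlen == name_len && strncmp(header, name, hlen) == 0) {
            *out = strtol(values, NULL, 10);
            return 0;
        }
        header += hlen;
        header += strspn(header, " ");
        values += vlen;
        values += strspn(values, " ");
    }
    return -1;
}

/*
 * Difference (after - before) for one Tcp: counter.
 * Returns -1 if the counter name is unknown; the counters queried
 * here only grow, so a real delta is never negative.
 */
long tcp_counter_delta(const char *name)
{
    long before = 0, after = 0;

    if (snmp_field(tcp_header, tcp_before, name, &before) != 0 ||
        snmp_field(tcp_header, tcp_after, name, &after) != 0)
        return -1;
    return after - before;
}
```

Calling `tcp_counter_delta("OutSegs")` and `tcp_counter_delta("RetransSegs")` on these snapshots gives 86401 segments sent and 128 retransmitted over the interval.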
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-16 20:00 Hubert Tonneau
0 siblings, 0 replies; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-16 20:00 UTC (permalink / raw)
To: David S. Miller, Alexey Kuznetsov
Cc: shemminger, romieu, kuznet, niv, rick.jones2, netdev
David S. Miller wrote:
>
> Hubert, do you have netfilter enabled in the 2.6.10 kernel you are running?
>
> I'm asking because the TCP changes in 2.6.10 are pretty benign
> (attached for the curious who want to review along), whereas
> netfilter had major updates particularly in the TCP connection
> tracking code.
There is no netfilter on this server.
> I also reviewed 2.6.10-ac11 for anything interesting wrt. TCP and the
> only thing in there is the tcp_retrans_try_collapse() missing check
> to avoid collapsing TSO segments.
I'm using 2.6.10-ac11 for security reasons. I could use 2.6.10-as1 as well.
As far as I know, they all behave exactly the same from the TCP point of view.
The difference is definitely between stock 2.6.9 and stock 2.6.10.
If it helps, you can send me a patch reverting the TCP changes between 2.6.10
and 2.6.9, and I'll give it a spin, just to be sure that the problem is truly
related to the TCP code and not a side effect of other changes.
Anyway, here is the set of settings I'm using to build the kernel,
and no module is loaded while the test is running:
CONFIG_2GB: y
CONFIG_ACPI: y
CONFIG_ACPI_AC: m
CONFIG_ACPI_BATTERY: m
CONFIG_ACPI_BUTTON: m
CONFIG_ACPI_FAN: m
CONFIG_ACPI_PROCESSOR: y
CONFIG_ACPI_SLEEP: y
CONFIG_ACPI_THERMAL: y
CONFIG_ACPI_VIDEO: m
CONFIG_APM_RTC_IS_GMT: y
CONFIG_ATALK: m
CONFIG_AUTODETECT_RAID: y
CONFIG_AUTOFS_FS: m
CONFIG_BINFMT_ELF: y
CONFIG_BINFMT_MISC: y
CONFIG_BLK_DEV_CMD640: y
CONFIG_BLK_DEV_FD: m
CONFIG_BLK_DEV_GENERIC: y
CONFIG_BLK_DEV_IDE: y
CONFIG_BLK_DEV_IDECD: m
CONFIG_BLK_DEV_IDEDISK: y
CONFIG_BLK_DEV_IDEDMA: y
CONFIG_BLK_DEV_IDEDMA_PCI: y
CONFIG_BLK_DEV_IDEPCI: y
CONFIG_BLK_DEV_IDESCSI: m
CONFIG_BLK_DEV_LOOP: m
CONFIG_BLK_DEV_MD: y
CONFIG_BLK_DEV_NBD: m
CONFIG_BLK_DEV_PIIX: y
CONFIG_BLK_DEV_RAM: m
CONFIG_BLK_DEV_RZ1000: y
CONFIG_BLK_DEV_SD: y
CONFIG_BLK_DEV_SR: m
CONFIG_BLK_DEV_TRIRON: y
CONFIG_BSD_PROCESS_ACCT: y
CONFIG_CHR_DEV_SG: m
CONFIG_CHR_DEV_ST: m
CONFIG_CODA_FS: m
CONFIG_E1000: y
CONFIG_EXPERIMENTAL: y
CONFIG_EXT2_FS: y
CONFIG_EXT3_FS: y
CONFIG_EXT3_FS_XATTR: y
CONFIG_FAT_FS: m
CONFIG_FILTER: y
CONFIG_FUSION: y
CONFIG_FUSION_CTL: m
CONFIG_FUSION_ISENSE: m
CONFIG_FUSION_LAN: m
CONFIG_HFSPLUS_FS: m
CONFIG_HFS_FS: m
CONFIG_HIGHMEM: y
CONFIG_HIGHMEM4G: y
CONFIG_HPET_TIMER: y
CONFIG_HPFS_FS: m
CONFIG_IDE: y
CONFIG_IDEDMA_AUTO: y
CONFIG_IDEDMA_ONLYDISK: y
CONFIG_IDEDMA_PCI_AUTO: y
CONFIG_IDEPCI_SHARE_IRQ: y
CONFIG_IDE_GENERIC: y
CONFIG_INET: y
CONFIG_INPUT: y
CONFIG_INPUT_KEYBDEV: m
CONFIG_INPUT_KEYBOARD: y
CONFIG_INPUT_MOUSE: y
CONFIG_INPUT_MOUSEDEV: m
CONFIG_IP_ALIAS: y
CONFIG_IP_ROUTE_VERBOSE: y
CONFIG_IRQBALANCE: y
CONFIG_ISO9660_FS: m
CONFIG_KCORE_ELF: y
CONFIG_KEYBOARD_ATKBD: y
CONFIG_LEGACY_PTYS: y
CONFIG_LOCKD: m
CONFIG_M386: n
CONFIG_M486: n
CONFIG_M586: n
CONFIG_M686: n
CONFIG_MAC_PARTITION: y
CONFIG_MD: y
CONFIG_MD_BOOT: y
CONFIG_MD_LINEAR: y
CONFIG_MD_LVM: n
CONFIG_MD_MIRRORING: y
CONFIG_MD_RAID0: y
CONFIG_MD_RAID1: y
CONFIG_MD_RAID5: y
CONFIG_MD_STRIPED: y
CONFIG_MD_TRANSLUCENT: n
CONFIG_MODULES: y
CONFIG_MODULE_UNLOAD: y
CONFIG_MOUSE: m
CONFIG_MOUSE_PS2: y
CONFIG_MPENTIUM4: y
CONFIG_MSDOS_FS: m
CONFIG_MTRR: y
CONFIG_NET: y
CONFIG_NETDEVICES: y
CONFIG_NET_ETHERNET: y
CONFIG_NFSD: m
CONFIG_NFS_FS: m
CONFIG_NLS: y
CONFIG_NLS_CODEPAGE_437: m
CONFIG_NLS_CODEPAGE_850: m
CONFIG_NLS_ISO8859_1: m
CONFIG_NLS_UTF8: m
CONFIG_NTFS_FS: m
CONFIG_OOM_KILLER: y
CONFIG_PACKET: y
CONFIG_PARPORT: m
CONFIG_PARPORT_PC: m
CONFIG_PCI: y
CONFIG_PCI_BIOS: y
CONFIG_PCI_GOANY: y
CONFIG_PCI_OLD_PROC: y
CONFIG_PCI_QUIRKS: y
CONFIG_PIIX_TUNING: y
CONFIG_PM: y
CONFIG_PPP: m
CONFIG_PPPOE: m
CONFIG_PPP_ASYNC: m
CONFIG_PPP_BSDCOMP: m
CONFIG_PPP_DEFLATE: m
CONFIG_PPP_FILTER: y
CONFIG_PPP_SYNC_TTY: m
CONFIG_PREEMPT: y
CONFIG_PRINTER: m
CONFIG_PRINTER_READBACK: y
CONFIG_PROC_FS: y
CONFIG_PSMOUSE: y
CONFIG_QNX4FS_FS: m
CONFIG_REGPARM: y
CONFIG_RTC: y
CONFIG_SCSI: y
CONFIG_SCSI_PROC_FS: y
CONFIG_SERIAL: m
CONFIG_SERIAL_8250: m
CONFIG_SHAPER: m
CONFIG_SLIP: m
CONFIG_SMB_FS: m
CONFIG_SMP: y
CONFIG_SOUND: m
CONFIG_SUNRPC: m
CONFIG_SYSCTL: y
CONFIG_SYSVIPC: y
CONFIG_UFS_FS: m
CONFIG_UMSDOS_FS: m
CONFIG_UNIX: y
CONFIG_USB: m
CONFIG_USB_ACM: m
CONFIG_USB_AUDIO: m
CONFIG_USB_CDCETHER: m
CONFIG_USB_DEVICEFS: y
CONFIG_USB_EHCI_HCD: m
CONFIG_USB_HID: m
CONFIG_USB_HIDINPUT: y
CONFIG_USB_KBD: m
CONFIG_USB_MOUSE: m
CONFIG_USB_OHCI: m
CONFIG_USB_OHCI_HCD: m
CONFIG_USB_PRINTER: m
CONFIG_USB_SERIAL: m
CONFIG_USB_STORAGE: m
CONFIG_USB_UHCI: m
CONFIG_USB_UHCI_ALT: m
CONFIG_USB_UHCI_HCD: m
CONFIG_VFAT_FS: m
CONFIG_VGA_CONSOLE: y
CONFIG_VT: y
CONFIG_VT_CONSOLE: y
CONFIG_X86_MCE: y
CONFIG_X86_UP_APIC: y
CONFIG_X86_UP_IOAPIC: y
Since we are at it, here are the hardware components of the box:
8086 Intel Corporation 254C E7501 0 Host Controller
8086 Intel Corporation 2543 E7500/E7501 0 HI_B Virtual PCI-to-PCI Bridge
8086 Intel Corporation 2545 E7500/E7501 0 HI_C Virtual PCI-to-PCI Bridge
8086 Intel Corporation 2547 E7500/E7501 0 HI_D Virtual PCI-to-PCI Bridge
8086 Intel Corporation 2482 82801CA/CAM 10 USB Controller
8086 Intel Corporation 244E 82801BA/CA/DB, 6300ESB 0 Hub Interface to PCI Bridge
8086 Intel Corporation 2480 82801CA 0 LPC Interface Bridge
8086 Intel Corporation 248B 82801CA 0 UltraATA/100 IDE Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1461 14611014 0 I/OxAPIC Interrupt Controller
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1460 82870P2 0 Hub Interface-to-PCI Bridge
8086 Intel Corporation 1026 82545GM 18 Gigabit Ethernet Controller
8086 Intel Corporation 100D 82544GC 1C Gigabit Ethernet Controller (LOM)
8086 Intel Corporation 0309 80303 0 I/O Processor PCI-to-PCI Bridge Unit
1000 LSI Logic 0030 LSI53C1020/1030 78 PCI-X to Ultra320 SCSI Controller
1000 LSI Logic 0030 LSI53C1020/1030 79 PCI-X to Ultra320 SCSI Controller
1002 ATI Technologies 4752 Rage XL PCI 0
And the interrupts (while running 2.6.9):
           CPU0       CPU1
  0:  159132374  132686719    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  8:          0          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 14:          1          0    IO-APIC-edge  ide0
 24:   22225220          0   IO-APIC-level  eth0
 28:          4  134406507   IO-APIC-level  eth1
120:     532730     578109   IO-APIC-level  ioc0
121:    1931739    1327672   IO-APIC-level  ioc1
NMI:          0          0
LOC:  291863458  291863528
ERR:          0
MIS:          0
/proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
  eth0:2512143307 20914618 0 0 0 0 0 0 1951489031 52933097 0 0 0 0 0 0
  eth1:943883086 75451745 0 0 0 0 0 0 201914508 171409895 0 0 0 0 0 0
    lo:2247204588 748445 0 0 0 0 0 0 2247204588 748445 0 0 0 0 0 0
/proc/net/route
Iface  Destination  Gateway   Flags  RefCnt  Use  Metric  Mask      MTU  Window  IRTT
eth0   207C29D5     00000000  0001   0       0    0       F0FFFFFF  0    0       0
eth1   00606B0A     00000000  0001   0       0    0       00FFFFFF  0    0       0
eth0   00000000     217C29D5  0003   0       0    0       00000000  0    0       0
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-13 10:52 Hubert Tonneau
2005-02-14 14:12 ` Alexey Kuznetsov
0 siblings, 1 reply; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-13 10:52 UTC (permalink / raw)
To: Alexey Kuznetsov, David S. Miller
Cc: Alexey Kuznetsov, rick.jones2, shemminger, romieu, netdev
Alexey Kuznetsov wrote:
>
> Exactly. That's why the next test should be with disabled TSO in 2.6.9.
> If too rare PSHs were a problem, it will show as the same disaster there.
After,
ethtool -K eth1 tso off
the result is unchanged on 2.6.9 (14 seconds for 105 MB).
After,
ethtool -K eth1 tso off
the result is also unchanged on 2.6.10-ac11 with no extra TCP patch (325 seconds).
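For scale, the two timings quoted above can be turned into throughput figures. This is a minimal sketch in C, assuming 1 MB means 10^6 bytes (the message does not say which definition is meant); `throughput_mbit` and `slowdown_factor` are illustrative helpers, not something from the thread.

```c
/*
 * Throughput in Mbit/s for `megabytes` transferred in `seconds`,
 * assuming 1 MB = 10^6 bytes and 1 Mbit = 10^6 bits.
 */
double throughput_mbit(double megabytes, double seconds)
{
    return megabytes * 8.0 / seconds;
}

/* How many times longer the slow transfer takes than the fast one. */
double slowdown_factor(double fast_seconds, double slow_seconds)
{
    return slow_seconds / fast_seconds;
}
```

Under that assumption, 105 MB in 14 seconds works out to 60 Mbit/s, which is plausible through the 100 Mbps switch, while 105 MB in 325 seconds is roughly 2.6 Mbit/s, about 23 times slower.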
Settings for eth1:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: umbg
Wake-on: g
Current message level: 0x00000007 (7)
Link detected: yes
PS:
Sorry for the long delay in running the tests. The reason is that
it's a production server, so I cannot run tests in the middle of the day.
It's also remote, so in order to switch kernels I have to upload the new one,
and then upload the old one again to switch back, and the best connection
I have these days is a 30 Kbps modem connection. It will improve on Monday,
since I'll have a 128 Kbps ADSL connection.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-13 10:52 Hubert Tonneau
@ 2005-02-14 14:12 ` Alexey Kuznetsov
0 siblings, 0 replies; 40+ messages in thread
From: Alexey Kuznetsov @ 2005-02-14 14:12 UTC (permalink / raw)
To: Hubert Tonneau
Cc: Alexey Kuznetsov, David S. Miller, rick.jones2, shemminger, romieu, netdev
Hello!
> ethtool -K eth1 tso off
> the result is unchanged on 2.6.9 (14 seconds for 105 MB).
>
> After,
> ethtool -K eth1 tso off
> the result is also unchanged on 2.6.10-ac11 with no extra TCP patch (325 seconds).
Well, it means the theory was wrong... tso is innocent.
To make a new theory we need a tcpdump of 2.6.10 with disabled tso.
> it's a production server,
I hope we can stay in its normal configuration now. TSO may be kept disabled.
Alexey
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
@ 2005-02-10 21:53 Hubert Tonneau
2005-02-10 22:36 ` Rick Jones
0 siblings, 1 reply; 40+ messages in thread
From: Hubert Tonneau @ 2005-02-10 21:53 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Francois Romieu, Alexey Kuznetsov, netdev
It does not seem to solve the problem:
. Linux 2.6.9 takes 15 seconds to copy 105 MB to the Mac OSX
. Linux 2.6.10 with the TCP patch still takes 325 seconds.
Stephen Hemminger wrote:
>
> Please try this patch, based on Alexey's suggestion:
>
> > That's one quick and simple idea: set PSH on each tso segment.
> > Seems, it is always good. Hardware will preserve it only on the last skb and
> > everyone will be happy.
>
> # This is a BitKeeper generated diff -Nru style patch.
> #
> # ChangeSet
> # 2005/02/09 11:00:57-08:00 shemminger@linux.site
> # Always set PUSH on TSO multi-segment frames
> # to workaround bugs in MacOSX
> #
> # net/ipv4/tcp_output.c
> # 2005/02/09 11:00:44-08:00 shemminger@linux.site +8 -0
> # Always set PUSH on TSO multi-segment frames
> # to workaround bugs in MacOSX
> #
> diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> --- a/net/ipv4/tcp_output.c 2005-02-09 11:01:12 -08:00
> +++ b/net/ipv4/tcp_output.c 2005-02-09 11:01:12 -08:00
> @@ -754,6 +754,14 @@
> break;
> }
>
> + /* Force push to be on for any large TSO frames
> + * to workaround problems with busted implementations
> + * like MacOSX that hold off delivery of data until
> + * push.
> + */
> + if (tcp_skb_pcount(skb) > 1)
> + TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
> +
> TCP_SKB_CB(skb)->when = tcp_time_stamp;
> if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
> break;
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-10 21:53 Hubert Tonneau
@ 2005-02-10 22:36 ` Rick Jones
2005-02-11 1:16 ` David S. Miller
0 siblings, 1 reply; 40+ messages in thread
From: Rick Jones @ 2005-02-10 22:36 UTC (permalink / raw)
To: Hubert Tonneau
Cc: Stephen Hemminger, Francois Romieu, Alexey Kuznetsov, netdev
Hubert Tonneau wrote:
> It does not seem to solve the problem:
> . Linux 2.6.9 takes 15 seconds to copy 105 MB to the Mac OSX
> . Linux 2.6.10 with the TCP patch still takes 325 seconds.
is there a packet trace somewhere?
rick jones
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-10 22:36 ` Rick Jones
@ 2005-02-11 1:16 ` David S. Miller
0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-11 1:16 UTC (permalink / raw)
To: Rick Jones; +Cc: hubert.tonneau, shemminger, romieu, kuznet, netdev
On Thu, 10 Feb 2005 14:36:40 -0800
Rick Jones <rick.jones2@hp.com> wrote:
> Hubert Tonneau wrote:
> > It does not seem to solve the problem:
> > . Linux 2.6.9 takes 15 seconds to copy 105 MB to the Mac OSX
> > . Linux 2.6.10 with the TCP patch still takes 325 seconds.
> >
>
> is there a packet trace somewhere?
I know what's wrong, no trace needed, Stephen's patch misses
tcp_push_one() and similar. He only added the PSH bit setting
to tcp_write_xmit().
Hubert, try this patch instead.
===== net/ipv4/tcp_output.c 1.77 vs edited =====
--- 1.77/net/ipv4/tcp_output.c	2005-01-18 12:23:36 -08:00
+++ edited/net/ipv4/tcp_output.c	2005-02-10 16:42:42 -08:00
@@ -408,6 +408,16 @@
 		sk->sk_send_head = skb;
 }
 
+static inline void tcp_tso_set_push(struct sk_buff *skb)
+{
+	/* Force push to be on for any TSO frames to workaround
+	 * problems with busted implementations like Mac OS-X that
+	 * hold off socket reveive wakeups until push is seen.
+	 */
+	if (tcp_skb_pcount(skb) > 1)
+		TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
+}
+
 /* Send _single_ skb sitting at the send head. This function requires
  * true push pending frames to setup probe timer etc.
  */
@@ -419,6 +429,7 @@
 	if (tcp_snd_test(tp, skb, cur_mss, TCP_NAGLE_PUSH)) {
 		/* Send it out now. */
 		TCP_SKB_CB(skb)->when = tcp_time_stamp;
+		tcp_tso_set_push(skb);
 		if (!tcp_transmit_skb(sk, skb_clone(skb, sk->sk_allocation))) {
 			sk->sk_send_head = NULL;
 			tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
@@ -755,6 +766,7 @@
 		}
 
 		TCP_SKB_CB(skb)->when = tcp_time_stamp;
+		tcp_tso_set_push(skb);
 		if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
 			break;
 
@@ -1096,6 +1108,7 @@
 	 * is still in somebody's hands, else make a clone.
 	 */
 	TCP_SKB_CB(skb)->when = tcp_time_stamp;
+	tcp_tso_set_push(skb);
 
 	err = tcp_transmit_skb(sk, (skb_cloned(skb) ?
 				    pskb_copy(skb, GFP_ATOMIC):
@@ -1668,6 +1681,7 @@
 		TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
 
 	TCP_SKB_CB(skb)->when = tcp_time_stamp;
+	tcp_tso_set_push(skb);
 	err = tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC));
 	if (!err) {
 		update_send_head(sk, tp, skb);
^ permalink raw reply [flat|nested] 40+ messages in thread
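The rule the patch above applies on every transmit path can be modelled outside the kernel. The following is a simplified userspace sketch, not kernel code: `struct model_skb`, `MODEL_FLAG_PSH` and `model_tso_set_push` are hypothetical stand-ins for `struct sk_buff`, `TCPCB_FLAG_PSH` and `tcp_tso_set_push()`, and the only fact assumed beyond the patch is that PSH is bit 0x08 of the TCP header flags.

```c
#include <stdint.h>

/* Hypothetical stand-in for TCPCB_FLAG_PSH; 0x08 is the PSH bit
 * in the TCP header flags byte. */
#define MODEL_FLAG_PSH 0x08u

/* Hypothetical, heavily reduced model of a queued send buffer. */
struct model_skb {
    unsigned int pcount; /* how many MSS-sized segments this buffer covers */
    uint8_t flags;       /* TCP flags that will go out with the buffer */
};

/*
 * Mirror of the check in the patch: a buffer that covers more than
 * one segment (a TSO frame) always gets PSH set, while a
 * single-segment buffer keeps whatever flags it already had.
 */
void model_tso_set_push(struct model_skb *skb)
{
    if (skb->pcount > 1)
        skb->flags |= MODEL_FLAG_PSH;
}
```

In this model a buffer with `pcount` of 1 is left untouched, which matches the patch leaving ordinary single-segment sends to the existing PSH logic.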
[parent not found: <050QTJA12@server5.heliogroup.fr>]
* Re: 2.6.10 TCP troubles -- suggested patch
[not found] <050QTJA12@server5.heliogroup.fr>
@ 2005-02-09 18:59 ` Stephen Hemminger
2005-02-09 20:25 ` David S. Miller
0 siblings, 1 reply; 40+ messages in thread
From: Stephen Hemminger @ 2005-02-09 18:59 UTC (permalink / raw)
To: Hubert Tonneau; +Cc: Francois Romieu, Alexey Kuznetsov, netdev
Please try this patch, based on Alexey's suggestion:
> That's one quick and simple idea: set PSH on each tso segment.
> Seems, it is always good. Hardware will preserve it only on the last skb and
> everyone will be happy.
# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
# 2005/02/09 11:00:57-08:00 shemminger@linux.site
# Always set PUSH on TSO multi-segment frames
# to workaround bugs in MacOSX
#
# net/ipv4/tcp_output.c
# 2005/02/09 11:00:44-08:00 shemminger@linux.site +8 -0
# Always set PUSH on TSO multi-segment frames
# to workaround bugs in MacOSX
#
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2005-02-09 11:01:12 -08:00
+++ b/net/ipv4/tcp_output.c	2005-02-09 11:01:12 -08:00
@@ -754,6 +754,14 @@
 				break;
 		}
 
+		/* Force push to be on for any large TSO frames
+		 * to workaround problems with busted implementations
+		 * like MacOSX that hold off delivery of data until
+		 * push.
+		 */
+		if (tcp_skb_pcount(skb) > 1)
+			TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
+
 		TCP_SKB_CB(skb)->when = tcp_time_stamp;
 		if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
 			break;
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: 2.6.10 TCP troubles -- suggested patch
2005-02-09 18:59 ` Stephen Hemminger
@ 2005-02-09 20:25 ` David S. Miller
0 siblings, 0 replies; 40+ messages in thread
From: David S. Miller @ 2005-02-09 20:25 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: hubert.tonneau, romieu, kuznet, netdev
On Wed, 9 Feb 2005 10:59:09 -0800
Stephen Hemminger <shemminger@osdl.org> wrote:
> Please try this patch, based on Alexey's suggestion:
-EBADINDENTATION :-)
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2005-02-20 23:06 UTC | newest]
Thread overview: 40+ messages
2005-02-11 21:55 2.6.10 TCP troubles -- suggested patch Hubert Tonneau
2005-02-11 22:54 ` Rick Jones
2005-02-11 23:09 ` Nivedita Singhvi
2005-02-11 23:40 ` Rick Jones
2005-02-12 1:08 ` David S. Miller
2005-02-12 1:09 ` David S. Miller
2005-02-12 14:31 ` Alexey Kuznetsov
2005-02-12 19:28 ` David S. Miller
2005-02-12 19:44 ` Leonid Grossman
2005-02-12 19:52 ` Alexey Kuznetsov
2005-02-15 23:25 ` David S. Miller
2005-02-12 20:19 ` rick jones
2005-02-12 20:28 ` David S. Miller
2005-02-12 20:56 ` Alexey Kuznetsov
2005-02-12 21:27 ` Nivedita Singhvi
2005-02-12 21:43 ` rick jones
2005-02-12 22:00 ` Alexey Kuznetsov
2005-02-13 1:29 ` rick jones
2005-02-11 23:04 ` Stephen Hemminger
2005-02-12 1:07 ` David S. Miller
2005-02-12 12:11 ` Andi Kleen
2005-02-12 19:23 ` David S. Miller
2005-02-12 21:30 ` Andi Kleen
2005-02-12 14:16 ` Alexey Kuznetsov
2005-02-12 19:41 ` David S. Miller
2005-02-12 20:03 ` Alexey Kuznetsov
2005-02-15 23:26 ` David S. Miller
2005-02-15 23:42 ` Rick Jones
2005-02-15 23:23 ` David S. Miller
2005-02-16 9:13 ` Alexey Kuznetsov
2005-02-16 17:50 ` David S. Miller
-- strict thread matches above, loose matches on Subject: below --
2005-02-20 23:06 Hubert Tonneau
2005-02-16 20:00 Hubert Tonneau
2005-02-13 10:52 Hubert Tonneau
2005-02-14 14:12 ` Alexey Kuznetsov
2005-02-10 21:53 Hubert Tonneau
2005-02-10 22:36 ` Rick Jones
2005-02-11 1:16 ` David S. Miller
[not found] <050QTJA12@server5.heliogroup.fr>
2005-02-09 18:59 ` Stephen Hemminger
2005-02-09 20:25 ` David S. Miller