* A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-04  4:54 UTC
To: netdev

Hi,

Could anybody help me out with Linux TCP SACK? Thanks in advance.

I run iperf to send traffic from a sender to a receiver, and add packet
reordering in both the forward and reverse directions. I found that when I
turn the SACK/DSACK option off, the throughput is better than with
SACK/DSACK on. How can this happen? Has anybody encountered this phenomenon
before?

thanks,

wenji
* Re: A Linux TCP SACK Question
From: John Heffner @ 2008-04-04 16:27 UTC
To: Wenji Wu; +Cc: netdev

Unless you're sending very fast, where the computational overhead of
processing SACK blocks is slowing you down, this is not expected behavior.
Do you have more detail?  What is the window size, and how much reordering?
Full binary tcpdumps are very useful in diagnosing this type of problem.

  -John

On Thu, Apr 3, 2008 at 9:54 PM, Wenji Wu <wenji@fnal.gov> wrote:
> Hi, could anybody help me out with Linux TCP SACK? Thanks in advance.
>
> I run iperf to send traffic from a sender to a receiver, and add packet
> reordering in both the forward and reverse directions. I found that when
> I turn the SACK/DSACK option off, the throughput is better than with
> SACK/DSACK on. How can this happen? Has anybody encountered this
> phenomenon before?
* RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-04 17:49 UTC
To: 'John Heffner'; +Cc: netdev

Hi, John,

Thanks. I just sat down with Richard Clarson and reproduced the phenomenon.

The experiment setup is:

Sender --- Router --- Receiver

Iperf is sending from the sender to the receiver. In between there is an
emulated router which runs netem. The emulated router has two interfaces,
both with netem configured: one interface emulates the forward path, the
other the reverse path. Both netem interfaces are configured with 1.5ms
delay and 0.15ms variance. No packet drops.

Every system runs Linux 2.6.24.

When SACK is on, the throughput is around 180 Mbps.
When SACK is off, the throughput is around 260 Mbps.

I am sure it is not due to the computational overhead of processing SACK
blocks. All of these systems are multi-core platforms with 2 GHz+ CPUs. I
ran top to verify; the CPUs are idle most of the time.

I was thinking that the reordered ACKs/SACKs might cause confusion in the
sender, and the sender will unnecessarily reduce either the CWND or the
TCP_REORDERING threshold. I might need to take a serious look at the SACK
implementation.

I will send out the tcpdump files soon.

Thanks,

wenji
* Re: A Linux TCP SACK Question
From: John Heffner @ 2008-04-04 18:07 UTC
To: wenji; +Cc: netdev

On Fri, Apr 4, 2008 at 10:49 AM, Wenji Wu <wenji@fnal.gov> wrote:
> I was thinking that the reordered ACKs/SACKs might cause confusion in
> the sender, and the sender will unnecessarily reduce either the CWND or
> the TCP_REORDERING threshold. I might need to take a serious look at
> the SACK implementation.

It sounds very likely that you're encountering a bug or thinko in the sack
code.

This actually brings to mind an old topic -- NCR (RFC 4653). There was
some discussion of implementing this, which I think is simpler and more
robust than Linux's current reordering threshold calculation.

  -John
* RE: A Linux TCP SACK Question 2008-04-04 17:49 ` Wenji Wu 2008-04-04 18:07 ` John Heffner @ 2008-04-04 20:00 ` Ilpo Järvinen 2008-04-04 20:07 ` Wenji Wu 2008-04-04 21:15 ` Wenji Wu 1 sibling, 2 replies; 56+ messages in thread From: Ilpo Järvinen @ 2008-04-04 20:00 UTC (permalink / raw) To: Wenji Wu; +Cc: 'John Heffner', Netdev On Fri, 4 Apr 2008, Wenji Wu wrote: > Every system runs Linux 2.6.24. You should have reported kernel version right from the beginning. It may have a huge effect... ;-) > When sack is on, the throughput is around 180Mbps > When sack is off, the throughput is around 260Mbps Not a surprise, once some reordering is detected, SACK TCP switches away from FACK to something that's not what you'd expect (in 2.6.24), you should try 2.6.25-rcs first in which the non-FACK is very close to RFC3517. > I was thinking that if the reordered ACKs/SACKs cause confusion in the > sender, and sender will unnecessarily reduce either the CWND or the > TCP_REORDERING threshold. I might need to take a serious look at the > SACK implementation. I'd suggest that you don't waste too much effort for 2.6.24. ...Most of it is recoded/updated since then. -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-04 20:07 UTC
To: Ilpo Järvinen; +Cc: 'John Heffner', Netdev

> On Fri, 4 Apr 2008, Wenji Wu wrote:
>
> > Every system runs Linux 2.6.24.
>
> You should have reported the kernel version right from the beginning.
> It may have a huge effect... ;-)
>
> Not a surprise: once some reordering is detected, SACK TCP switches
> away from FACK to something that's not what you'd expect (in 2.6.24).
> You should try the 2.6.25-rcs first, in which the non-FACK behavior is
> very close to RFC 3517.
>
> I'd suggest that you don't waste too much effort on 2.6.24. ...Most of
> it has been recoded/updated since then.

Thanks, I will try it on the latest version and report the results.

wenji
* RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-04 21:15 UTC
To: 'Ilpo Järvinen'; +Cc: 'John Heffner', 'Netdev'

> I'd suggest that you don't waste too much effort on 2.6.24. ...Most of
> it has been recoded/updated since then.

Hi, Ilpo,

I just tried it on 2.6.25-rc8. The result is still the same: the
throughput with SACK on is less than with SACK off.

wenji
* RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-04 21:33 UTC
To: Wenji Wu; +Cc: 'John Heffner', 'Netdev'

On Fri, 4 Apr 2008, Wenji Wu wrote:

> > I'd suggest that you don't waste too much effort on 2.6.24. ...Most
> > of it has been recoded/updated since then.
>
> I just tried it on 2.6.25-rc8. The result is still the same: the
> throughput with SACK on is less than with SACK off.

Hmm, can you also try whether playing around with the FRTO setting makes
some difference (the tcp_frto sysctl)?

--
 i.
* RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-04 21:39 UTC
To: Wenji Wu; +Cc: 'John Heffner', 'Netdev'

On Sat, 5 Apr 2008, Ilpo Järvinen wrote:

> On Fri, 4 Apr 2008, Wenji Wu wrote:
>
> > I just tried it on 2.6.25-rc8. The result is still the same: the
> > throughput with SACK on is less than with SACK off.
>
> Hmm, can you also try whether playing around with the FRTO setting
> makes some difference (the tcp_frto sysctl)?

...Assuming it wasn't disabled already. If you find that there's a
significant difference, you could also try SACK + basic FRTO (set the
tcp_frto sysctl to 1).

--
 i.
* RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-04 22:14 UTC
To: 'Ilpo Järvinen'; +Cc: 'John Heffner', 'Netdev'

> ...Assuming it wasn't disabled already. If you find that there's a
> significant difference, you could also try SACK + basic FRTO (set the
> tcp_frto sysctl to 1).

No, still the same. I tried tcp_frto with 0, 1, and 2.

SACK on is worse than SACK off.

wenji
* RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-05 17:42 UTC
To: Wenji Wu; +Cc: 'John Heffner', 'Netdev'

On Fri, 4 Apr 2008, Wenji Wu wrote:

> > ...Assuming it wasn't disabled already. If you find that there's a
> > significant difference, you could also try SACK + basic FRTO (set
> > the tcp_frto sysctl to 1).
>
> No, still the same. I tried tcp_frto with 0, 1, and 2.
>
> SACK on is worse than SACK off.

No easy solution then; we'll have to take a look at the tcpdumps.

--
 i.
* Re: A Linux TCP SACK Question
From: Sangtae Ha @ 2008-04-05 21:17 UTC
To: wenji; +Cc: Ilpo Järvinen, John Heffner, Netdev

Can you run the attached script and then run your testing again?

I think it might be a problem of your dual cores balancing the interrupts
on your testing NIC. As we do a lot of things with SACK, cache misses
etc. might affect your performance.

In the default setting, I disabled TCP segmentation offload and set SMP
affinity to CPU 0. Please change "INF" to your interface name and let us
know the results.

Sangtae

On Fri, Apr 4, 2008 at 6:14 PM, Wenji Wu <wenji@fnal.gov> wrote:
> No, still the same. I tried tcp_frto with 0, 1, and 2.
>
> SACK on is worse than SACK off.

[-- Attachment #2: tuning.sh --]
* Re: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-06 20:27 UTC
To: Sangtae Ha; +Cc: Ilpo Järvinen, John Heffner, Netdev

> Can you run the attached script and then run your testing again?
>
> I think it might be a problem of your dual cores balancing the
> interrupts on your testing NIC. As we do a lot of things with SACK,
> cache misses etc. might affect your performance.
>
> In the default setting, I disabled TCP segmentation offload and set SMP
> affinity to CPU 0. Please change "INF" to your interface name and let
> us know the results.

I bound the network interrupts and iperf both to CPU 0, and CPU 0 is idle
most of the time. The results are still the same.

At this throughput level, the SACK processing won't take much CPU.

It is not the interrupt/CPU affinity that causes the difference.

I am believing that it is the ACK reordering that causes the confusion in
the sender, which leads the sender to unnecessarily reduce the CWND or
the REORDERING threshold.

wenji
* Re: A Linux TCP SACK Question
From: Sangtae Ha @ 2008-04-06 22:43 UTC
To: Wenji Wu; +Cc: Ilpo Järvinen, John Heffner, Netdev

When our 40 students did the same lab experiment comparing TCP-SACK and
TCP-NewReno, they came up with similar results. The settings were
identical to your setting (one Linux sender, one Linux receiver, and one
netem machine in between). When we introduced some loss using netem,
TCP-SACK showed a bit better performance, while they had similar
throughput in most cases.

I don't think reorderings happened frequently in your directly connected
networking scenario. Please post your tcpdump file to clear out all
doubts.

Sangtae

On 4/6/08, Wenji Wu <wenji@fnal.gov> wrote:
> I am believing that it is the ACK reordering that causes the confusion
> in the sender, which leads the sender to unnecessarily reduce the CWND
> or the REORDERING threshold.

--
Sangtae Ha, http://www4.ncsu.edu/~sha2
Ph.D. Student, Department of Computer Science,
North Carolina State University, USA
* RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-07 14:56 UTC
To: 'Sangtae Ha'; +Cc: 'Ilpo Järvinen', 'John Heffner', 'Netdev'

> I don't think reorderings happened frequently in your directly
> connected networking scenario. Please post your tcpdump file to clear
> out all doubts.

https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/

Two tcpdump files: one with SACK on, the other with SACK off. The test
configuration is described in my previous emails.

Best,

wenji
* RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-08 6:36 UTC
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

On Mon, 7 Apr 2008, Wenji Wu wrote:

> https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/
>
> Two tcpdump files: one with SACK on, the other with SACK off. The test
> configuration is described in my previous emails.

NewReno never retransmitted anything in them (except at the very end of
the transfer). Probably something related to how tp->reordering behaves,
I suppose...

ijjarvin@pointhope:~/linux/debug$ /usr/sbin/tcpdump -n -r nosack | grep "4888[35] >" | cut -d ' ' -f 7- | cut -d ':' -f 1 | awk '{if ($1 < old) {print $1}; old=$1;}'
reading from file nosack, link-type EN10MB (Ethernet)
1
641080641
ijjarvin@pointhope:~/linux/debug$
ijjarvin@pointhope:~/linux/debug$ /usr/sbin/tcpdump -n -r sack | grep "4888[35] >" | cut -d ' ' -f 7- | cut -d ':' -f 1 | awk '{if ($1 < old) {print $1}; old=$1;}'
reading from file sack, link-type EN10MB (Ethernet)
1
7265
10161
141929
175233
196953
446558881
3542223511
ijjarvin@pointhope:~/linux/debug$

This is probably far-fetched, but could you tell us how you make sure
that the earlier connection's metrics are not affecting the latter
connection?

I.e., that the discovered reordering is not transferred across the flows
(in a CBI-like manner) and thus NewReno has an unfair advantage?

--
 i.
* Re: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-08 12:33 UTC
To: Ilpo Järvinen; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

> NewReno never retransmitted anything in them (except at the very end of
> the transfer). Probably something related to how tp->reordering
> behaves, I suppose...

Yes, the adaptive tp->reordering will play a role here.

> This is probably far-fetched, but could you tell us how you make sure
> that the earlier connection's metrics are not affecting the latter
> connection?
>
> I.e., that the discovered reordering is not transferred across the
> flows (in a CBI-like manner) and thus NewReno has an unfair advantage?

You can reverse the order of the tests with the SACK option on/off. The
results are still the same.

Also, according to the source code, tp->reordering is initialized to
"/proc/sys/net/ipv4/tcp_reordering" (default 3) when a new connection is
established. After that, tp->reordering is controlled by the adaptive
algorithm.

wenji
* Re: RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-08 13:45 UTC
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

On Tue, 8 Apr 2008, Wenji Wu wrote:

> Yes, the adaptive tp->reordering will play a role here.

...What is not clear to me is why NewReno does not go to recovery at
least once near the beginning, or at least why that won't result in a
retransmission.

Which kernel version does this dump come from? 2.6.24 NewReno is crippled
with TSO, as was recently discovered; i.e., it won't mark lost super-skbs
at the head and thus won't retransmit them. Also the 2.6.25-rcs are still
broken (though they'll transmit too much; I'll not go into detail here).
DaveM now has the fix for the 2.6.25-rcs in net-2.6.

> You can reverse the order of the tests with the SACK option on/off. The
> results are still the same.

Ok. I just wanted to make sure so that we don't end up tracing some test
setup issue :-).

> Also, according to the source code, tp->reordering is initialized to
> "/proc/sys/net/ipv4/tcp_reordering" (default 3) when a new connection
> is established.

In addition, in tcp_init_metrics():

	if (dst_metric(dst, RTAX_REORDERING) &&
	    tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
		tcp_disable_fack(tp);
		tp->reordering = dst_metric(dst, RTAX_REORDERING);
	}

> After that, tp->reordering is controlled by the adaptive algorithm.

Yes; however, the algorithm will be vastly different in the two cases.
The NewReno stuff is in tcp_check_reno_reordering() and in one other
place in tcp_try_undo_partial(), but the latter only happens in recovery,
I think. SACK, on the other hand, has a number of call sites for
tcp_update_reordering(); check for yourself.

This might be due to my change which made tcp_check_reno_reordering() be
called earlier than it used to be (to remove a transition state during
which sacked_out contained stale info, including some already
cumulatively ACKed segments). I was quite unsure whether I could safely
do that.

It's not clear to me how your test could cause sacked_out >
packets_out - 1 to occur, though, which is necessary for
tcp_update_reordering() to get called with NewReno. The ACK reordering
should just make the number of duplicate ACKs smaller, because part of
them get discarded as old ones, as a newer cumulative ACK often arrives a
bit "ahead" of its time, making the remaining smaller-sequenced ACKs very
close to no-ops. ...Though I haven't yet done the awk magic to prove that
it won't happen in the non-SACK dump.

--
 i.
* Re: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-08 14:30 UTC
To: Ilpo Järvinen; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

> > Yes, the adaptive tp->reordering will play a role here.
>
> ...What is not clear to me is why NewReno does not go to recovery at
> least once near the beginning, or at least why that won't result in a
> retransmission.

This problem cost me two weeks to debug!

With 3 DupACKs, tcp_ack() calls tcp_fastretrans_alert(), which in turn
calls tcp_xmit_retransmit_queue(). Within tcp_xmit_retransmit_queue(),
there are lines of code that would cause the problem above:

.....................................................................

	/* Forward retransmissions are possible only during Recovery. */
1999	if (icsk->icsk_ca_state != TCP_CA_Recovery)
2000		return;
2001
2002	/* No forward retransmissions in Reno are possible. */
2003	if (tcp_is_reno(tp))
2004		return;

.....................................................................

If you look at "tcp_is_reno", you would see that with SACK off, Reno does
not do the retransmit; it will return!!!

I really do not understand why these two lines of code exist there!!!

Also, this code is still in 2.6.25.

> Which kernel version does this dump come from? 2.6.24 NewReno is
> crippled with TSO, as was recently discovered; i.e., it won't mark lost
> super-skbs at the head and thus won't retransmit them. Also the
> 2.6.25-rcs are still broken (though they'll transmit too much; I'll not
> go into detail here). DaveM now has the fix for the 2.6.25-rcs in
> net-2.6.

The dumped file is from 2.6.24. 2.6.25's is similar.

> > You can reverse the order of the tests with the SACK option on/off.
> > The results are still the same.
>
> Ok. I just wanted to make sure so that we don't end up tracing some
> test setup issue :-).
>
> > Also, according to the source code, tp->reordering is initialized to
> > "/proc/sys/net/ipv4/tcp_reordering" (default 3) when a new connection
> > is established.
>
> In addition, in tcp_init_metrics():
>
>	if (dst_metric(dst, RTAX_REORDERING) &&
>	    tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
>		tcp_disable_fack(tp);
>		tp->reordering = dst_metric(dst, RTAX_REORDERING);
>	}

Good to know this, thanks.

wenji
* Re: RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-08 14:59 UTC
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

On Tue, 8 Apr 2008, Wenji Wu wrote:

> With 3 DupACKs, tcp_ack() calls tcp_fastretrans_alert(), which in turn
> calls tcp_xmit_retransmit_queue().

Yeah. It should.

> Within tcp_xmit_retransmit_queue(), there are lines of code that would
> cause the problem above:
>
>	/* Forward retransmissions are possible only during Recovery. */
> 1999	if (icsk->icsk_ca_state != TCP_CA_Recovery)
> 2000		return;
> 2001
> 2002	/* No forward retransmissions in Reno are possible. */
> 2003	if (tcp_is_reno(tp))
> 2004		return;
>
> If you look at "tcp_is_reno", you would see that with SACK off, Reno
> does not do the retransmit; it will return!!!

Your analysis is missing something important here: there are two loops
there :-). One, for retransmitting assumed-lost segments, is above those
lines you quoted! The other, below, is for segments not marked lost,
similar to what is specified by RFC 3517's Rule 3 for NextSeg(), which
definitely won't apply for NewReno nor should be executed.

> I really do not understand why these two lines of code exist there!!!
>
> Also, this code is still in 2.6.25.

Sure, but there's nothing wrong with them! 2.6.24 is just currently
broken if you have TSO + NewReno, because it won't do the correct lost
marking, which is a necessary preparation step for the loop above that.
Too bad, as I only figured that out one or two days ago, so there's no
fix available yet :-).

> The dumped file is from 2.6.24. 2.6.25's is similar.

It's a bit hard for me to believe, considering what the last weeks of
debugging have revealed about its internals. Have you checked it from the
dumps or from the overall results? A similarity in the latter could be
due to other factors related to the differences in reordering detection
between NewReno and SACK.

> > In addition, in tcp_init_metrics():
> >
> >	if (dst_metric(dst, RTAX_REORDERING) &&
> >	    tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
> >		tcp_disable_fack(tp);
> >		tp->reordering = dst_metric(dst, RTAX_REORDERING);
> >	}
>
> Good to know this, thanks.

...There might be some bug which causes it to get skipped under some
circumstances, though (which I haven't yet remembered to fix). I don't
remember too well anymore; probably some goto caused skipping most of
what's in there.

--
 i.
* Re: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-08 15:27 UTC
To: Ilpo Järvinen; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

> It's a bit hard for me to believe, considering what the last weeks of
> debugging have revealed about its internals. Have you checked it from
> the dumps or from the overall results? A similarity in the latter could
> be due to other factors related to the differences in reordering
> detection between NewReno and SACK.
>
> ...There might be some bug which causes it to get skipped under some
> circumstances, though (which I haven't yet remembered to fix). I don't
> remember too well anymore; probably some goto caused skipping most of
> what's in there.

I will get back to you later and post the tcpdump file for 2.6.25.

wenji
* Re: RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-08 17:26 UTC
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

On Tue, 8 Apr 2008, Wenji Wu wrote:

> I will get back to you later and post the tcpdump file for 2.6.25.

Please, if possible, use a kernel version that includes my TCP fixes
applied today; i.e., at least DaveM's net-2.6 already has them. I didn't
check whether Linus has pulled them in yet.

--
 i.
* RE: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-14 22:47 UTC
To: 'Ilpo Järvinen'; +Cc: 'Netdev'

Hi, Ilpo,

Could the throughput difference with SACK on/off be due to the following
code in tcp_ack()?

3120	if (tcp_ack_is_dubious(sk, flag)) {
3121		/* Advance CWND, if state allows this. */
3122		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
3123		    tcp_may_raise_cwnd(sk, flag))
3124			tcp_cong_avoid(sk, ack, prior_in_flight, 0);
3125		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out, flag);
3126	} else {
3127		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
3128			tcp_cong_avoid(sk, ack, prior_in_flight, 1);
3129	}

In my tests there are actually no packet drops, just severe packet
reordering in both the forward and reverse paths. With good
tcp_reordering auto-tuning, there are few retransmissions.

(1) With the SACK option off, the reordered ACKs will not cause much harm
to the throughput. As you have pointed out in your email, "The ACK
reordering should just make the number of duplicate ACKs smaller, because
part of them get discarded as old ones, as a newer cumulative ACK often
arrives a bit 'ahead' of its time, making the remaining smaller-sequenced
ACKs very close to no-ops."

If there is any ACK advancement, tcp_cong_avoid() will be called.

(2) With the SACK option on: if the ACKs do not advance the left edge of
the window, those ACKs will go to "old_ack" in tcp_ack() -- not much
processing except SACK-tagging the corresponding packets in the
retransmission queue. tcp_cong_avoid() will not be called.

However, if the ACKs advance the left edge of the window and these ACKs
include SACK options, tcp_ack_is_dubious(sk, flag) will be true. Then the
call to tcp_cong_avoid() needs to satisfy the if-condition at line 3122,
which is stricter than the if-condition at line 3127.

So the congestion window with SACK on would be smaller than with SACK
off.

If you run tcptrace and xplot on the files I posted, you will see that
lots of ACKs advance the left edge of the window and include SACK blocks.

Not quite sure, just a guess.

wenji
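For context, the test that gates the two branches quoted above looks
roughly like the following in kernels of that era. This is a paraphrased
sketch of net/ipv4/tcp_input.c rather than a verbatim quote; the flag and
state names are the real ones, but the exact code should be verified
against the tree in use:

	/* Sketch of the gate for the two branches above: an ACK is
	 * "dubious" if it is a duplicate, if it carries a congestion
	 * alert (e.g. new SACK or ECE information), or if it arrives
	 * while the connection is not in the CA_Open state.
	 */
	static inline int tcp_ack_is_dubious(const struct sock *sk, const int flag)
	{
		return (!(flag & FLAG_NOT_DUP) || (flag & FLAG_CA_ALERT) ||
			inet_csk(sk)->icsk_ca_state != TCP_CA_Open);
	}

With heavy ACK reordering, the CA_Disorder state and freshly arriving
SACK blocks both make this test fire, which is why the SACK flow keeps
taking the stricter line-3122 path in the discussion above.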
* Re: RE: A Linux TCP SACK Question
From: John Heffner @ 2008-04-15 0:48 UTC
To: wenji; +Cc: Ilpo Järvinen, Netdev

On Mon, Apr 14, 2008 at 3:47 PM, Wenji Wu <wenji@fnal.gov> wrote:
> However, if the ACKs advance the left edge of the window and these ACKs
> include SACK options, tcp_ack_is_dubious(sk, flag) will be true. Then
> the call to tcp_cong_avoid() needs to satisfy the if-condition at line
> 3122, which is stricter than the if-condition at line 3127.
>
> So the congestion window with SACK on would be smaller than with SACK
> off.
>
> Not quite sure, just a guess.

I had considered this, but it would seem that tcp_may_raise_cwnd() in
this case *should* return true, right?

Still the mystery remains as to why *both* are going so slowly. You
mentioned you're using a web100 kernel. What are the final values of all
the variables for the connections (grab with readall)?

Thanks,
  -John
* Re: RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-15 8:25 UTC
To: John Heffner, wenji; +Cc: Netdev

On Mon, 14 Apr 2008, John Heffner wrote:

> On Mon, Apr 14, 2008 at 3:47 PM, Wenji Wu <wenji@fnal.gov> wrote:
> >
> > Could the throughput difference with SACK on/off be due to the
> > following code in tcp_ack()?
> >
> > 3120	if (tcp_ack_is_dubious(sk, flag)) {
> > 3121		/* Advance CWND, if state allows this. */
> > 3122		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
> > 3123		    tcp_may_raise_cwnd(sk, flag))
> > 3124			tcp_cong_avoid(sk, ack, prior_in_flight, 0);
> > 3125		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out, flag);
> > 3126	} else {
> > 3127		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
> > 3128			tcp_cong_avoid(sk, ack, prior_in_flight, 1);
> > 3129	}
> >
> > In my tests there are actually no packet drops, just severe packet
> > reordering in both the forward and reverse paths. With good
> > tcp_reordering auto-tuning, there are few retransmissions.
> >
> > (1) With the SACK option off, the reordered ACKs will not cause much
> > harm to the throughput. As you have pointed out in your email, "The
> > ACK reordering should just make the number of duplicate ACKs smaller,
> > because part of them get discarded as old ones, as a newer cumulative
> > ACK often arrives a bit 'ahead' of its time, making the remaining
> > smaller-sequenced ACKs very close to no-ops."

...Please note that these are considered old ACKs, so we do goto old_ack,
which is equal for both SACK and NewReno. ...So it won't make any
difference between them.

> > If there is any ACK advancement, tcp_cong_avoid() will be called.

The NewReno case analysis is not exactly what you assume: if there was at
least one duplicate ACK already, the ca_state will be CA_Disorder for
NewReno, which makes ack_is_dubious true. You probably assumed it goes to
the other branch directly?

> > (2) With the SACK option on: if the ACKs do not advance the left edge
> > of the window, those ACKs will go to "old_ack" in tcp_ack() -- not
> > much processing except SACK-tagging the corresponding packets in the
> > retransmission queue. tcp_cong_avoid() will not be called.

No, this is not right. The old_ack case happens only if the left edge
backtracks, in which case we obviously should discard, as it's stale
information (except that the SACK may reveal something not yet known,
which is why sacktag is called there). The same applies regardless of
SACK (no tagging, of course).

...Hmm, there's one questionable part here in the code (I doubt it makes
any difference here, though). If new SACK info is discovered, we don't
retransmit but send new data (if the window allows), even when in
recovery, where TCP should retransmit first.

> > However, if the ACKs advance the left edge of the window and these
> > ACKs include SACK options, tcp_ack_is_dubious(sk, flag) will be true.
> > Then the call to tcp_cong_avoid() needs to satisfy the if-condition
> > at line 3122, which is stricter than the if-condition at line 3127.
> >
> > So the congestion window with SACK on would be smaller than with SACK
> > off.

I think you might have found a bug, though it won't affect you; it
actually makes that check easier to pass. The questionable thing is the
|| in tcp_may_raise_cwnd() (it might not be intentional)...

But in your case, during the initial slow start, the condition in
tcp_may_raise_cwnd() will always be true (if your metrics are cleared as
they should be). Because: (...not important || 1) && 1, since cwnd <
ssthresh. After that, when you don't have ECE nor are in recovery,
tcp_may_raise_cwnd() evaluates to: (1 || ...not calculated) && 1, so it
should always allow the increment in your case, except when in recovery,
which hardly makes up for the difference you're seeing...

> > If you run tcptrace and xplot on the files I posted, you will see
> > that lots of ACKs advance the left edge of the window and include
> > SACK blocks.

This only makes a difference if any of those SACK blocks are new. If
they're not, DATA_SACKED_ACKED won't be set in flag.

> > Not quite sure, just a guess.

You seem to be missing the third case, which I tried to point out
earlier: the case where the left edge remains the same. I think it makes
a huge difference here (I'll analyse the non-recovery case here):

NewReno always goes to fastretrans_alert(), to the default branch, and
because it's is_dupack, it increments sacked_out through
tcp_add_reno_sack(). Effectively, packets_in_flight is reduced by one and
TCP is able to send a new segment out.

Now with SACK there are two cases:

SACK with newly discovered SACK info (for simplicity, let's assume just
one newly discovered SACKed segment): sacktag marks that segment and
increments sacked_out, effectively making packets_in_flight equal to the
NewReno case. It goes to fastretrans_alert() and makes all the same
maneuvers as NewReno (except if enough SACK blocks have arrived to
trigger recovery while NewReno would not have enough dupACKs collected; I
doubt that this makes the difference, though. I'll need logs without
metrics to verify that the number of recoveries is quite small).

SACK with no new SACK info: sacktag won't find anything to mark, so
sacked_out remains the same. It goes to fastretrans_alert() because
ca_state is CA_Disorder. But now we did lose one segment compared with
NewReno, because we didn't increment sacked_out, making packets_in_flight
stay at the amount it was before. Thus we cannot send a new data segment
out, and we fall behind NewReno.

> I had considered this, but it would seem that tcp_may_raise_cwnd() in
> this case *should* return true, right?

Yes, it seems so. Though I think that it's unintentional; I'd say that
that || should be &&, but I might be wrong.

> Still the mystery remains as to why *both* are going so slowly. You
> mentioned you're using a web100 kernel. What are the final values of
> all the variables for the connections (grab with readall)?

...I think that, due to reordering, one will lose part of the cwnd
increments because of old ACKs, as they won't allow you to add more
segments to the network; at some point the lossage will be large enough
to stall the growth of the cwnd (if in congestion avoidance with the
small increment). With slow start it seems not that self-evident that
such a level exists, though it might.

--
 i.
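For readers following the evaluation above, the two helpers at the center
of the analysis look roughly like this in kernels of that era. These are
paraphrased sketches of include/net/tcp.h and net/ipv4/tcp_input.c,
reconstructed from memory rather than quoted verbatim; verify against the
exact tree in use:

	/* The check with the questionable ||: raise cwnd only if there
	 * is no ECE (or we are still in slow start), and we are in
	 * neither the Recovery nor the CWR state.
	 */
	static inline int tcp_may_raise_cwnd(const struct sock *sk, const int flag)
	{
		const struct tcp_sock *tp = tcp_sk(sk);
		return (!(flag & FLAG_ECE) || tp->snd_cwnd < tp->snd_ssthresh) &&
			!((1 << inet_csk(sk)->icsk_ca_state) &
			  (TCPF_CA_Recovery | TCPF_CA_CWR));
	}

	/* Why incrementing sacked_out lets another segment out: SACKed
	 * and lost segments are subtracted from the in-flight estimate,
	 * while retransmitted ones are added back.
	 */
	static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp)
	{
		return tp->packets_out - (tp->sacked_out + tp->lost_out) +
			tp->retrans_out;
	}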
* Re: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-15 18:01 UTC
To: Ilpo Järvinen; +Cc: John Heffner, Netdev

> No, this is not right. The old_ack case happens only if the left edge
> backtracks, in which case we obviously should discard, as it's stale
> information (except that the SACK may reveal something not yet known,
> which is why sacktag is called there). The same applies regardless of
> SACK (no tagging, of course).

Yes, I mis-stated myself in the last email. What I meant is the left-edge
backtrack case, as you have pointed out.

> I think you might have found a bug, though it won't affect you; it
> actually makes that check easier to pass. The questionable thing is the
> || in tcp_may_raise_cwnd() (it might not be intentional)...
>
> But in your case, during the initial slow start, the condition in
> tcp_may_raise_cwnd() will always be true (if your metrics are cleared
> as they should be). Because: (...not important || 1) && 1, since cwnd <
> ssthresh. After that, when you don't have ECE nor are in recovery,
> tcp_may_raise_cwnd() evaluates to: (1 || ...not calculated) && 1, so it
> should always allow the increment in your case, except when in
> recovery, which hardly makes up for the difference you're seeing...

You are right. I just printed out the return value of
tcp_may_raise_cwnd(). It is always one!

> This only makes a difference if any of those SACK blocks are new. If
> they're not, DATA_SACKED_ACKED won't be set in flag.
>
> You seem to be missing the third case, which I tried to point out
> earlier: the case where the left edge remains the same. I think it
> makes a huge difference here (I'll analyse the non-recovery case here):
>
> NewReno always goes to fastretrans_alert(), to the default branch, and
> because it's is_dupack, it increments sacked_out through
> tcp_add_reno_sack(). Effectively, packets_in_flight is reduced by one
> and TCP is able to send a new segment out.
>
> Now with SACK there are two cases:
>
> SACK with newly discovered SACK info (for simplicity, let's assume just
> one newly discovered SACKed segment): sacktag marks that segment and
> increments sacked_out, effectively making packets_in_flight equal to
> the NewReno case. It goes to fastretrans_alert() and makes all the same
> maneuvers as NewReno.
>
> SACK with no new SACK info: sacktag won't find anything to mark, so
> sacked_out remains the same. It goes to fastretrans_alert() because
> ca_state is CA_Disorder. But now we did lose one segment compared with
> NewReno, because we didn't increment sacked_out, making
> packets_in_flight stay at the amount it was before. Thus we cannot send
> a new data segment out, and we fall behind NewReno.

Agree with you. Thanks. You have given me a good class on the Linux
ACK/SACK implementation. Thank you.

> > I had considered this, but it would seem that tcp_may_raise_cwnd() in
> > this case *should* return true, right?
>
> Yes, it seems so. Though I think that it's unintentional; I'd say that
> that || should be &&, but I might be wrong.

Yes, it is always true!

wenji
* Re: RE: A Linux TCP SACK Question
From: John Heffner @ 2008-04-15 22:40 UTC
To: Wenji Wu; +Cc: Ilpo Järvinen, Netdev

Wenji, can you try this out?  Patch against net-2.6.26.

Thanks,
  -John

[-- Attachment #2: 0001-Increase-the-max_burst-threshold-from-3-to-tp-reord.patch --]

From 4cb2a9fd1d497b02bfdd06f71b499d441ca10aee Mon Sep 17 00:00:00 2001
From: John Heffner <johnwheffner@gmail.com>
Date: Tue, 15 Apr 2008 15:26:39 -0700
Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.

This change is necessary to allow cwnd to grow during persistent
reordering.  Cwnd moderation is applied when in the disorder state
and an ack that fills the hole comes in.  If the hole was greater
than 3 packets, but less than tp->reordering, cwnd will shrink when
it should not have.

Signed-off-by: John Heffner <jheffner@napa.(none)>
---
 include/net/tcp.h |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2c14edf..633147c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -787,11 +787,14 @@ extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
 extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst);
 
 /* Slow start with delack produces 3 packets of burst, so that
- * it is safe "de facto".
+ * it is safe "de facto". This will be the default - same as
+ * the default reordering threshold - but if reordering increases,
+ * we must be able to allow cwnd to burst at least this much in order
+ * to not pull it back when holes are filled.
  */
 static __inline__ __u32 tcp_max_burst(const struct tcp_sock *tp)
 {
-	return 3;
+	return tp->reordering;
 }
 
 /* Returns end sequence number of the receiver's advertised window */
--
1.5.2.5
* Re: A Linux TCP SACK Question
From: David Miller @ 2008-04-16 8:27 UTC
To: johnwheffner; +Cc: wenji, ilpo.jarvinen, netdev

From: "John Heffner" <johnwheffner@gmail.com>
Date: Tue, 15 Apr 2008 15:40:05 -0700

> Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.
>
> This change is necessary to allow cwnd to grow during persistent
> reordering.  Cwnd moderation is applied when in the disorder state
> and an ack that fills the hole comes in.  If the hole was greater
> than 3 packets, but less than tp->reordering, cwnd will shrink when
> it should not have.
>
> Signed-off-by: John Heffner <jheffner@napa.(none)>

I think this patch is correct, or at least more correct than what this
code is doing right now.

Any objections to my adding this to net-2.6.26?
* Re: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-16 9:21 UTC
To: David Miller; +Cc: johnwheffner, wenji, Netdev

On Wed, 16 Apr 2008, David Miller wrote:

> From: "John Heffner" <johnwheffner@gmail.com>
> Date: Tue, 15 Apr 2008 15:40:05 -0700
>
> > Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.
> >
> > This change is necessary to allow cwnd to grow during persistent
> > reordering.  Cwnd moderation is applied when in the disorder state
> > and an ack that fills the hole comes in.  If the hole was greater
> > than 3 packets, but less than tp->reordering, cwnd will shrink when
> > it should not have.
>
> I think this patch is correct, or at least more correct than what this
> code is doing right now.
>
> Any objections to my adding this to net-2.6.26?

I don't have objections.

But I want to note that tp->reordering does not consider the situation on
that specific ACK, because its value might originate a number of segments
and even RTTs back. I think it could be possible to find a more
appropriate value for max_burst locally to an ACK. ...Though it might be
a bit of an over-engineered solution. For SACK we calculate a similar
metric anyway in tcp_clean_rtx_queue() to find whether tp->reordering
needs to be updated at a cumulative ACK, and for NewReno
min(tp->sacked_out, tp->reordering) + 3 could perhaps be used (I'm not
sure whether these would be foolproof in recovery, though).

--
 i.
* Re: A Linux TCP SACK Question
From: David Miller @ 2008-04-16 9:35 UTC
To: ilpo.jarvinen; +Cc: johnwheffner, wenji, netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 16 Apr 2008 12:21:38 +0300 (EEST)

> But I want to note that tp->reordering does not consider the situation
> on that specific ACK, because its value might originate a number of
> segments and even RTTs back. I think it could be possible to find a
> more appropriate value for max_burst locally to an ACK. ...Though it
> might be a bit of an over-engineered solution. For SACK we calculate a
> similar metric anyway in tcp_clean_rtx_queue() to find whether
> tp->reordering needs to be updated at a cumulative ACK, and for NewReno
> min(tp->sacked_out, tp->reordering) + 3 could perhaps be used (I'm not
> sure whether these would be foolproof in recovery, though).

Right, we can tweak this thing further later.

*beep* *beep* I've added John's patch to net-2.6.26
* RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-16 14:50 UTC
To: 'David Miller', ilpo.jarvinen; +Cc: johnwheffner, netdev

> Right, we can tweak this thing further later.
>
> *beep* *beep* I've added John's patch to net-2.6.26

I just tried it with John's patch. It works, saturating the 1 Gbps link
in my test.

Without the patch, the throughput is around 180 Mbps with SACK on and
250 Mbps with SACK off.

The same test environment is described in my previous emails.

wenji
* Re: A Linux TCP SACK Question
From: David Miller @ 2008-04-18 6:52 UTC
To: wenji; +Cc: ilpo.jarvinen, johnwheffner, netdev

From: Wenji Wu <wenji@fnal.gov>
Date: Wed, 16 Apr 2008 09:50:19 -0500

> > I've added John's patch to net-2.6.26
>
> I just tried it with John's patch. It works, saturating the 1 Gbps
> link in my test.
>
> Without the patch, the throughput is around 180 Mbps with SACK on and
> 250 Mbps with SACK off.
>
> The same test environment is described in my previous emails.

After this patch cooks for a couple more days I'll submit it to -stable.

Thanks for your report and all of your testing, Wenji.

Thanks, John, for the patch.
* about Linux adaptively adjusting ssthresh
From: Wenji Wu @ 2008-08-27 14:38 UTC
To: 'David Miller', ilpo.jarvinen; +Cc: johnwheffner, netdev

Hi, all,

Could anybody help me out with Linux adaptively adjusting ssthresh?
Thanks in advance.

I understand that the latest Linux is able to adaptively adjust ssthresh
to avoid retransmission. Could anybody tell me which algorithms have been
implemented for the adaptive ssthresh adjustment?

Thanks,

wenji
* Re: about Linux adaptively adjusting ssthresh
From: John Heffner @ 2008-08-27 22:48 UTC
To: wenji; +Cc: David Miller, ilpo.jarvinen, netdev

On Wed, Aug 27, 2008 at 7:38 AM, Wenji Wu <wenji@fnal.gov> wrote:
> I understand that the latest Linux is able to adaptively adjust
> ssthresh to avoid retransmission. Could anybody tell me which
> algorithms have been implemented for the adaptive ssthresh adjustment?

A little more detail would be helpful. Are you referring to caching
ssthresh between connections, or to something going on during a
connection? Various congestion control modules use ssthresh differently,
so a comprehensive answer would be difficult.

  -John
* Re: about Linux adaptively adjusting ssthresh
From: Wenji Wu @ 2008-08-28 0:53 UTC
To: John Heffner; +Cc: David Miller, ilpo.jarvinen, netdev

> A little more detail would be helpful. Are you referring to caching
> ssthresh between connections, or to something going on during a
> connection? Various congestion control modules use ssthresh
> differently, so a comprehensive answer would be difficult.

Thanks, John. I am referring to the adaptive ssthresh adjustment during a
connection.

thanks,

wenji
* Re: about Linux adaptively adjusting ssthresh
From: Ilpo Järvinen @ 2008-08-28 6:34 UTC
To: Wenji Wu; +Cc: John Heffner, David Miller, Netdev

On Wed, 27 Aug 2008, Wenji Wu wrote:

> Thanks, John. I am referring to the adaptive ssthresh adjustment during
> a connection.

??? Every now and then (once we detect some losses), snd_ssthresh is set
to a halved flight size, as given by, well, you know, those standards
that say something about it :-). So I (like John) seem to somewhat miss
the point of your question here.

Or did you perhaps refer to rcv_ssthresh (which I wouldn't ever call
plain "ssthresh")?

--
 i.
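For concreteness, the standard halving referred to here is the Reno
ssthresh hook. In kernels of that era it looks roughly like the following
paraphrased sketch of net/ipv4/tcp_cong.c; individual congestion control
modules (BIC, CUBIC, etc.) can override this hook with their own rule:

	/* Sketch of the standard "halve on loss" rule: when a loss is
	 * detected, ssthresh drops to half the current window, but never
	 * below two segments.
	 */
	u32 tcp_reno_ssthresh(struct sock *sk)
	{
		const struct tcp_sock *tp = tcp_sk(sk);

		return max(tp->snd_cwnd >> 1U, 2U);
	}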
* about Linux adaptively adjusting dupthresh 2008-08-28 6:34 ` Ilpo Järvinen @ 2008-08-28 14:20 ` Wenji Wu 2008-08-28 18:53 ` Ilpo Järvinen 0 siblings, 1 reply; 56+ messages in thread From: Wenji Wu @ 2008-08-28 14:20 UTC (permalink / raw) To: 'Ilpo Järvinen' Cc: 'John Heffner', 'David Miller', 'Netdev' Sorry, I made a mistake in the last post; what I meant was "algorithms that adaptively adjust the TCP reordering threshold dupthresh". I understand that the "Eifel algorithm" or "DSACK TCP" will adaptively adjust dupthresh to deal with packet reordering. Are there any other reordering-tolerant algorithms implemented in Linux? Thanks, wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: about Linux adaptively adjusting dupthresh 2008-08-28 14:20 ` about Linux adaptively adjusting dupthresh Wenji Wu @ 2008-08-28 18:53 ` Ilpo Järvinen 2008-08-28 19:30 ` Wenji Wu 0 siblings, 1 reply; 56+ messages in thread From: Ilpo Järvinen @ 2008-08-28 18:53 UTC (permalink / raw) To: Wenji Wu; +Cc: 'John Heffner', 'David Miller', 'Netdev' On Thu, 28 Aug 2008, Wenji Wu wrote: > Sorry, I made a mistake in the last post; what I meant was "algorithms > that adaptively adjust the TCP reordering threshold dupthresh". Ah, that makes much more sense. :-) > I understand that the "Eifel algorithm" or "DSACK TCP" will adaptively adjust > dupthresh to deal with packet reordering. Are there any other > reordering-tolerant algorithms implemented in Linux? First, about adaptive dupthresh: In addition to DSACK, we use cumulative ACKs of never-retransmitted blocks to increase the dupthresh (see tcp_clean_rtx_queue). Then there's some newreno thing when dupacks > packets_out, but I've never fully figured out whether it does the correct thing with the + tp->packets_out beyond the simplest case (see tcp_check_reno_reordering). I don't think that Eifel adjusts dupthresh, though it can remove the ambiguity problem and thus we can use the never-retransmitted-block-acked detection more often. Also, there's some added logic for the small-window case to reduce dupthresh temporarily (at the smallest to 3, or whatever the default is) if the window is not large enough to generate the incremented number of duplicate ACKs (see tcp_time_to_recover). Again, I'm not too sure what you mean by "reordering tolerant", but here are some things that may be related: FACK -> RFC3517 auto-fallback if reordering is detected (basically, holes are only counted with FACK in the more-than-dupthresh check). I guess Eifel-like timestamp checking belongs to this category (in tcp_try_undo_partial). If a latency spike + reordering occurs, SACK FRTO might help, but I think it depends on the scenario. -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
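For readers following the function names above: the common entry point for most of these dupthresh increases is tcp_update_reordering() in net/ipv4/tcp_input.c. A simplified sketch from memory (the MIB accounting and several details are trimmed, so treat it as illustrative only), showing the ceiling and the FACK fallback Ilpo mentions:

    static void tcp_update_reordering(struct sock *sk, const int metric,
                                      const int ts)
    {
            struct tcp_sock *tp = tcp_sk(sk);

            if (metric > tp->reordering) {
                    /* raise dupthresh, capped at TCP_MAX_REORDERING (127) */
                    tp->reordering = min(TCP_MAX_REORDERING, metric);

                    /* FACK assumes SACKs arrive in order, so it is
                     * disabled once reordering has been observed */
                    tcp_disable_fack(tp);
            }
    }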
* RE: about Linux adaptively adjusting dupthresh 2008-08-28 18:53 ` Ilpo Järvinen @ 2008-08-28 19:30 ` Wenji Wu 0 siblings, 0 replies; 56+ messages in thread From: Wenji Wu @ 2008-08-28 19:30 UTC (permalink / raw) To: 'Ilpo Järvinen' Cc: 'John Heffner', 'David Miller', 'Netdev' Thanks, ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: A Linux TCP SACK Question 2008-04-16 9:21 ` Ilpo Järvinen 2008-04-16 9:35 ` David Miller @ 2008-04-16 14:40 ` John Heffner 2008-04-16 15:03 ` Ilpo Järvinen 1 sibling, 1 reply; 56+ messages in thread From: John Heffner @ 2008-04-16 14:40 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: David Miller, wenji, Netdev On Wed, Apr 16, 2008 at 2:21 AM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > > On Wed, 16 Apr 2008, David Miller wrote: > > > From: "John Heffner" <johnwheffner@gmail.com> > > Date: Tue, 15 Apr 2008 15:40:05 -0700 > > > > > Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering. > > > > > > This change is necessary to allow cwnd to grow during persistent > > > reordering. Cwnd moderation is applied when in the disorder state > > > and an ack that fills the hole comes in. If the hole was greater > > > than 3 packets, but less than tp->reordering, cwnd will shrink when > > > it should not have. > > > > > > Signed-off-by: John Heffner <jheffner@napa.(none)> > > > > I think this patch is correct, or at least more correct than what > > this code is doing right now. > > > > Any objections to my adding this to net-2.6.26? > > I don't have objections. > > But I want to note that tp->reordering does not consider the situation on > that specific ACK, because its value might originate from a number of segments > and even RTTs back. I think it could be possible to find a more > appropriate value for max_burst locally to an ACK. ...Though it might be a > bit of an over-engineered solution. For SACK we calculate a similar metric anyway > in tcp_clean_rtx_queue to find out if tp->reordering needs to be updated at a > cumulative ACK, and for NewReno min(tp->sacked_out, tp->reordering) + 3 > could perhaps be used (I'm not sure if these would be foolproof in > recovery though). Reordering is generally a random process resulting from a packet traversing parallel queues. (In the case of netem, the random process is explicitly defined by simulation.) As reordering is created by packets sitting in queues, these queues *should* be able to absorb a burst of at least the reordering size. That's at least my justification for using the reordering threshold as max_burst, along with the fact that it should prevent cwnd from getting clamped. Anyway, max_burst isn't a standard. TCP makes no guarantees that it won't burst a full window. If anything, I actually think that in most cases we'd be better off without it. It's harmful to high-BDP flows because it pulls down cwnd, which has a long-term effect in response to a short-term event. -John ^ permalink raw reply [flat|nested] 56+ messages in thread
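As best as the patch can be reconstructed here (a sketch of the change, not the verified commit text), it turns the hard-coded burst allowance in include/net/tcp.h into the adaptive reordering metric:

    /* include/net/tcp.h: burst allowance used by cwnd moderation */
    static inline __u32 tcp_max_burst(const struct tcp_sock *tp)
    {
            return tp->reordering;  /* previously a constant 3 */
    }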
* Re: A Linux TCP SACK Question 2008-04-16 14:40 ` A Linux TCP SACK Question John Heffner @ 2008-04-16 15:03 ` Ilpo Järvinen 0 siblings, 0 replies; 56+ messages in thread From: Ilpo Järvinen @ 2008-04-16 15:03 UTC (permalink / raw) To: John Heffner; +Cc: David Miller, wenji, Netdev On Wed, 16 Apr 2008, John Heffner wrote: > On Wed, Apr 16, 2008 at 2:21 AM, Ilpo Järvinen > <ilpo.jarvinen@helsinki.fi> wrote: > > > > But I want to note that tp->reordering does not consider the situation on > > that specific ACK, because its value might originate from a number of segments > > and even RTTs back. I think it could be possible to find a more > > appropriate value for max_burst locally to an ACK. ...Though it might be a > > bit of an over-engineered solution. For SACK we calculate a similar metric anyway > > in tcp_clean_rtx_queue to find out if tp->reordering needs to be updated at a > > cumulative ACK, and for NewReno min(tp->sacked_out, tp->reordering) + 3 > > could perhaps be used (I'm not sure if these would be foolproof in > > recovery though). > > Reordering is generally a random process resulting from a packet > traversing parallel queues. (In the case of netem, the random process > is explicitly defined by simulation.) As reordering is created by > packets sitting in queues, these queues *should* be able to absorb a > burst of at least the reordering size. That's at least my > justification for using the reordering threshold as max_burst, along > with the fact that it should prevent cwnd from getting clamped. Sure, but combined with other phenomena such as ACK compression (and an appropriate ACK pattern & preceding TCP state), one might end up generating much larger bursts than just tp->reordering. Though it's probably not any worse than what ACK compression can already cause, e.g. after a spurious RTO. And one is quite guaranteed to run out of something else too before things get too nasty. -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: RE: A Linux TCP SACK Question 2008-04-15 22:40 ` John Heffner 2008-04-16 8:27 ` David Miller @ 2008-04-16 14:46 ` Wenji Wu 1 sibling, 0 replies; 56+ messages in thread From: Wenji Wu @ 2008-04-16 14:46 UTC (permalink / raw) To: 'John Heffner'; +Cc: 'Ilpo Järvinen', 'Netdev' >Wenji, can you try this out? Patch against net-2.6.26. I just tried the new patch. It works, saturating the 1Gbps link. The experiment works as: Sender --- Router --- Receiver Iperf is sending from the sender to the receiver. In between there is an emulated router which runs netem. The emulated router has two interfaces, both with netem configured. One interface emulates the forward path and the other the reverse path. Both netem interfaces are configured with 1.5ms delay and 0.15ms variance. No packet drops. Kernel: 2.6.25-rc9 patched with the file you provided. Thanks, wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-15 0:48 ` John Heffner 2008-04-15 8:25 ` Ilpo Järvinen @ 2008-04-15 15:45 ` Wenji Wu 2008-04-15 16:39 ` Wenji Wu 2 siblings, 0 replies; 56+ messages in thread From: Wenji Wu @ 2008-04-15 15:45 UTC (permalink / raw) To: John Heffner; +Cc: Ilpo Järvinen, Netdev > Still the mystery remains as to why *both* are going so slowly. You > mentioned you're using a web100 kernel. What are the final values of > all the variables for the connections (grab with readall)? Kernel 2.6.24, "echo 1 > /proc/sys/net/ipv4/tcp_no_metrics_save" With SACK off: Throughput 256Mbps Connection 6 (198.2.1.2 38054 131.225.2.16 5001) State 1 SACKEnabled 0 TimestampsEnabled 1 NagleEnabled 1 ECNEnabled 0 SndWinScale 11 RcvWinScale 7 ActiveOpen 1 MSSRcvd 0 WinScaleRcvd 11 WinScaleSent 7 PktsOut 221715 DataPktsOut 221715 DataBytesOut 324429992 PktsIn 215245 DataPktsIn 0 DataBytesIn 0 SndUna 2784091744 SndNxt 2784091744 SndMax 2784091744 ThruBytesAcked 321011738 SndISS 2463080006 RcvNxt 1309516114 ThruBytesReceived 0 RecvISS 1309516114 StartTimeSec 1208273537 StartTimeUsec 293029 Duration 14594853 SndLimTransSender 6 SndLimBytesSender 23960 SndLimTimeSender 4137 SndLimTransCwnd 5 SndLimBytesCwnd 324406032 SndLimTimeCwnd 10046308 SndLimTransRwin 0 SndLimBytesRwin 0 SndLimTimeRwin 0 SlowStart 0 CongAvoid 0 CongestionSignals 4 OtherReductions 13167 X_OtherReductionsCV 0 X_OtherReductionsCM 13167 CongestionOverCount 54 CurCwnd 4344 MaxCwnd 173760 CurSsthresh 94894680 LimCwnd 4294965848 MaxSsthresh 94894680 MinSsthresh 4344 FastRetran 4 Timeouts 0 SubsequentTimeouts 0 CurTimeoutCount 0 AbruptTimeouts 0 PktsRetrans 17 BytesRetrans 24616 DupAcksIn 59556 SACKsRcvd 0 SACKBlocksRcvd 0 PreCongSumCwnd 375032 PreCongSumRTT 12 PostCongSumRTT 15 PostCongCountRTT 4 ECERcvd 0 SendStall 0 QuenchRcvd 0 RetranThresh 29 NonRecovDA 0 AckAfterFR 0 DSACKDups 0 SampleRTT 3 SmoothedRTT 3 RTTVar 50 MaxRTT 46 MinRTT 2 SumRTT 158191 CountRTT 47830 CurRTO 203 MaxRTO 237 MinRTO 203 CurMSS 1448 MaxMSS 1448 MinMSS 524 X_Sndbuf 1919232 X_Rcvbuf 87380 CurRetxQueue 0 MaxRetxQueue 0 CurAppWQueue 1786832 MaxAppWQueue 1886744 CurRwinSent 5888 MaxRwinSent 5888 MinRwinSent 5840 LimRwin 0 DupAcksOut 0 CurReasmQueue 0 MaxReasmQueue 0 CurAppRQueue 0 MaxAppRQueue 0 X_rcv_ssthresh 5840 X_wnd_clamp 64087 X_dbg1 5888 X_dbg2 536 X_dbg3 5840 X_dbg4 0 CurRwinRcvd 3137536 MaxRwinRcvd 3137536 MinRwinRcvd 17896 LocalAddressType 1 LocalAddress 198.2.1.2 LocalPort 38054 RemAddress 131.225.2.16 RemPort 5001 X_RcvRTT 0 ...............................................................
With SACK On Throughput: 178Mbps Connection 3 (131.225.2.22 22 131.225.82.152 52973) State 5 SACKEnabled 3 TimestampsEnabled 1 NagleEnabled 0 ECNEnabled 0 SndWinScale 11 RcvWinScale 7 ActiveOpen 0 MSSRcvd 0 WinScaleRcvd 11 WinScaleSent 7 PktsOut 230 DataPktsOut 230 DataBytesOut 25783 PktsIn 353 DataPktsIn 164 DataBytesIn 11120 SndUna 2809669838 SndNxt 2809669838 SndMax 2809669838 ThruBytesAcked 18423 SndISS 2809651415 RcvNxt 2817947310 ThruBytesReceived 11120 RecvISS 2817936190 StartTimeSec 1208271915 StartTimeUsec 71844 Duration 2362591841 SndLimTransSender 6 SndLimBytesSender 25783 SndLimTimeSender 2273927770 SndLimTransCwnd 5 SndLimBytesCwnd 0 SndLimTimeCwnd 1047 SndLimTransRwin 0 SndLimBytesRwin 0 SndLimTimeRwin 0 SlowStart 0 CongAvoid 0 CongestionSignals 0 OtherReductions 0 X_OtherReductionsCV 0 X_OtherReductionsCM 0 CongestionOverCount 0 CurCwnd 5792 MaxCwnd 13032 CurSsthresh 4294966376 LimCwnd 4294965848 MaxSsthresh 0 MinSsthresh 4294967295 FastRetran 0 Timeouts 0 SubsequentTimeouts 0 CurTimeoutCount 0 AbruptTimeouts 0 PktsRetrans 0 BytesRetrans 0 DupAcksIn 0 SACKsRcvd 0 SACKBlocksRcvd 0 PreCongSumCwnd 0 PreCongSumRTT 0 PostCongSumRTT 0 PostCongCountRTT 0 ECERcvd 0 SendStall 0 QuenchRcvd 0 RetranThresh 3 NonRecovDA 0 AckAfterFR 0 DSACKDups 0 SampleRTT 0 SmoothedRTT 3 RTTVar 50 MaxRTT 40 MinRTT 0 SumRTT 1269 CountRTT 221 CurRTO 203 MaxRTO 234 MinRTO 201 CurMSS 1448 MaxMSS 1448 MinMSS 1428 X_Sndbuf 16384 X_Rcvbuf 87380 CurRetxQueue 0 MaxRetxQueue 0 CurAppWQueue 0 MaxAppWQueue 0 CurRwinSent 14208 MaxRwinSent 14208 MinRwinSent 5792 LimRwin 8365440 DupAcksOut 0 CurReasmQueue 0 MaxReasmQueue 0 CurAppRQueue 0 MaxAppRQueue 1152 X_rcv_ssthresh 14144 X_wnd_clamp 64087 X_dbg1 14208 X_dbg2 1152 X_dbg3 14144 X_dbg4 0 CurRwinRcvd 3749888 MaxRwinRcvd 3749888 MinRwinRcvd 3747840 LocalAddressType 1 LocalAddress 131.225.2.22 LocalPort 22 RemAddress 131.225.82.152 RemPort 52973 X_RcvRTT 405000 .................................................................. wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-15 0:48 ` John Heffner 2008-04-15 8:25 ` Ilpo Järvinen 2008-04-15 15:45 ` Wenji Wu @ 2008-04-15 16:39 ` Wenji Wu 2008-04-15 17:01 ` John Heffner 2 siblings, 1 reply; 56+ messages in thread From: Wenji Wu @ 2008-04-15 16:39 UTC (permalink / raw) To: John Heffner; +Cc: Ilpo Järvinen, Netdev My fault, resent. > Still the mystery remains as to why *both* are going so slowly. You > mentioned you're using a web100 kernel. What are the final values of > all the variables for the connections (grab with readall)? kernel 2.6.24 "echo 1 > /proc/sys/net/ipv4/tcp_no_metrics_save" .............................................................................................. With SACK On, throughput: 179Mbps Connection 4 (198.2.1.2 56648 131.225.2.16 5001) State 1 SACKEnabled 3 TimestampsEnabled 1 NagleEnabled 1 ECNEnabled 0 SndWinScale 11 RcvWinScale 7 ActiveOpen 1 MSSRcvd 0 WinScaleRcvd 11 WinScaleSent 7 PktsOut 154770 DataPktsOut 154770 DataBytesOut 226294264 PktsIn 149398 DataPktsIn 0 DataBytesIn 0 SndUna 930060039 SndNxt 930060039 SndMax 930060039 ThruBytesAcked 224092186 SndISS 705967853 RcvNxt 4282199280 ThruBytesReceived 0 RecvISS 4282199280 StartTimeSec 1208277286 StartTimeUsec 813964 Duration 13984145 SndLimTransSender 3 SndLimBytesSender 7208 SndLimTimeSender 3107 SndLimTransCwnd 2 SndLimBytesCwnd 226287056 SndLimTimeCwnd 10003734 SndLimTransRwin 0 SndLimBytesRwin 0 SndLimTimeRwin 0 SlowStart 0 CongAvoid 0 CongestionSignals 2 OtherReductions 19402 X_OtherReductionsCV 0 X_OtherReductionsCM 19402 CongestionOverCount 13 CurCwnd 4344 MaxCwnd 102808 CurSsthresh 94894680 LimCwnd 4294965848 MaxSsthresh 94894680 MinSsthresh 7240 FastRetran 2 Timeouts 0 SubsequentTimeouts 0 CurTimeoutCount 0 AbruptTimeouts 0 PktsRetrans 7 BytesRetrans 10136 DupAcksIn 41940 SACKsRcvd 118692 SACKBlocksRcvd 189919 PreCongSumCwnd 91224 PreCongSumRTT 6 PostCongSumRTT 7 PostCongCountRTT 2 ECERcvd 0 SendStall 0 QuenchRcvd 0 RetranThresh 30 NonRecovDA 0 AckAfterFR 0 DSACKDups 0 SampleRTT 3 SmoothedRTT 3 RTTVar 50 MaxRTT 4 MinRTT 2 SumRTT 142655 CountRTT 43932 CurRTO 203 MaxRTO 204 MinRTO 203 CurMSS 1448 MaxMSS 1448 MinMSS 524 X_Sndbuf 206976 X_Rcvbuf 87380 CurRetxQueue 0 MaxRetxQueue 0 CurAppWQueue 130320 MaxAppWQueue 237472 CurRwinSent 5888 MaxRwinSent 5888 MinRwinSent 5840 LimRwin 0 DupAcksOut 0 CurReasmQueue 0 MaxReasmQueue 0 CurAppRQueue 0 MaxAppRQueue 0 X_rcv_ssthresh 5840 X_wnd_clamp 64087 X_dbg1 5888 X_dbg2 536 X_dbg3 5840 X_dbg4 0 CurRwinRcvd 3137536 MaxRwinRcvd 3137536 MinRwinRcvd 17896 LocalAddressType 1 LocalAddress 198.2.1.2 LocalPort 56648 RemAddress 131.225.2.16 RemPort 5001 X_RcvRTT 0 [root@gw004 ipv4]# ..................................................................
With SACK Off: Throughput: 258Mbps Connection 5 (198.2.1.2 43578 131.225.2.16 5001) State 1 SACKEnabled 0 TimestampsEnabled 1 NagleEnabled 1 ECNEnabled 0 SndWinScale 11 RcvWinScale 7 ActiveOpen 1 MSSRcvd 0 WinScaleRcvd 11 WinScaleSent 7 PktsOut 223011 DataPktsOut 223011 DataBytesOut 326318584 PktsIn 216404 DataPktsIn 0 DataBytesIn 0 SndUna 4002973902 SndNxt 4002973902 SndMax 4002973902 ThruBytesAcked 322904090 SndISS 3680069812 RcvNxt 2942495629 ThruBytesReceived 0 RecvISS 2942495629 StartTimeSec 1208277475 StartTimeUsec 779859 Duration 18149747 SndLimTransSender 4 SndLimBytesSender 10456 SndLimTimeSender 3787 SndLimTransCwnd 3 SndLimBytesCwnd 326308128 SndLimTimeCwnd 10006059 SndLimTransRwin 0 SndLimBytesRwin 0 SndLimTimeRwin 0 SlowStart 0 CongAvoid 0 CongestionSignals 3 OtherReductions 13166 X_OtherReductionsCV 0 X_OtherReductionsCM 13166 CongestionOverCount 37 CurCwnd 10136 MaxCwnd 173760 CurSsthresh 94894680 LimCwnd 4294965848 MaxSsthresh 94894680 MinSsthresh 46336 FastRetran 3 Timeouts 0 SubsequentTimeouts 0 CurTimeoutCount 0 AbruptTimeouts 0 PktsRetrans 7 BytesRetrans 10136 DupAcksIn 59484 SACKsRcvd 0 SACKBlocksRcvd 0 PreCongSumCwnd 286704 PreCongSumRTT 12 PostCongSumRTT 11 PostCongCountRTT 3 ECERcvd 0 SendStall 0 QuenchRcvd 0 RetranThresh 23 NonRecovDA 0 AckAfterFR 0 DSACKDups 0 SampleRTT 4 SmoothedRTT 4 RTTVar 50 MaxRTT 6 MinRTT 2 SumRTT 159332 CountRTT 48291 CurRTO 204 MaxRTO 204 MinRTO 203 CurMSS 1448 MaxMSS 1448 MinMSS 524 X_Sndbuf 451584 X_Rcvbuf 87380 CurRetxQueue 0 MaxRetxQueue 0 CurAppWQueue 373584 MaxAppWQueue 454672 CurRwinSent 5888 MaxRwinSent 5888 MinRwinSent 5840 LimRwin 0 DupAcksOut 0 CurReasmQueue 0 MaxReasmQueue 0 CurAppRQueue 0 MaxAppRQueue 0 X_rcv_ssthresh 5840 X_wnd_clamp 64087 X_dbg1 5888 X_dbg2 536 X_dbg3 5840 X_dbg4 0 CurRwinRcvd 3137536 MaxRwinRcvd 3137536 MinRwinRcvd 17896 LocalAddressType 1 LocalAddress 198.2.1.2 LocalPort 43578 RemAddress 131.225.2.16 RemPort 5001 X_RcvRTT 0 [root@gw004 ipv4]# ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-15 16:39 ` Wenji Wu @ 2008-04-15 17:01 ` John Heffner 2008-04-15 17:08 ` Ilpo Järvinen 2008-04-15 17:55 ` Wenji Wu 0 siblings, 2 replies; 56+ messages in thread From: John Heffner @ 2008-04-15 17:01 UTC (permalink / raw) To: Wenji Wu; +Cc: Ilpo Järvinen, Netdev On Tue, Apr 15, 2008 at 9:39 AM, Wenji Wu <wenji@fnal.gov> wrote: > SlowStart 0 > CongAvoid 0 > CongestionSignals 3 > OtherReductions 13166 > X_OtherReductionsCV 0 > X_OtherReductionsCM 13166 > CongestionOverCount 37 > CurCwnd 10136 > > MaxCwnd 173760 > CurSsthresh 94894680 > LimCwnd 4294965848 > MaxSsthresh 94894680 > MinSsthresh 46336 We can see that in both cases you are getting throttled by tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why it's reaching this code - I would have thought that the high tp->reordering would prevent this. Ilpo, do you have any insights? It's not all that surprising that packets_in_flight is a higher value with newreno than sack, which would explain the higher window with newreno. Wenji, the web100 kernel has a sysctl - WAD_MaxBurst. I suspect it may make a significant difference if you set this to a large value. -John ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-15 17:01 ` John Heffner @ 2008-04-15 17:08 ` Ilpo Järvinen 2008-04-15 17:23 ` John Heffner 2008-04-15 17:55 ` Wenji Wu 1 sibling, 1 reply; 56+ messages in thread From: Ilpo Järvinen @ 2008-04-15 17:08 UTC (permalink / raw) To: John Heffner; +Cc: Wenji Wu, Netdev On Tue, 15 Apr 2008, John Heffner wrote: > On Tue, Apr 15, 2008 at 9:39 AM, Wenji Wu <wenji@fnal.gov> wrote: > > SlowStart 0 > > CongAvoid 0 > > CongestionSignals 3 > > OtherReductions 13166 > > X_OtherReductionsCV 0 > > X_OtherReductionsCM 13166 > > CongestionOverCount 37 > > CurCwnd 10136 > > > > MaxCwnd 173760 > > CurSsthresh 94894680 > > LimCwnd 4294965848 > > MaxSsthresh 94894680 > > MinSsthresh 46336 > > We can see that in both cases you are getting throttled by > tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why > it's reaching this code - I would have thought that the high > tp->reordering would prevent this. Ilpo, do you have any insights? What makes you think so? It's called from tcp_try_to_open, as anyone can read from the source, basically when our state is CA_Disorder (some very small portion might happen in CA_Recovery besides that). -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-15 17:08 ` Ilpo Järvinen @ 2008-04-15 17:23 ` John Heffner 2008-04-15 18:00 ` Matt Mathis 0 siblings, 1 reply; 56+ messages in thread From: John Heffner @ 2008-04-15 17:23 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: Wenji Wu, Netdev On Tue, Apr 15, 2008 at 10:08 AM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > On Tue, 15 Apr 2008, John Heffner wrote: > > We can see that in both cases you are getting throttled by > > tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why > > it's reaching this code - I would have thought that the high > > tp->reordering would prevent this. Ilpo, do you have any insights? > > What makes you think so? It's called from tcp_try_to_open, as anyone can > read from the source, basically when our state is CA_Disorder (some very > small portion might happen in CA_Recovery besides that). This is what X_OtherReductionsCM instruments, and that was the only thing holding back cwnd. I just looked at the source, and indeed it will be called on every ack when we are in the disorder state. Limiting cwnd to packets_in_flight() + 3 here is going to prevent cwnd from growing when the reordering is greater than 3. Making max_burst at least tp->reordering should help some, though I'm not sure it's the right thing to do. -John ^ permalink raw reply [flat|nested] 56+ messages in thread
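The clamp under discussion, approximately as it stood in 2.6.24's net/ipv4/tcp_input.c (from memory; treat as a sketch):

    /* Bound cwnd to what is actually in flight plus a small burst
     * allowance; applied on ACKs processed in the CA_Disorder state. */
    static void tcp_moderate_cwnd(struct tcp_sock *tp)
    {
            tp->snd_cwnd = min(tp->snd_cwnd,
                               tcp_packets_in_flight(tp) +
                               tcp_max_burst(tp));
            tp->snd_cwnd_stamp = tcp_time_stamp;
    }

With tcp_max_burst() fixed at 3, any hole wider than three packets drags snd_cwnd down toward the deflated in-flight count on every ACK, which matches the throttling visible in the web100 dumps above.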
* Re: RE: A Linux TCP SACK Question 2008-04-15 17:23 ` John Heffner @ 2008-04-15 18:00 ` Matt Mathis 0 siblings, 0 replies; 56+ messages in thread From: Matt Mathis @ 2008-04-15 18:00 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: John Heffner, Wenji Wu, Netdev In some future kernel release, I would consider changing it to limit cwnd to be less than packets_in_flight() + reorder + 3(?). If the network is reordering packets, then it has to accept bursts, otherwise TCP can never open the window. The +3 (or some other constant) is still needed because TCP has to send extra packets at the point where the window changes. As an alternative, you could write a research paper on how the network could do LIFO packet scheduling so the reordering serves as a congestion signal to the stacks. I bet it would have some really interesting properties. Oh wait, April 1st was 2 weeks ago. Thanks, --MM-- On Tue, 15 Apr 2008, John Heffner wrote: > On Tue, Apr 15, 2008 at 10:08 AM, Ilpo Järvinen > <ilpo.jarvinen@helsinki.fi> wrote: >> On Tue, 15 Apr 2008, John Heffner wrote: >> > We can see that in both cases you are getting throttled by >> > tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why >> > it's reaching this code - I would have thought that the high >> > tp->reordering would prevent this. Ilpo, do you have any insights? >> >> What makes you think so? It's called from tcp_try_to_open, as anyone can >> read from the source, basically when our state is CA_Disorder (some very >> small portion might happen in CA_Recovery besides that). > > This is what X_OtherReductionsCM instruments, and that was the only > thing holding back cwnd. > > I just looked at the source, and indeed it will be called on every ack > when we are in the disorder state. Limiting cwnd to > packets_in_flight() + 3 here is going to prevent cwnd from growing > when the reordering is greater than 3. Making max_burst at least > tp->reordering should help some, though I'm not sure it's the right > thing to do. > > -John > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: RE: A Linux TCP SACK Question 2008-04-15 17:01 ` John Heffner 2008-04-15 17:08 ` Ilpo Järvinen @ 2008-04-15 17:55 ` Wenji Wu 1 sibling, 0 replies; 56+ messages in thread From: Wenji Wu @ 2008-04-15 17:55 UTC (permalink / raw) To: 'John Heffner'; +Cc: 'Ilpo Järvinen', 'Netdev' >We can see that in both cases you are getting throttled by >tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why >it's reaching this code - I would have thought that the high >tp->reordering would prevent this. Ilpo, do you have any insights? >It's not all that surprising that packets_in_flight is a higher value >with newreno than sack, which would explain the higher window with >newreno. >Wenji, the web100 kernel has a sysctl - WAD_MaxBurst. I suspect it >may make a significant difference if you set this to a large value. It is surprising! When I increase WAD_MaxBurst (patched with Web100) from 3 to 20, the throughput in both cases (SACK on/off) saturates the 1Gbps link! ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-08 12:33 ` Wenji Wu 2008-04-08 13:45 ` Ilpo Järvinen @ 2008-04-08 15:57 ` John Heffner 1 sibling, 0 replies; 56+ messages in thread From: John Heffner @ 2008-04-08 15:57 UTC (permalink / raw) To: Wenji Wu; +Cc: Ilpo Järvinen, Sangtae Ha, Netdev On Tue, Apr 8, 2008 at 5:33 AM, Wenji Wu <wenji@fnal.gov> wrote: > > NewReno never retransmitted anything in them (except at the very end > > of > > the transfer). Probably something related to how tp->reordering behaves > > I suppose... > > Yes, the adaptive tp->reordering will play a role here. I remember several years ago when I first looked at chronic reordering with a high BDP, the problems I had were: 1) Only acks of new data can advance cwnd, and these only advance by the normal amount per ack, so cwnd grows very slowly. 2) Reordering caused slow start to exit early, before the reordering threshold had adapted. 3) The "undo" code didn't work well because of cwnd moderation. 4) There were bugs in the reordering calculation that caused the threshold to be pulled back. Some of these shouldn't matter to you because your RTT is low, but I thought it would be worth mentioning. I'm not sure what is keeping your cwnd from growing -- it always seems to be within a small range in both cases, which is not right unless there's a bottleneck at the sender. The fact that reno does a little better than sack seems like the less important problem. Also, what's the behavior when turning off reordering, in each or both directions? -John ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: A Linux TCP SACK Question 2008-04-08 6:36 ` Ilpo Järvinen 2008-04-08 12:33 ` Wenji Wu @ 2008-04-08 14:07 ` John Heffner 2008-04-14 16:10 ` Wenji Wu 2 siblings, 0 replies; 56+ messages in thread From: John Heffner @ 2008-04-08 14:07 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: Wenji Wu, Sangtae Ha, Netdev On Mon, Apr 7, 2008 at 11:36 PM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > > On Mon, 7 Apr 2008, Wenji Wu wrote: > > > >I don't think reorderings frequently happened in your directly > > >connected networking scenario. Please post your tcpdump file for > > >clearing out all doubts. > > > > https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/ > > > > Two tcpdump files: one with SACK on, the other with SACK off. The test > > configures described in my previous emails. > > NewReno never retransmitted anything in them (except at the very end of > the transfer). Probably something related to how tp->reordering behaves > I suppose... Yes, this looks very suspicious. Can we see this again with TSO off? -John ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: A Linux TCP SACK Question 2008-04-08 6:36 ` Ilpo Järvinen 2008-04-08 12:33 ` Wenji Wu 2008-04-08 14:07 ` John Heffner @ 2008-04-14 16:10 ` Wenji Wu 2008-04-14 16:48 ` Ilpo Järvinen 2 siblings, 1 reply; 56+ messages in thread From: Wenji Wu @ 2008-04-14 16:10 UTC (permalink / raw) To: 'Ilpo Järvinen' Cc: 'Sangtae Ha', 'John Heffner', 'Netdev' Hi, Ilpo, The latest results have been posted to: https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/ The kernel under test is: Linux-2.6.25-rc9. I have checked its changelog, which shows that your latest fix is included. In the tests, I vary the tcp_frto (0, 1, and 2) with SACK On/Off. The experiment works as: Sender --- Router --- Receiver Iperf is sending from the sender to the receiver. In between there is an emulated router which runs netem. The emulated router has two interfaces, both with netem configured. One interface emulates the forward path and the other the reverse path. Both netem interfaces are configured with 1.5ms delay and 0.15ms variance. No packet drops in the tests or the packet captures. All of these systems are multi-core platforms, with 2GHz+ CPUs. I ran top to verify; the CPUs are idle most of the time. wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: A Linux TCP SACK Question 2008-04-14 16:10 ` Wenji Wu @ 2008-04-14 16:48 ` Ilpo Järvinen 2008-04-14 22:07 ` Wenji Wu 0 siblings, 1 reply; 56+ messages in thread From: Ilpo Järvinen @ 2008-04-14 16:48 UTC (permalink / raw) To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev' On Mon, 14 Apr 2008, Wenji Wu wrote: > The latest results have been posted to: > > https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/ > > The kernel under test is: Linux-2.6.25-rc9. I have checked its > changelog, which shows that your latest fix is included. Hmm, now there are even fewer retransmissions (barely some with the SACK in the end). I suppose the reordering detection is good enough to kill them. ...You could perhaps figure that out from MIBs if you would want to. > In the tests, I vary the tcp_frto (0, 1, and 2) with SACK On/Off. ...I should have said more clearly last time already that these are not significant with your workload. > The experiment works as: > > Sender --- Router --- Receiver > > Iperf is sending from the sender to the receiver. In between there is an > emulated router which runs netem. The emulated router has two interfaces, > both with netem configured. One interface emulates the forward path and the > other the reverse path. Both netem interfaces are configured with 1.5ms > delay and 0.15ms variance. No packet drops in the tests or the packet captures. ...How about this theory: Forward path reordering causes duplicate ACKs due to old segments. These are treated differently for NewReno and SACK: NewReno => Sends new data out (limited xmit; it's not limited to two segments in Linux as per the RFC; however, the RFC doesn't consider autotuning of DupThresh either). SACK => No new SACK block discovered. Packets in flight remain the same, and thus no new segment is sent. ...What do others think? I guess it should be visible with fwd path reordering alone, though the added distance with reverse path reordering might act as an amplifier, because NewReno benefits from shorter-RTT packets when a fwd path old segment arrives, while SACK loses its ability to increase outstanding data... ...A quick look into it with tcptrace's outstanding data plot: it seems that NewReno levels at ~100000 and SACK at ~68000. ...I think SACK just knows too much? :-/ > All of these systems are multi-core platforms, with 2GHz+ CPUs. I ran > top to verify; the CPUs are idle most of the time. Thanks for adding this for others. I agree with you that this is not a CPU horsepower issue. -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
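Ilpo's theory maps directly onto the in-flight accounting; roughly, from include/net/tcp.h of that era (a sketch from memory): for NewReno, sacked_out is simply a count of duplicate ACKs, so each dupACK produced by an old reordered segment lowers packets-in-flight and lets one new segment out (limited transmit), whereas for SACK a dupACK carrying no new SACK block leaves sacked_out, and therefore the in-flight estimate, unchanged.

    static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
    {
            /* SACKed (or, for NewReno, dupACK-inferred) plus marked lost */
            return tp->sacked_out + tp->lost_out;
    }

    static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp)
    {
            return tp->packets_out - tcp_left_out(tp) + tp->retrans_out;
    }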
* Re: RE: A Linux TCP SACK Question 2008-04-14 16:48 ` Ilpo Järvinen @ 2008-04-14 22:07 ` Wenji Wu 2008-04-15 8:23 ` Ilpo Järvinen 0 siblings, 1 reply; 56+ messages in thread From: Wenji Wu @ 2008-04-14 22:07 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: 'Netdev' > Hmm, now there are even fewer retransmissions (barely some with > the SACK in the end). > > I suppose the reordering detection is good enough to kill them. ...You > > could perhaps figure that out from MIBs if you would want to. > Yes, web100 shows that tcp_reordering can get as large as 127. I just reran the following experiments to show why there are so few retransmissions in my previous posts. (1) Flush the system routing cache by running "ip route flush cache" before running and tcpdumping the traffic. (2) Before running and tcpdumping the traffic, run a data transmission test to generate tcp_reordering in the routing cache. Do not flush the routing cache. Then run and tcpdump the traffic. Both experiments with SACK off. The results are posted to https://plone3.fnal.gov/P0/WAN/Members/wenji/adaptive_tcp_reordering/ So, the few retransmissions in my previous post really are caused by the routing cache. But flushing the cache has nothing to do with SACK on/off. Still, the throughput with SACK off is better than with SACK on. wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-14 22:07 ` Wenji Wu @ 2008-04-15 8:23 ` Ilpo Järvinen 0 siblings, 0 replies; 56+ messages in thread From: Ilpo Järvinen @ 2008-04-15 8:23 UTC (permalink / raw) To: Wenji Wu; +Cc: 'Netdev' On Mon, 14 Apr 2008, Wenji Wu wrote: > > > Hmm, now there are even fewer retransmissions (barely some with > > the SACK in the end). > > > > I suppose the reordering detection is good enough to kill them. ...You > > > > could perhaps figure that out from MIBs if you would want to. > > > > Yes, web100 shows that tcp_reordering can get as large as 127. It should get large, though I suspect newreno's new value (tp->packets_out + addend) might have tp->packets_out too much in it. > I just reran the following experiments to show why there are so few > retransmissions in my previous posts. > > (1) Flush the system routing cache by running "ip route flush cache" > before running and tcpdumping the traffic. I didn't know that worked; the tcp_no_metrics_save sysctl seems to prevent saving them from a running TCP flow when the flow ends. > (2) Before running and tcpdumping the traffic, run a data transmission > test to generate tcp_reordering in the routing cache. Do not flush the > routing cache. Then run and tcpdump the traffic. > > Both experiments with SACK off. > > The results are posted to > https://plone3.fnal.gov/P0/WAN/Members/wenji/adaptive_tcp_reordering/ > > So, the few retransmissions in my previous post really are caused by the > routing cache. Yes. Remember, however, that the initial metrics also have an effect on the initial ssthresh, so one must be very careful not to cause unfairness through them if the metrics are not cleared. > But flushing the cache has nothing to do with SACK on/off. Still, the > throughput with SACK off is better than with SACK on. Yes, I think it alone would never explain it. Though a difference in initial ssthresh might have been the explanation for the different level where outstanding data settled in the logs without any retransmissions. -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
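The routing-cache interaction mentioned here lives in tcp_update_metrics() in net/ipv4/tcp_input.c; a heavily trimmed sketch from memory (guards such as dst_metric_locked() and all the RTT/cwnd/ssthresh updates are omitted, and the exact field access is unverified) of the two pieces relevant to this thread:

    static void tcp_update_metrics(struct sock *sk)
    {
            struct tcp_sock *tp = tcp_sk(sk);
            struct dst_entry *dst = __sk_dst_get(sk);

            if (sysctl_tcp_nometrics_save)
                    return;  /* tcp_no_metrics_save=1 disables all of this */

            /* ... RTT, cwnd and ssthresh metric updates elided ... */

            /* The learned reordering degree is saved in the route's
             * metrics, so a later flow to the same destination starts
             * with an already-raised dupthresh. */
            if (tp->reordering != sysctl_tcp_reordering)
                    dst->metrics[RTAX_REORDERING - 1] = tp->reordering;
    }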
* Re: RE: A Linux TCP SACK Question 2008-04-04 21:33 ` Ilpo Järvinen 2008-04-04 21:39 ` Ilpo Järvinen @ 2008-04-04 21:40 ` Wenji Wu 1 sibling, 0 replies; 56+ messages in thread From: Wenji Wu @ 2008-04-04 21:40 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: 'John Heffner', 'Netdev' > On Fri, 4 Apr 2008, Wenji Wu wrote: > > > > > >I'd suggest that you don't waste too much effort for 2.6.24. > ...Most of it > > >is recoded/updated since then. > > > > I just tried it on 2.6.25-rc8. The result is still the same: the throughput > > with SACK on is less than with SACK off. > > Hmm, can you also try whether playing around with the FRTO setting makes some > difference (tcp_frto sysctl)? Still the same; I just tried with FRTO and FACK. No difference: SACK on is worse than SACK off. wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2008-08-28 19:30 UTC | newest] Thread overview: 56+ messages 2008-04-04 4:54 A Linux TCP SACK Question Wenji Wu 2008-04-04 16:27 ` John Heffner 2008-04-04 17:49 ` Wenji Wu 2008-04-04 18:07 ` John Heffner 2008-04-04 20:00 ` Ilpo Järvinen 2008-04-04 20:07 ` Wenji Wu 2008-04-04 21:15 ` Wenji Wu 2008-04-04 21:33 ` Ilpo Järvinen 2008-04-04 21:39 ` Ilpo Järvinen 2008-04-04 22:14 ` Wenji Wu 2008-04-05 17:42 ` Ilpo Järvinen 2008-04-05 21:17 ` Sangtae Ha 2008-04-06 20:27 ` Wenji Wu 2008-04-06 22:43 ` Sangtae Ha 2008-04-07 14:56 ` Wenji Wu 2008-04-08 6:36 ` Ilpo Järvinen 2008-04-08 12:33 ` Wenji Wu 2008-04-08 13:45 ` Ilpo Järvinen 2008-04-08 14:30 ` Wenji Wu 2008-04-08 14:59 ` Ilpo Järvinen 2008-04-08 15:27 ` Wenji Wu 2008-04-08 17:26 ` Ilpo Järvinen 2008-04-14 22:47 ` Wenji Wu 2008-04-15 0:48 ` John Heffner 2008-04-15 8:25 ` Ilpo Järvinen 2008-04-15 18:01 ` Wenji Wu 2008-04-15 22:40 ` John Heffner 2008-04-16 8:27 ` David Miller 2008-04-16 9:21 ` Ilpo Järvinen 2008-04-16 9:35 ` David Miller 2008-04-16 14:50 ` Wenji Wu 2008-04-18 6:52 ` David Miller 2008-08-27 14:38 ` about Linux adaptively adjusting ssthresh Wenji Wu 2008-08-27 22:48 ` John Heffner 2008-08-28 0:53 ` Wenji Wu 2008-08-28 6:34 ` Ilpo Järvinen 2008-08-28 14:20 ` about Linux adaptively adjusting dupthresh Wenji Wu 2008-08-28 18:53 ` Ilpo Järvinen 2008-08-28 19:30 ` Wenji Wu 2008-04-16 14:40 ` A Linux TCP SACK Question John Heffner 2008-04-16 15:03 ` Ilpo Järvinen 2008-04-16 14:46 ` Wenji Wu 2008-04-15 15:45 ` Wenji Wu 2008-04-15 16:39 ` Wenji Wu 2008-04-15 17:01 ` John Heffner 2008-04-15 17:08 ` Ilpo Järvinen 2008-04-15 17:23 ` John Heffner 2008-04-15 18:00 ` Matt Mathis 2008-04-15 17:55 ` Wenji Wu 2008-04-08 15:57 ` John Heffner 2008-04-08 14:07 ` John Heffner 2008-04-14 16:10 ` Wenji Wu 2008-04-14 16:48 ` Ilpo Järvinen 2008-04-14 22:07 ` Wenji Wu 2008-04-15 8:23 ` Ilpo Järvinen 2008-04-04 21:40 ` Wenji Wu