* A Linux TCP SACK Question
@ 2008-04-04 4:54 Wenji Wu
2008-04-04 16:27 ` John Heffner
0 siblings, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-04 4:54 UTC (permalink / raw)
To: netdev
Hi, could anybody help me out with Linux TCP SACK? Thanks in advance.
I run iperf to send traffic from a sender to a receiver, and add packet reordering in both forward and reverse directions. I found that when I turn off the SACK/DSACK option, the throughput is better than with SACK/DSACK on. How can this happen? Did anybody encounter this phenomenon before?
thanks,
wenji
* Re: A Linux TCP SACK Question
2008-04-04 4:54 A Linux TCP SACK Question Wenji Wu
@ 2008-04-04 16:27 ` John Heffner
2008-04-04 17:49 ` Wenji Wu
0 siblings, 1 reply; 56+ messages in thread
From: John Heffner @ 2008-04-04 16:27 UTC (permalink / raw)
To: Wenji Wu; +Cc: netdev
Unless you're sending very fast, where the computational overhead of
processing SACK blocks is slowing you down, this is not expected
behavior. Do you have more detail? What is the window size, and how
much reordering?
Full binary tcpdumps are very useful in diagnosing this type of problem.
-John
On Thu, Apr 3, 2008 at 9:54 PM, Wenji Wu <wenji@fnal.gov> wrote:
> Hi, could anybody help me out with Linux TCP SACK? Thanks in advance.
>
> I run iperf to send traffic from a sender to a receiver, and add packet reordering in both forward and reverse directions. I found that when I turn off the SACK/DSACK option, the throughput is better than with SACK/DSACK on. How can this happen? Did anybody encounter this phenomenon before?
>
>
> thanks,
>
> wenji
* RE: A Linux TCP SACK Question
2008-04-04 16:27 ` John Heffner
@ 2008-04-04 17:49 ` Wenji Wu
2008-04-04 18:07 ` John Heffner
2008-04-04 20:00 ` Ilpo Järvinen
0 siblings, 2 replies; 56+ messages in thread
From: Wenji Wu @ 2008-04-04 17:49 UTC (permalink / raw)
To: 'John Heffner'; +Cc: netdev
Hi, John,
Thanks,
I just sat down with Richard Clarson and reproduced the phenomenon.
The experiment is set up as follows:
Sender --- Router --- Receiver
Iperf is sending from the sender to the receiver. In between there is an
emulated router which runs netem. The emulated router has two interfaces,
both with netem configured. One interface emulates the forward path and the
other the reverse path. Both netem interfaces are configured with 1.5ms
delay and 0.15ms variance. No packet drops. Every system runs Linux 2.6.24.
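For reference, the router's netem setup is along these lines (a sketch;
eth0/eth1 stand in for the actual interface names, and the delay variance
is what introduces the reordering):

    tc qdisc add dev eth0 root netem delay 1.5ms 0.15ms
    tc qdisc add dev eth1 root netem delay 1.5ms 0.15ms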
When SACK is on, the throughput is around 180Mbps.
When SACK is off, the throughput is around 260Mbps.
I am sure it is not due to the computational overhead of processing SACK
blocks. All of these systems are multi-core platforms with 2GHz+ CPUs. I ran
top to verify; the CPUs are idle most of the time.
I was thinking that the reordered ACKs/SACKs might cause confusion in the
sender, and the sender will unnecessarily reduce either the CWND or the
TCP_REORDERING threshold. I might need to take a serious look at the SACK
implementation.
I will send out the tcpdump files soon,
Thanks,
wenji
-----Original Message-----
From: John Heffner [mailto:johnwheffner@gmail.com]
Sent: Friday, April 04, 2008 11:28 AM
To: Wenji Wu
Cc: netdev@vger.kernel.org
Subject: Re: A Linux TCP SACK Question
Unless you're sending very fast, where the computational overhead of
processing SACK blocks is slowing you down, this is not expected
behavior. Do you have more detail? What is the window size, and how
much reordering?
Full binary tcpdumps are very useful in diagnosing this type of problem.
-John
On Thu, Apr 3, 2008 at 9:54 PM, Wenji Wu <wenji@fnal.gov> wrote:
> Hi, could anybody help me out with Linux TCP SACK? Thanks in advance.
>
> I run iperf to send traffic from a sender to a receiver, and add packet
> reordering in both forward and reverse directions. I found that when I
> turn off the SACK/DSACK option, the throughput is better than with
> SACK/DSACK on. How can this happen? Did anybody encounter this
> phenomenon before?
>
>
> thanks,
>
> wenji
* Re: A Linux TCP SACK Question
2008-04-04 17:49 ` Wenji Wu
@ 2008-04-04 18:07 ` John Heffner
2008-04-04 20:00 ` Ilpo Järvinen
1 sibling, 0 replies; 56+ messages in thread
From: John Heffner @ 2008-04-04 18:07 UTC (permalink / raw)
To: wenji; +Cc: netdev
On Fri, Apr 4, 2008 at 10:49 AM, Wenji Wu <wenji@fnal.gov> wrote:
> I was thinking that the reordered ACKs/SACKs might cause confusion in the
> sender, and the sender will unnecessarily reduce either the CWND or the
> TCP_REORDERING threshold. I might need to take a serious look at the SACK
> implementation.
It sounds very likely that you're encountering a bug or thinko in the SACK code.
This actually brings to mind an old topic -- NCR (RFC4653). There was
some discussion of implementing this, which I think is simpler and
more robust than Linux's current reordering threshold calculation.
-John
* RE: A Linux TCP SACK Question
2008-04-04 17:49 ` Wenji Wu
2008-04-04 18:07 ` John Heffner
@ 2008-04-04 20:00 ` Ilpo Järvinen
2008-04-04 20:07 ` Wenji Wu
2008-04-04 21:15 ` Wenji Wu
1 sibling, 2 replies; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-04 20:00 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'John Heffner', Netdev
On Fri, 4 Apr 2008, Wenji Wu wrote:
> Every system runs Linux 2.6.24.
You should have reported the kernel version right from the beginning. It may
have a huge effect... ;-)
> When SACK is on, the throughput is around 180Mbps.
> When SACK is off, the throughput is around 260Mbps.
Not a surprise: once some reordering is detected, SACK TCP switches away
from FACK to something that's not what you'd expect (in 2.6.24). You should
first try the 2.6.25-rcs, in which the non-FACK code is very close to
RFC3517.
> I was thinking that the reordered ACKs/SACKs might cause confusion in the
> sender, and the sender will unnecessarily reduce either the CWND or the
> TCP_REORDERING threshold. I might need to take a serious look at the
> SACK implementation.
I'd suggest that you don't waste too much effort on 2.6.24. ...Most of it
has been recoded/updated since then.
--
i.
* Re: RE: A Linux TCP SACK Question
2008-04-04 20:00 ` Ilpo Järvinen
@ 2008-04-04 20:07 ` Wenji Wu
2008-04-04 21:15 ` Wenji Wu
1 sibling, 0 replies; 56+ messages in thread
From: Wenji Wu @ 2008-04-04 20:07 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: 'John Heffner', Netdev
> On Fri, 4 Apr 2008, Wenji Wu wrote:
>
> > Every system runs Linux 2.6.24.
>
> You should have reported the kernel version right from the beginning.
> It may have a huge effect... ;-)
>
> > When SACK is on, the throughput is around 180Mbps.
> > When SACK is off, the throughput is around 260Mbps.
>
> Not a surprise: once some reordering is detected, SACK TCP switches
> away from FACK to something that's not what you'd expect (in 2.6.24).
> You should first try the 2.6.25-rcs, in which the non-FACK code is
> very close to RFC3517.
>
> > I was thinking that the reordered ACKs/SACKs might cause confusion
> > in the sender, and the sender will unnecessarily reduce either the
> > CWND or the TCP_REORDERING threshold. I might need to take a serious
> > look at the SACK implementation.
>
> I'd suggest that you don't waste too much effort on 2.6.24. ...Most
> of it has been recoded/updated since then.
Thanks, I will try it on the latest version and report the results.
wenji
* RE: A Linux TCP SACK Question
2008-04-04 20:00 ` Ilpo Järvinen
2008-04-04 20:07 ` Wenji Wu
@ 2008-04-04 21:15 ` Wenji Wu
2008-04-04 21:33 ` Ilpo Järvinen
1 sibling, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-04 21:15 UTC (permalink / raw)
To: 'Ilpo Järvinen'; +Cc: 'John Heffner', 'Netdev'
>I'd suggest that you don't waste too much effort on 2.6.24. ...Most of it
>has been recoded/updated since then.
Hi, Ilpo,
I just tried it on 2.6.25-rc8. The result is still the same: the throughput
with SACK on is less than with SACK off.
wenji
* RE: A Linux TCP SACK Question
2008-04-04 21:15 ` Wenji Wu
@ 2008-04-04 21:33 ` Ilpo Järvinen
2008-04-04 21:39 ` Ilpo Järvinen
2008-04-04 21:40 ` Wenji Wu
0 siblings, 2 replies; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-04 21:33 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'John Heffner', 'Netdev'
On Fri, 4 Apr 2008, Wenji Wu wrote:
>
> >I'd suggest that you don't waste too much effort on 2.6.24. ...Most of it
> >has been recoded/updated since then.
>
> I just tried it on 2.6.25-rc8. The result is still the same: the throughput
> with SACK on is less than with SACK off.
Hmm, can you also try whether playing around with the FRTO setting makes
some difference (the tcp_frto sysctl)?
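That is, on the sender, something like:

    sysctl -w net.ipv4.tcp_frto=0    # FRTO disabled
    sysctl -w net.ipv4.tcp_frto=1    # basic FRTO
    sysctl -w net.ipv4.tcp_frto=2    # SACK-enhanced FRTO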
--
i.
* RE: A Linux TCP SACK Question
2008-04-04 21:33 ` Ilpo Järvinen
@ 2008-04-04 21:39 ` Ilpo Järvinen
2008-04-04 22:14 ` Wenji Wu
2008-04-04 21:40 ` Wenji Wu
1 sibling, 1 reply; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-04 21:39 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'John Heffner', 'Netdev'
On Sat, 5 Apr 2008, Ilpo Järvinen wrote:
> On Fri, 4 Apr 2008, Wenji Wu wrote:
>
> >
> > >I'd suggest that you don't waste too much effort on 2.6.24. ...Most of it
> > >has been recoded/updated since then.
> >
> > I just tried it on 2.6.25-rc8. The result is still the same: the throughput
> > with SACK on is less than with SACK off.
>
> Hmm, can you also try whether playing around with the FRTO setting makes
> some difference (the tcp_frto sysctl)?
...Assuming it wasn't disabled already. If you find that there's a
significant difference, you could also try SACK+basic FRTO (set the
tcp_frto sysctl to 1).
--
i.
* Re: RE: A Linux TCP SACK Question
2008-04-04 21:33 ` Ilpo Järvinen
2008-04-04 21:39 ` Ilpo Järvinen
@ 2008-04-04 21:40 ` Wenji Wu
1 sibling, 0 replies; 56+ messages in thread
From: Wenji Wu @ 2008-04-04 21:40 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: 'John Heffner', 'Netdev'
> On Fri, 4 Apr 2008, Wenji Wu wrote:
>
> > >I'd suggest that you don't waste too much effort on 2.6.24. ...Most of it
> > >has been recoded/updated since then.
> >
> > I just tried it on 2.6.25-rc8. The result is still the same: the throughput
> > with SACK on is less than with SACK off.
>
> Hmm, can you also try whether playing around with the FRTO setting makes
> some difference (the tcp_frto sysctl)?
Still the same. I just tried with FRTO and FACK. No difference; SACK on is worse than SACK off.
wenji
* RE: A Linux TCP SACK Question
2008-04-04 21:39 ` Ilpo Järvinen
@ 2008-04-04 22:14 ` Wenji Wu
2008-04-05 17:42 ` Ilpo Järvinen
2008-04-05 21:17 ` Sangtae Ha
0 siblings, 2 replies; 56+ messages in thread
From: Wenji Wu @ 2008-04-04 22:14 UTC (permalink / raw)
To: 'Ilpo Järvinen'; +Cc: 'John Heffner', 'Netdev'
>...Assuming it wasn't disabled already. If you find that there's a
>significant difference, you could also try SACK+basic FRTO (set the
>tcp_frto sysctl to 1).
No, still the same. I tried tcp_frto with 0, 1, 2.
SACK on is worse than SACK off.
wenji
* RE: A Linux TCP SACK Question
2008-04-04 22:14 ` Wenji Wu
@ 2008-04-05 17:42 ` Ilpo Järvinen
2008-04-05 21:17 ` Sangtae Ha
1 sibling, 0 replies; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-05 17:42 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'John Heffner', 'Netdev'
On Fri, 4 Apr 2008, Wenji Wu wrote:
>
> >...Assuming it wasn't disabled already. If you find that there's a
> >significant difference, you could also try SACK+basic FRTO (set the
> >tcp_frto sysctl to 1).
>
> No, still the same. I tried tcp_frto with 0, 1, 2.
>
> SACK on is worse than SACK off.
No easy solution then; we'll have to take a look at the tcpdumps.
--
i.
* Re: A Linux TCP SACK Question
2008-04-04 22:14 ` Wenji Wu
2008-04-05 17:42 ` Ilpo Järvinen
@ 2008-04-05 21:17 ` Sangtae Ha
2008-04-06 20:27 ` Wenji Wu
1 sibling, 1 reply; 56+ messages in thread
From: Sangtae Ha @ 2008-04-05 21:17 UTC (permalink / raw)
To: wenji; +Cc: Ilpo Järvinen, John Heffner, Netdev
Can you run the attached script and then run your test again?
I think it might be a problem of your dual cores balancing the
interrupts on your testing NIC. As we do a lot of things with SACK,
cache misses etc. might affect your performance.
In the default setting, I disabled TCP segmentation offload and set the
SMP affinity to CPU 0.
Please change "INF" to your interface name and let us know the results.
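The attached tuning.sh is what to actually run; as a rough sketch, it does
something like this (the IRQ lookup shown here is just one way to find the
NIC's interrupt line):

    #!/bin/sh
    INF=eth0                                 # change to your interface name
    IRQ=$(grep $INF /proc/interrupts | cut -d: -f1 | tr -d ' ')
    ethtool -K $INF tso off                  # disable TCP segmentation offload
    echo 1 > /proc/irq/$IRQ/smp_affinity     # pin the NIC's interrupts to CPU 0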
Sangtae
On Fri, Apr 4, 2008 at 6:14 PM, Wenji Wu <wenji@fnal.gov> wrote:
>
> >...Assuming it wasn't disabled already. If you find that there's a
> >significant difference, you could also try SACK+basic FRTO (set the
> >tcp_frto sysctl to 1).
>
> No, still the same. I tried tcp_frto with 0, 1, 2.
>
> SACK on is worse than SACK off.
>
> wenji
[-- Attachment #2: tuning.sh --]
[-- Type: application/x-sh, Size: 1753 bytes --]
* Re: A Linux TCP SACK Question
2008-04-05 21:17 ` Sangtae Ha
@ 2008-04-06 20:27 ` Wenji Wu
2008-04-06 22:43 ` Sangtae Ha
0 siblings, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-06 20:27 UTC (permalink / raw)
To: Sangtae Ha; +Cc: Ilpo Järvinen, John Heffner, Netdev
> Can you run the attached script and then run your test again?
> I think it might be a problem of your dual cores balancing the
> interrupts on your testing NIC. As we do a lot of things with SACK,
> cache misses etc. might affect your performance.
>
> In the default setting, I disabled TCP segmentation offload and set the
> SMP affinity to CPU 0.
> Please change "INF" to your interface name and let us know the results.
I bound both the network interrupts and iperf to CPU0, and CPU0 is idle most of the time. The results are still the same.
At this throughput level, the SACK processing won't take much CPU.
It is not interrupt/CPU affinity that causes the difference.
I believe it is the ACK reordering that causes the confusion in the sender, which leads the sender to unnecessarily reduce the CWND or REORDERING_THRESHOLD.
wenji
* Re: A Linux TCP SACK Question
2008-04-06 20:27 ` Wenji Wu
@ 2008-04-06 22:43 ` Sangtae Ha
2008-04-07 14:56 ` Wenji Wu
0 siblings, 1 reply; 56+ messages in thread
From: Sangtae Ha @ 2008-04-06 22:43 UTC (permalink / raw)
To: Wenji Wu; +Cc: Ilpo Järvinen, John Heffner, Netdev
When our 40 students did the same lab experiment comparing TCP-SACK
and TCP-NewReno, they came up with similar results. The settings were
identical to yours (one Linux sender, one Linux receiver, and one
netem machine in between). When we introduced some loss using netem,
TCP-SACK showed a bit better performance, while they had similar
throughput in most cases.
I don't think reorderings happen frequently in your directly
connected networking scenario. Please post your tcpdump file to
clear up all doubts.
Sangtae
On 4/6/08, Wenji Wu <wenji@fnal.gov> wrote:
>
>
> > Can you run the attached script and then run your test again?
> > I think it might be a problem of your dual cores balancing the
> > interrupts on your testing NIC. As we do a lot of things with SACK,
> > cache misses etc. might affect your performance.
> >
> > In the default setting, I disabled TCP segmentation offload and set the
> > SMP affinity to CPU 0.
> > Please change "INF" to your interface name and let us know the results.
>
> I bound both the network interrupts and iperf to CPU0, and CPU0 is idle
> most of the time. The results are still the same.
>
> At this throughput level, the SACK processing won't take much CPU.
>
> It is not interrupt/CPU affinity that causes the difference.
>
> I believe it is the ACK reordering that causes the confusion in the
> sender, which leads the sender to unnecessarily reduce the CWND or
> REORDERING_THRESHOLD.
>
> wenji
>
--
----------------------------------------------------------------
Sangtae Ha, http://www4.ncsu.edu/~sha2
PhD. Student,
Department of Computer Science,
North Carolina State University, USA
----------------------------------------------------------------
* RE: A Linux TCP SACK Question
2008-04-06 22:43 ` Sangtae Ha
@ 2008-04-07 14:56 ` Wenji Wu
2008-04-08 6:36 ` Ilpo Järvinen
0 siblings, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-07 14:56 UTC (permalink / raw)
To: 'Sangtae Ha'
Cc: 'Ilpo Järvinen', 'John Heffner',
'Netdev'
>I don't think reorderings happen frequently in your directly
>connected networking scenario. Please post your tcpdump file to
>clear up all doubts.
https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/
Two tcpdump files: one with SACK on, the other with SACK off. The test
configuration is described in my previous emails.
Best,
wenji
* RE: A Linux TCP SACK Question
2008-04-07 14:56 ` Wenji Wu
@ 2008-04-08 6:36 ` Ilpo Järvinen
2008-04-08 12:33 ` Wenji Wu
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-08 6:36 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'
On Mon, 7 Apr 2008, Wenji Wu wrote:
> >I don't think reorderings happen frequently in your directly
> >connected networking scenario. Please post your tcpdump file to
> >clear up all doubts.
>
> https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/
>
> Two tcpdump files: one with SACK on, the other with SACK off. The test
> configuration is described in my previous emails.
NewReno never retransmitted anything in them (except at the very end of
the transfer). Probably something related to how tp->reordering behaves
I suppose...
ijjarvin@pointhope:~/linux/debug$ /usr/sbin/tcpdump -n -r nosack | \
    grep "4888[35] >" | cut -d ' ' -f 7- | cut -d ':' -f 1 | \
    awk '{if ($1 < old) {print $1}; old=$1;}'
reading from file nosack, link-type EN10MB (Ethernet)
1
641080641
ijjarvin@pointhope:~/linux/debug$ /usr/sbin/tcpdump -n -r sack | \
    grep "4888[35] >" | cut -d ' ' -f 7- | cut -d ':' -f 1 | \
    awk '{if ($1 < old) {print $1}; old=$1;}'
reading from file sack, link-type EN10MB (Ethernet)
1
7265
10161
141929
175233
196953
446558881
3542223511
This is probably far-fetched, but could you tell us how you make sure that
an earlier connection's metrics are not affecting the later connection?
I.e., that the discovered reordering is not transferred across the flows
(in a CBI-like manner) and thus NewReno has an unfair advantage?
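One way to rule that out, for example, is to disable metrics saving and
flush the cached per-destination metrics between runs:

    sysctl -w net.ipv4.tcp_no_metrics_save=1
    ip route flush cache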
--
i.
* Re: RE: A Linux TCP SACK Question
2008-04-08 6:36 ` Ilpo Järvinen
@ 2008-04-08 12:33 ` Wenji Wu
2008-04-08 13:45 ` Ilpo Järvinen
2008-04-08 15:57 ` John Heffner
2008-04-08 14:07 ` John Heffner
2008-04-14 16:10 ` Wenji Wu
2 siblings, 2 replies; 56+ messages in thread
From: Wenji Wu @ 2008-04-08 12:33 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'
> NewReno never retransmitted anything in them (except at the very end
> of
> the transfer). Probably something related to how tp->reordering behaves
> I suppose...
Yes, the adaptive tp->reordering will play a role here.
> This is probably far-fetched, but could you tell us how you make sure
> that an earlier connection's metrics are not affecting the later
> connection?
>
> I.e., that the discovered reordering is not transferred across the
> flows (in a CBI-like manner) and thus NewReno has an unfair advantage?
You can reverse the order of the tests with the SACK option on/off. The results are still the same.
Also, according to the source code, tp->reordering is initialized to "/proc/sys/net/ipv4/tcp_reordering" (default 3) when a new connection is established. After that, tp->reordering is controlled by the adaptive algorithm.
* Re: RE: A Linux TCP SACK Question
2008-04-08 12:33 ` Wenji Wu
@ 2008-04-08 13:45 ` Ilpo Järvinen
2008-04-08 14:30 ` Wenji Wu
2008-04-08 15:57 ` John Heffner
1 sibling, 1 reply; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-08 13:45 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'
On Tue, 8 Apr 2008, Wenji Wu wrote:
> > NewReno never retransmitted anything in them (except at the very end
> > of
> > the transfer). Probably something related to how tp->reordering behaves
> > I suppose...
>
> Yes, the adaptive tp->reordering will play a role here.
...What is not clear to me is why NewReno does not go into recovery at least
once near the beginning, or at least why it doesn't result in a retransmission.
Which kernel version does this dump come from? 2.6.24 NewReno is crippled
with TSO, as was recently discovered, i.e., it won't mark lost super-skbs
at the head and thus won't retransmit them. The 2.6.25-rcs are also still
broken (though they'll transmit too much; I won't go into detail here);
DaveM now has the fix for the 2.6.25-rcs in net-2.6.
> > This is probably far-fetched, but could you tell us how you make sure
> > that an earlier connection's metrics are not affecting the later
> > connection?
> >
> > I.e., that the discovered reordering is not transferred across the
> > flows (in a CBI-like manner) and thus NewReno has an unfair advantage?
>
> You can reverse the order of the tests with the SACK option on/off.
> The results are still the same.
Ok, I just wanted to make sure that we don't end up tracing some test
setup issue :-).
> Also, according to the source code, tp->reordering is initialized to
> "/proc/sys/net/ipv4/tcp_reordering" (default 3) when a new connection
> is established.
In addition, in tcp_init_metrics():
    if (dst_metric(dst, RTAX_REORDERING) &&
        tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
            tcp_disable_fack(tp);
            tp->reordering = dst_metric(dst, RTAX_REORDERING);
    }
> After that, tp->reordering is controlled by the adaptive algorithm.
Yes; however, the algorithm will be vastly different in those two cases.
The NewReno stuff is in tcp_check_reno_reordering() and in one other place,
tcp_try_undo_partial(), but the latter only happens in recovery, I think.
SACK, on the other hand, has a number of call sites for
tcp_update_reordering; check for yourself.
This might be due to my change which made tcp_check_reno_reordering be
called earlier than it used to be (to remove a transition state during
which sacked_out contained stale info, including some already cumulatively
ACKed segments). I was quite unsure whether I could safely do that. It's
not clear to me how your test could cause sacked_out > packets_out-1 to
occur, though, which is necessary for tcp_update_reordering to get called
with NewReno. The ACK reordering should just make the number of duplicate
ACKs smaller, because part of them get discarded as old ones: a newer
cumulative ACK often arrives a bit "ahead" of its time, making the
remaining smaller-sequenced ACKs very close to no-ops. ...Though I haven't
yet done any awk magic to prove that it won't happen in the non-SACK dump.
--
i.
* Re: A Linux TCP SACK Question
2008-04-08 6:36 ` Ilpo Järvinen
2008-04-08 12:33 ` Wenji Wu
@ 2008-04-08 14:07 ` John Heffner
2008-04-14 16:10 ` Wenji Wu
2 siblings, 0 replies; 56+ messages in thread
From: John Heffner @ 2008-04-08 14:07 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: Wenji Wu, Sangtae Ha, Netdev
On Mon, Apr 7, 2008 at 11:36 PM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
>
> On Mon, 7 Apr 2008, Wenji Wu wrote:
>
> > >I don't think reorderings happen frequently in your directly
> > >connected networking scenario. Please post your tcpdump file to
> > >clear up all doubts.
> >
> > https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/
> >
> > Two tcpdump files: one with SACK on, the other with SACK off. The test
> > configuration is described in my previous emails.
>
> NewReno never retransmitted anything in them (except at the very end of
> the transfer). Probably something related to how tp->reordering behaves
> I suppose...
Yes, this looks very suspicious. Can we see this again with TSO off?
-John
* Re: RE: A Linux TCP SACK Question
2008-04-08 13:45 ` Ilpo Järvinen
@ 2008-04-08 14:30 ` Wenji Wu
2008-04-08 14:59 ` Ilpo Järvinen
0 siblings, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-08 14:30 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'
> > Yes, the adaptive tp->reordering will play a role here.
>
> ...What is not clear to me is why NewReno does not go into recovery at
> least once near the beginning, or at least why it doesn't result in a
> retransmission.
This problem cost me two weeks to debug!
With 3 DupACKs, tcp_ack() calls tcp_fastretrans_alert(), which in turn calls tcp_xmit_retransmit_queue().
Within tcp_xmit_retransmit_queue(), there is code that causes the problem above:

        /* Forward retransmissions are possible only during Recovery. */
1999    if (icsk->icsk_ca_state != TCP_CA_Recovery)
2000            return;
2001
2002    /* No forward retransmissions in Reno are possible. */
2003    if (tcp_is_reno(tp))
2004            return;

If you look at tcp_is_reno(), you see that with SACK off, Reno does not retransmit here; it just returns!!!
I really do not understand why these two lines of code are there!!!
Also, this code is still in 2.6.25.
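For reference, the helpers read roughly like this in 2.6.24 (quoting
include/net/tcp.h from memory, so double-check against your tree):

    static inline int tcp_is_sack(const struct tcp_sock *tp)
    {
            return tp->rx_opt.sack_ok;      /* nonzero when SACK was negotiated */
    }

    static inline int tcp_is_reno(const struct tcp_sock *tp)
    {
            return !tcp_is_sack(tp);        /* SACK off => (New)Reno */
    }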
> Which kernel version does this dump come from? 2.6.24 NewReno is
> crippled with TSO, as was recently discovered, i.e., it won't mark
> lost super-skbs at the head and thus won't retransmit them. The
> 2.6.25-rcs are also still broken (though they'll transmit too much;
> I won't go into detail here); DaveM now has the fix for the
> 2.6.25-rcs in net-2.6.
The dumped file is from 2.6.24. 2.6.25's is similar.
> > You can reverse the order of the tests with the SACK option on/off.
> > The results are still the same.
>
> Ok, I just wanted to make sure that we don't end up tracing some test
> setup issue :-).
>
> > Also, according to the source code, tp->reordering is initialized to
> > "/proc/sys/net/ipv4/tcp_reordering" (default 3) when a new connection
> > is established.
>
> In addition, in tcp_init_metrics():
>
>     if (dst_metric(dst, RTAX_REORDERING) &&
>         tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
>             tcp_disable_fack(tp);
>             tp->reordering = dst_metric(dst, RTAX_REORDERING);
>     }
Good to know this, thanks
wenji
* Re: RE: A Linux TCP SACK Question
2008-04-08 14:30 ` Wenji Wu
@ 2008-04-08 14:59 ` Ilpo Järvinen
2008-04-08 15:27 ` Wenji Wu
2008-04-14 22:47 ` Wenji Wu
0 siblings, 2 replies; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-08 14:59 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'
On Tue, 8 Apr 2008, Wenji Wu wrote:
> With 3 DupACKs, tcp_ack() calls tcp_fastretrans_alert(), which in
> turn calls tcp_xmit_retransmit_queue().
Yeah. It should.
> Within tcp_xmit_retransmit_queue(), there is code that causes the
> problem above:
>
>         /* Forward retransmissions are possible only during Recovery. */
> 1999    if (icsk->icsk_ca_state != TCP_CA_Recovery)
> 2000            return;
> 2001
> 2002    /* No forward retransmissions in Reno are possible. */
> 2003    if (tcp_is_reno(tp))
> 2004            return;
>
> If you look at tcp_is_reno(), you see that with SACK off, Reno does
> not retransmit here; it just returns!!!
Your analysis is missing something important here: there are two loops in
that function :-). One, for retransmitting segments assumed lost, is above
those lines you quoted! The other, below them, is for segments not marked
lost, similar to what is specified by RFC3517's Rule 3 for NextSeg(), which
definitely won't apply to NewReno and shouldn't be executed.
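Heavily condensed (bodies elided; a sketch, not the verbatim source), the
function is shaped like this:

    void tcp_xmit_retransmit_queue(struct sock *sk)
    {
            ...
            if (tp->lost_out) {
                    /* Loop 1: retransmit segments already marked
                     * TCPCB_LOST. This is what recovers real losses,
                     * and it runs for both Reno and SACK flows. */
                    ...
            }

            /* Forward retransmissions are possible only during Recovery. */
            if (icsk->icsk_ca_state != TCP_CA_Recovery)
                    return;

            /* No forward retransmissions in Reno are possible. */
            if (tcp_is_reno(tp))
                    return;

            /* Loop 2: "forward" retransmissions of segments not marked
             * lost, filling holes between SACKed blocks (roughly RFC3517
             * NextSeg() Rule 3); meaningless without SACK info, hence
             * the Reno early return above. */
            ...
    }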
> I really do not understand why these two lines of code are there!!!
>
> Also, this code is still in 2.6.25.
Sure, but there's nothing wrong with them! 2.6.24 is just currently broken
if you have TSO+NewReno, because it won't do the correct lost marking, which
is a necessary preparation step for that first loop. Too bad, as I only
figured that out one or two days ago, so there's no fix available yet :-).
> > Which kernel version does this dump come from? 2.6.24 NewReno is
> > crippled with TSO, as was recently discovered, i.e., it won't mark
> > lost super-skbs at the head and thus won't retransmit them. The
> > 2.6.25-rcs are also still broken (though they'll transmit too much;
> > I won't go into detail here); DaveM now has the fix for the
> > 2.6.25-rcs in net-2.6.
>
> The dumped file is from 2.6.24. 2.6.25's is similar.
It's a bit hard for me to believe, considering what the last weeks'
debugging has revealed about its internals. Have you checked it from the
dumps or from the overall results? A similarity in the latter could be due
to other factors related to the differences in reordering detection between
NewReno and SACK.
> > In addition, in tcp_init_metrics():
> >
> >     if (dst_metric(dst, RTAX_REORDERING) &&
> >         tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
> >             tcp_disable_fack(tp);
> >             tp->reordering = dst_metric(dst, RTAX_REORDERING);
> >     }
>
> Good to know this, thanks
...There might be some bug which causes it to get skipped under some
circumstances, though (which I haven't yet remembered to fix). I don't
remember too well anymore; probably some goto causes most of what's in
there to be skipped.
--
i.
* Re: RE: A Linux TCP SACK Question
2008-04-08 14:59 ` Ilpo Järvinen
@ 2008-04-08 15:27 ` Wenji Wu
2008-04-08 17:26 ` Ilpo Järvinen
2008-04-14 22:47 ` Wenji Wu
1 sibling, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-08 15:27 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'
> It's a bit hard for me to believe, considering what the last weeks'
> debugging has revealed about its internals. Have you checked it from
> the dumps or from the overall results? A similarity in the latter
> could be due to other factors related to the differences in
> reordering detection between NewReno and SACK.
>
> ...There might be some bug which causes it to get skipped under some
> circumstances, though (which I haven't yet remembered to fix). I don't
> remember too well anymore; probably some goto causes most of what's in
> there to be skipped.
>
I'll get back to you later and post the tcpdump file for 2.6.25.
wenji
* Re: RE: A Linux TCP SACK Question
2008-04-08 12:33 ` Wenji Wu
2008-04-08 13:45 ` Ilpo Järvinen
@ 2008-04-08 15:57 ` John Heffner
1 sibling, 0 replies; 56+ messages in thread
From: John Heffner @ 2008-04-08 15:57 UTC (permalink / raw)
To: Wenji Wu; +Cc: Ilpo Järvinen, Sangtae Ha, Netdev
On Tue, Apr 8, 2008 at 5:33 AM, Wenji Wu <wenji@fnal.gov> wrote:
> > NewReno never retransmitted anything in them (except at the very end
> > of
> > the transfer). Probably something related to how tp->reordering behaves
> > I suppose...
>
> Yes, the adaptive tp->reordering will play a role here.
I remember several years ago when I first looked at chronic reordering
with a high BDP, the problem I had was that:
1) Only acks of new data can advance cwnd, and these only advance by
the normal amount per ack, so cwnd grows very slowly.
2) Reordering caused slow start to exit early, before the reordering
threshold had adapted
3) The "undo" code didn't work well because of cwnd moderation
4) There were bugs in the reordering calculation that caused the
threshold to be pulled back
Some of these shouldn't matter to you because your RTT is low, but I
thought it would be worth mentioning. I'm not sure what is keeping
your cwnd from growing -- it always seems to be within a small range
in both cases, which is not right unless there's a bottleneck at the
sender. The fact that Reno does a little better than SACK seems like the
less important problem.
Also, what's the behavior when turning off reordering, in each or both
directions?
-John
* Re: RE: A Linux TCP SACK Question
2008-04-08 15:27 ` Wenji Wu
@ 2008-04-08 17:26 ` Ilpo Järvinen
0 siblings, 0 replies; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-08 17:26 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'
On Tue, 8 Apr 2008, Wenji Wu wrote:
>
> > It's a bit hard for me to believe, considering what the last weeks'
> > debugging has revealed about its internals. Have you checked it from
> > the dumps or from the overall results? A similarity in the latter
> > could be due to other factors related to the differences in
> > reordering detection between NewReno and SACK.
>
> I'll get back to you later and post the tcpdump file for 2.6.25.
Please, if possible, use a kernel version where the TCP fixes I had applied
today are in; at least DaveM's net-2.6 already has them. I didn't check
whether Linus has pulled them yet.
--
i.
* RE: A Linux TCP SACK Question
2008-04-08 6:36 ` Ilpo Järvinen
2008-04-08 12:33 ` Wenji Wu
2008-04-08 14:07 ` John Heffner
@ 2008-04-14 16:10 ` Wenji Wu
2008-04-14 16:48 ` Ilpo Järvinen
2 siblings, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-14 16:10 UTC (permalink / raw)
To: 'Ilpo Järvinen'
Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'
Hi, Ilpo,
The latest results have been posted to:
https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/
The kernel under test is Linux 2.6.25-rc9. I have checked its
changelog, which shows that your latest fix is included.
In the tests, I vary tcp_frto (0, 1, and 2) with SACK on/off.
The experiment is set up as follows:
Sender --- Router --- Receiver
Iperf is sending from the sender to the receiver. In between there is an
emulated router which runs netem. The emulated router has two interfaces,
both with netem configured. One interface emulates the forward path and the
other the reverse path. Both netem interfaces are configured with 1.5ms
delay and 0.15ms variance. No packet drops during the tests or packet
capturing.
All of these systems are multi-core platforms with 2GHz+ CPUs. I ran
top to verify; the CPUs are idle most of the time.
wenji
* RE: A Linux TCP SACK Question
2008-04-14 16:10 ` Wenji Wu
@ 2008-04-14 16:48 ` Ilpo Järvinen
2008-04-14 22:07 ` Wenji Wu
0 siblings, 1 reply; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-14 16:48 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'
On Mon, 14 Apr 2008, Wenji Wu wrote:
> The latest results have been posted to:
>
> https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/
>
> The kernel under test is Linux 2.6.25-rc9. I have checked its
> changelog, which shows that your latest fix is included.
Hmm, now there are even fewer retransmissions (barely any; some with
SACK at the very end).
I suppose the reordering detection is good enough to kill them. ...You
could perhaps figure that out from the MIBs if you wanted to.
> In the tests, I vary tcp_frto (0, 1, and 2) with SACK on/off.
...I should have said more clearly last time that these are not
significant with your workload.
> The experiment is set up as follows:
>
> Sender --- Router --- Receiver
>
> Iperf is sending from the sender to the receiver. In between there is an
> emulated router which runs netem. The emulated router has two interfaces,
> both with netem configured. One interface emulates the forward path and the
> other the reverse path. Both netem interfaces are configured with 1.5ms
> delay and 0.15ms variance. No packet drops during the tests or packet
> capturing.
...How about this theory:
Forward path reordering causes duplicate ACKs due to old segments. These
are treated differently by NewReno and SACK (see the helpers quoted below):
NewReno => Sends new data out (limited transmit; it's not limited to two
           segments in Linux as per the RFC; however, the RFC doesn't
           consider autotuning of DupThresh either).
SACK    => No new SACK block is discovered. Packets in flight remains the
           same, and thus no new segment is sent.
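For reference, in-flight is computed from these helpers (quoting the
2.6.25-era include/net/tcp.h from memory; double-check your tree):

    static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
    {
            /* segments assumed to have left the network */
            return tp->sacked_out + tp->lost_out;
    }

    static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp)
    {
            return tp->packets_out - tcp_left_out(tp) + tp->retrans_out;
    }

So a dupack that bumps sacked_out lowers in-flight by one and frees a slot
for a new segment, while a SACK block that only repeats known information
does not.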
...What do others think?
I guess it should be visible with forward path reordering alone, though the
added distance with reverse path reordering might act as an amplifier,
because NewReno benefits from shorter-RTTed packets when a forward path
old segment arrives, while SACK loses its ability to increase
outstanding data...
...From a quick look with tcptrace's outstanding data plot, it seems
that NewReno levels off around ~100000 and SACK around ~68000.
...I think SACK just knows too much? :-/
> All of these systems are multi-core platforms with 2GHz+ CPUs. I ran
> top to verify; the CPUs are idle most of the time.
Thanks for adding this for the others. I agree with you that this is not
a CPU horsepower issue.
--
i.
* Re: RE: A Linux TCP SACK Question
2008-04-14 16:48 ` Ilpo Järvinen
@ 2008-04-14 22:07 ` Wenji Wu
2008-04-15 8:23 ` Ilpo Järvinen
0 siblings, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-14 22:07 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: 'Netdev'
> Hmm, now there are even fewer retransmissions (barely any; some with
> SACK at the very end).
>
> I suppose the reordering detection is good enough to kill them. ...You
> could perhaps figure that out from the MIBs if you wanted to.
>
Yes, web100 shows that tcp_reordering can be as large as 127.
I just reran the following experiments to show why there are so few retransmissions in my previous posts.
(1) Flush the system routing cache by running "ip route flush cache" before running and tcpdumping the traffic.
(2) Before running and tcpdumping the traffic, run a data transmission test to generate tcp_reordering in the routing cache.
Do not flush the routing cache. Then run and tcpdump the traffic.
Both experiments were run with SACK off.
The results are posted at
https://plone3.fnal.gov/P0/WAN/Members/wenji/adaptive_tcp_reordering/
So, the few retransmissions in my previous post really were caused by the routing cache.
But flushing the cache has nothing to do with SACK on/off. The throughput with SACK off is still better than with SACK on.
wenji
* RE: RE: A Linux TCP SACK Question
2008-04-08 14:59 ` Ilpo Järvinen
2008-04-08 15:27 ` Wenji Wu
@ 2008-04-14 22:47 ` Wenji Wu
2008-04-15 0:48 ` John Heffner
1 sibling, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-14 22:47 UTC (permalink / raw)
To: 'Ilpo Järvinen'; +Cc: 'Netdev'
Hi, Ilpo,
Could the throughput difference with SACK ON/OFF be due to the following
code in tcp_ack()?
3120    if (tcp_ack_is_dubious(sk, flag)) {
3121            /* Advance CWND, if state allows this. */
3122            if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
3123                tcp_may_raise_cwnd(sk, flag))
3124                    tcp_cong_avoid(sk, ack, prior_in_flight, 0);
3125            tcp_fastretrans_alert(sk, prior_packets - tp->packets_out, flag);
3126    } else {
3127            if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
3128                    tcp_cong_avoid(sk, ack, prior_in_flight, 1);
3129    }
In my tests, there are actually no packet drops, just severe packet
reordering in both forward and reverse paths. With good tcp_reordering
auto-tuning, there are few retransmissions.
(1) With the SACK option off, the reordered ACKs will not cause much harm
to the throughput. As you have pointed out in your email, "The ACK
reordering should just make the number of duplicate ACKs smaller, because
part of them get discarded as old ones: a newer cumulative ACK often
arrives a bit 'ahead' of its time, making the remaining smaller-sequenced
ACKs very close to no-ops."
If there is any ACK advancement, tcp_cong_avoid() will be called.
(2) With the SACK option on: if the ACKs do not advance the left edge of
the window, those ACKs go to "old_ack" in tcp_ack(), with not much
processing except SACK-tagging the corresponding packets in the
retransmission queue. tcp_cong_avoid() will not be called.
However, if the ACKs advance the left edge of the window and these ACKs
include SACK options, tcp_ack_is_dubious(sk, flag) would be true. Then the
call to tcp_cong_avoid() needs to satisfy the if-condition at line 3122,
which is stricter than the if-condition at line 3127.
So, the congestion window with SACK on would be smaller than with SACK off.
If you run tcptrace and xplot on the files I posted, you would see that lots
of ACKs advance the left edge of the window and include SACK blocks.
I'm not quite sure; it's just a guess.
wenji
* Re: RE: A Linux TCP SACK Question
2008-04-14 22:47 ` Wenji Wu
@ 2008-04-15 0:48 ` John Heffner
2008-04-15 8:25 ` Ilpo Järvinen
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: John Heffner @ 2008-04-15 0:48 UTC (permalink / raw)
To: wenji; +Cc: Ilpo Järvinen, Netdev
On Mon, Apr 14, 2008 at 3:47 PM, Wenji Wu <wenji@fnal.gov> wrote:
> Hi, Ilpo,
>
> Could the throughput difference with SACK ON/OFF be due to the following
> code in tcp_ack()?
>
> 3120    if (tcp_ack_is_dubious(sk, flag)) {
> 3121            /* Advance CWND, if state allows this. */
> 3122            if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
> 3123                tcp_may_raise_cwnd(sk, flag))
> 3124                    tcp_cong_avoid(sk, ack, prior_in_flight, 0);
> 3125            tcp_fastretrans_alert(sk, prior_packets - tp->packets_out, flag);
> 3126    } else {
> 3127            if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
> 3128                    tcp_cong_avoid(sk, ack, prior_in_flight, 1);
> 3129    }
>
> In my tests, there are actually no packet drops, just severe packet
> reordering in both forward and reverse paths. With good tcp_reordering
> auto-tuning, there are few retransmissions.
>
> (1) With the SACK option off, the reordered ACKs will not cause much harm
> to the throughput. As you have pointed out in your email, "The ACK
> reordering should just make the number of duplicate ACKs smaller, because
> part of them get discarded as old ones: a newer cumulative ACK often
> arrives a bit 'ahead' of its time, making the remaining smaller-sequenced
> ACKs very close to no-ops."
>
> If there is any ACK advancement, tcp_cong_avoid() will be called.
>
> (2) With the SACK option on: if the ACKs do not advance the left edge of
> the window, those ACKs go to "old_ack" in tcp_ack(), with not much
> processing except SACK-tagging the corresponding packets in the
> retransmission queue. tcp_cong_avoid() will not be called.
>
> However, if the ACKs advance the left edge of the window and these ACKs
> include SACK options, tcp_ack_is_dubious(sk, flag) would be true. Then the
> call to tcp_cong_avoid() needs to satisfy the if-condition at line 3122,
> which is stricter than the if-condition at line 3127.
>
> So, the congestion window with SACK on would be smaller than with SACK off.
>
> If you run tcptrace and xplot on the files I posted, you would see that
> lots of ACKs advance the left edge of the window and include SACK blocks.
>
> I'm not quite sure; it's just a guess.
I had considered this, but it would seem that tcp_may_raise_cwnd() in
this case *should* return true, right?
Still the mystery remains as to why *both* are going so slowly. You
mentioned you're using a web100 kernel. What are the final values of
all the variables for the connections (grab with readall)?
Thanks,
-John
* Re: RE: A Linux TCP SACK Question
2008-04-14 22:07 ` Wenji Wu
@ 2008-04-15 8:23 ` Ilpo Järvinen
0 siblings, 0 replies; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-15 8:23 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'Netdev'
On Mon, 14 Apr 2008, Wenji Wu wrote:
>
> > Hmm, now there are even fewer retransmissions (barely any; some with
> > SACK at the very end).
> >
> > I suppose the reordering detection is good enough to kill them.
> > ...You could perhaps figure that out from the MIBs if you wanted to.
>
> Yes, web100 shows that tcp_reordering can be as large as 127.
It should get large, though I suspect NewReno's new value
(tp->packets_out + addend) might be too large by tp->packets_out.
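From memory, the NewReno update is roughly the following; the exact details
may differ between 2.6.24 and the 2.6.25-rcs, so check your tree:

    static void tcp_check_reno_reordering(struct sock *sk, const int addend)
    {
            struct tcp_sock *tp = tcp_sk(sk);
            u32 holes;

            holes = max(tp->lost_out, 1U);
            holes = min(holes, tp->packets_out);

            if ((tp->sacked_out + holes) > tp->packets_out) {
                    tp->sacked_out = tp->packets_out - holes;
                    /* reordering estimate jumps to packets_out + addend */
                    tcp_update_reordering(sk, tp->packets_out + addend, 0);
            }
    }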
> I just reran the following experiments to show why there are so few
> retransmissions in my previous posts.
>
> (1) Flush the system routing cache by running "ip route flush cache"
> before running and tcpdumping the traffic.
I didn't know that works; the tcp_no_metrics_save sysctl seems to prevent
saving them from a running TCP flow when the flow ends.
> (2) Before running and tcpdumping the traffic, run a data transmission
> test to generate tcp_reordering in the routing cache. Do not flush the
> routing cache. Then run and tcpdump the traffic.
>
> Both experiments were run with SACK off.
>
> The results are posted at
> https://plone3.fnal.gov/P0/WAN/Members/wenji/adaptive_tcp_reordering/
>
> So, the few retransmissions in my previous post really were caused by the
> routing cache.
Yes. Remember, however, that the initial metrics also have an effect on the
initial ssthresh, so one must be very careful not to cause unfairness
through them if the metrics are not cleared.
> But flushing the cache has nothing to do with SACK on/off. The
> throughput with SACK off is still better than with SACK on.
Yes, I think it alone would never explain it. Though a difference in
initial ssthresh might have been the explanation for the different levels
where outstanding data settled in the logs without any retransmissions.
--
i.
* Re: RE: A Linux TCP SACK Question
2008-04-15 0:48 ` John Heffner
@ 2008-04-15 8:25 ` Ilpo Järvinen
2008-04-15 18:01 ` Wenji Wu
2008-04-15 15:45 ` Wenji Wu
2008-04-15 16:39 ` Wenji Wu
2 siblings, 1 reply; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-15 8:25 UTC (permalink / raw)
To: John Heffner, wenji; +Cc: Netdev
On Mon, 14 Apr 2008, John Heffner wrote:
> On Mon, Apr 14, 2008 at 3:47 PM, Wenji Wu <wenji@fnal.gov> wrote:
> >
> > Could the throughput difference with SACK ON/OFF be due to the following
> > code in tcp_ack()?
> >
> > 3120    if (tcp_ack_is_dubious(sk, flag)) {
> > 3121            /* Advance CWND, if state allows this. */
> > 3122            if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
> > 3123                tcp_may_raise_cwnd(sk, flag))
> > 3124                    tcp_cong_avoid(sk, ack, prior_in_flight, 0);
> > 3125            tcp_fastretrans_alert(sk, prior_packets - tp->packets_out, flag);
> > 3126    } else {
> > 3127            if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
> > 3128                    tcp_cong_avoid(sk, ack, prior_in_flight, 1);
> > 3129    }
> >
> > In my tests, there are actually no packet drops, just severe packet
> > reordering in both forward and reverse paths. With good tcp_reordering
> > auto-tuning, there are few retransmissions.
> >
> > (1) With the SACK option off, the reordered ACKs will not cause much harm
> > to the throughput. As you have pointed out in your email, "The ACK
> > reordering should just make the number of duplicate ACKs smaller, because
> > part of them get discarded as old ones: a newer cumulative ACK often
> > arrives a bit 'ahead' of its time, making the remaining smaller-sequenced
> > ACKs very close to no-ops."
...Please note that these are considered old ACKs, so we do goto old_ack,
which is the same for both SACK and NewReno. ...So it won't make any
difference between them.
> > If there is any ACK advancement, tcp_cong_avoid() will be called.
The NewReno case analysis is not exactly what you assume: if there was at
least one duplicate ACK already, the ca_state will be CA_Disorder for
NewReno, which makes ack_is_dubious true. You probably assumed it goes
directly to the other branch?
> > (2) With the SACK option on: if the ACKs do not advance the left edge of
> > the window, those ACKs go to "old_ack" in tcp_ack(), with not much
> > processing except SACK-tagging the corresponding packets in the
> > retransmission queue. tcp_cong_avoid() will not be called.
No, this is not right. The old_ack case happens only if the left edge
backtracks, in which case we obviously should discard it as stale
information (except that the SACK blocks may reveal something not yet
known, which is why sacktag is called there). This applies regardless of
SACK (no tagging without it, of course).
...Hmm, there's one questionable part here in the code (I doubt it makes
any difference here, though). If new SACK info is discovered, we don't
retransmit but send new data (if the window allows), even when in recovery,
where TCP should retransmit first.
> > However, if the ACKs advance the left edge of the window and these ACKs
> > include SACK options, tcp_ack_is_dubious(sk, flag) would be true. Then the
> > call to tcp_cong_avoid() needs to satisfy the if-condition at line 3122,
> > which is stricter than the if-condition at line 3127.
> >
> > So, the congestion window with SACK on would be smaller than with SACK off.
I think you might have found a bug, though it won't affect you; it actually
makes that check easier to pass: the questionable thing is the || in
tcp_may_raise_cwnd (it might not be intentional)...
But in your case, during the initial slow start, that condition in
tcp_may_raise_cwnd will always be true (if your metrics are cleared as they
should be), because it evaluates as (...not important || 1) && 1, since
cwnd < ssthresh. After that, when you have no ECE and are not in recovery,
it evaluates as (1 || ...not calculated) && 1, so it should always allow
the increment in your case except when in recovery, which hardly makes up
for the difference you're seeing...
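For reference, the check reads like this (quoting the 2.6.24/25-era
net/ipv4/tcp_input.c from memory; the || under discussion is in the first
half):

    static inline int tcp_may_raise_cwnd(const struct sock *sk, const int flag)
    {
            const struct tcp_sock *tp = tcp_sk(sk);
            return (!(flag & FLAG_ECE) || tp->snd_cwnd < tp->snd_ssthresh) &&
                    !((1 << inet_csk(sk)->icsk_ca_state) &
                      (TCPF_CA_Recovery | TCPF_CA_CWR));
    }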
> > If you run tcptrace and xplot on the files I posted, you would see that
> > lots of ACKs advance the left edge of the window and include SACK blocks.
This would only make a difference if any of those SACK blocks were new. If
they're not, FLAG_DATA_SACKED won't be set in flag.
> > I'm not quite sure; it's just a guess.
You seem to be missing the third case, which I tried to point out earlier:
the case where the left edge remains the same. I think it makes a huge
difference here (I'll analyse the non-recovery case):
NewReno always goes to fastretrans_alert, to the default branch, and
because it is a dupack, it increments sacked_out through tcp_add_reno_sack.
Effectively, packets_in_flight is reduced by one and TCP is able to send
a new segment out.
Now with SACK there are two cases:
SACK with newly discovered SACK info (for simplicity, let's assume just one
newly discovered sacked segment): sacktag marks that segment and increments
sacked_out, effectively making packets_in_flight equal to the NewReno case.
It goes to fastretrans_alert and makes all the same maneuvers as NewReno
(except if enough SACK blocks have arrived to trigger recovery while
NewReno would not yet have enough dupACKs collected; I doubt that this
makes the difference, though. I'd need no-metrics logs to verify the number
of recoveries to confirm that they're quite few).
SACK with no new SACK info: sacktag won't find anything to mark, thus
sacked_out remains the same. It goes to fastretrans_alert because ca_state
is CA_Disorder. But now we did lose one segment compared with NewReno,
because we didn't increment sacked_out, making packets_in_flight stay at
the amount it was before. Thus we cannot send a new data segment out, and
we fall behind NewReno.
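The NewReno dupack accounting mentioned above is, from memory, just:

    /* Emulate SACK accounting for a SACKless connection: one dupack
     * suggests one segment has left the network. */
    static void tcp_add_reno_sack(struct sock *sk)
    {
            struct tcp_sock *tp = tcp_sk(sk);

            tp->sacked_out++;
            tcp_check_reno_reordering(sk, 0);
            tcp_verify_left_out(tp);
    }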
> I had considered this, but it would seem that tcp_may_raise_cwnd() in
> this case *should* return true, right?
Yes, it seems. Though I think that it's unintentional. I'd say that that
|| should be && but I might be wrong.
> Still the mystery remains as to why *both* are going so slowly. You
> mentioned you're using a web100 kernel. What are the final values of
> all the variables for the connections (grab with readall)?
...I think that due to reordering, one will lose part of the cwnd
increments because of old ACKs, as they won't allow you to add more
segments to the network; at some point the lossage will be large enough
to stall the growth of the cwnd (if in congestion avoidance with its small
increment). With slow start it is not so self-evident that such a level
exists, though it might.
--
i.
* Re: RE: A Linux TCP SACK Question
2008-04-15 0:48 ` John Heffner
2008-04-15 8:25 ` Ilpo Järvinen
@ 2008-04-15 15:45 ` Wenji Wu
2008-04-15 16:39 ` Wenji Wu
2 siblings, 0 replies; 56+ messages in thread
From: Wenji Wu @ 2008-04-15 15:45 UTC (permalink / raw)
To: John Heffner; +Cc: Ilpo Järvinen, Netdev
> Still the mystery remains as to why *both* are going so slowly. You
> mentioned you're using a web100 kernel. What are the final values of
> all the variables for the connections (grab with readall)?
Kernel 2.6.24,
"echo 1 > /proc/sys/net/ipv4/tcp_no_metrics_save"
With SACK off:
Throughput: 256Mbps
Connection 6 (198.2.1.2 38054 131.225.2.16 5001)
State 1
SACKEnabled 0
TimestampsEnabled 1
NagleEnabled 1
ECNEnabled 0
SndWinScale 11
RcvWinScale 7
ActiveOpen 1
MSSRcvd 0
WinScaleRcvd 11
WinScaleSent 7
PktsOut 221715
DataPktsOut 221715
DataBytesOut 324429992
PktsIn 215245
DataPktsIn 0
DataBytesIn 0
SndUna 2784091744
SndNxt 2784091744
SndMax 2784091744
ThruBytesAcked 321011738
SndISS 2463080006
RcvNxt 1309516114
ThruBytesReceived 0
RecvISS 1309516114
StartTimeSec 1208273537
StartTimeUsec 293029
Duration 14594853
SndLimTransSender 6
SndLimBytesSender 23960
SndLimTimeSender 4137
SndLimTransCwnd 5
SndLimBytesCwnd 324406032
SndLimTimeCwnd 10046308
SndLimTransRwin 0
SndLimBytesRwin 0
SndLimTimeRwin 0
SlowStart 0
CongAvoid 0
CongestionSignals 4
OtherReductions 13167
X_OtherReductionsCV 0
X_OtherReductionsCM 13167
CongestionOverCount 54
CurCwnd 4344
MaxCwnd 173760
CurSsthresh 94894680
LimCwnd 4294965848
MaxSsthresh 94894680
MinSsthresh 4344
FastRetran 4
Timeouts 0
SubsequentTimeouts 0
CurTimeoutCount 0
AbruptTimeouts 0
PktsRetrans 17
BytesRetrans 24616
DupAcksIn 59556
SACKsRcvd 0
SACKBlocksRcvd 0
PreCongSumCwnd 375032
PreCongSumRTT 12
PostCongSumRTT 15
PostCongCountRTT 4
ECERcvd 0
SendStall 0
QuenchRcvd 0
RetranThresh 29
NonRecovDA 0
AckAfterFR 0
DSACKDups 0
SampleRTT 3
SmoothedRTT 3
RTTVar 50
MaxRTT 46
MinRTT 2
SumRTT 158191
CountRTT 47830
CurRTO 203
MaxRTO 237
MinRTO 203
CurMSS 1448
MaxMSS 1448
MinMSS 524
X_Sndbuf 1919232
X_Rcvbuf 87380
CurRetxQueue 0
MaxRetxQueue 0
CurAppWQueue 1786832
MaxAppWQueue 1886744
CurRwinSent 5888
MaxRwinSent 5888
MinRwinSent 5840
LimRwin 0
DupAcksOut 0
CurReasmQueue 0
MaxReasmQueue 0
CurAppRQueue 0
MaxAppRQueue 0
X_rcv_ssthresh 5840
X_wnd_clamp 64087
X_dbg1 5888
X_dbg2 536
X_dbg3 5840
X_dbg4 0
CurRwinRcvd 3137536
MaxRwinRcvd 3137536
MinRwinRcvd 17896
LocalAddressType 1
LocalAddress 198.2.1.2
LocalPort 38054
RemAddress 131.225.2.16
RemPort 5001
X_RcvRTT 0
...............................................................
With SACK on:
Throughput: 178Mbps
Connection 3 (131.225.2.22 22 131.225.82.152 52973)
State 5
SACKEnabled 3
TimestampsEnabled 1
NagleEnabled 0
ECNEnabled 0
SndWinScale 11
RcvWinScale 7
ActiveOpen 0
MSSRcvd 0
WinScaleRcvd 11
WinScaleSent 7
PktsOut 230
DataPktsOut 230
DataBytesOut 25783
PktsIn 353
DataPktsIn 164
DataBytesIn 11120
SndUna 2809669838
SndNxt 2809669838
SndMax 2809669838
ThruBytesAcked 18423
SndISS 2809651415
RcvNxt 2817947310
ThruBytesReceived 11120
RecvISS 2817936190
StartTimeSec 1208271915
StartTimeUsec 71844
Duration 2362591841
SndLimTransSender 6
SndLimBytesSender 25783
SndLimTimeSender 2273927770
SndLimTransCwnd 5
SndLimBytesCwnd 0
SndLimTimeCwnd 1047
SndLimTransRwin 0
SndLimBytesRwin 0
SndLimTimeRwin 0
SlowStart 0
CongAvoid 0
CongestionSignals 0
OtherReductions 0
X_OtherReductionsCV 0
X_OtherReductionsCM 0
CongestionOverCount 0
CurCwnd 5792
MaxCwnd 13032
CurSsthresh 4294966376
LimCwnd 4294965848
MaxSsthresh 0
MinSsthresh 4294967295
FastRetran 0
Timeouts 0
SubsequentTimeouts 0
CurTimeoutCount 0
AbruptTimeouts 0
PktsRetrans 0
BytesRetrans 0
DupAcksIn 0
SACKsRcvd 0
SACKBlocksRcvd 0
PreCongSumCwnd 0
PreCongSumRTT 0
PostCongSumRTT 0
PostCongCountRTT 0
ECERcvd 0
SendStall 0
QuenchRcvd 0
RetranThresh 3
NonRecovDA 0
AckAfterFR 0
DSACKDups 0
SampleRTT 0
SmoothedRTT 3
RTTVar 50
MaxRTT 40
MinRTT 0
SumRTT 1269
CountRTT 221
CurRTO 203
MaxRTO 234
MinRTO 201
CurMSS 1448
MaxMSS 1448
MinMSS 1428
X_Sndbuf 16384
X_Rcvbuf 87380
CurRetxQueue 0
MaxRetxQueue 0
CurAppWQueue 0
MaxAppWQueue 0
CurRwinSent 14208
MaxRwinSent 14208
MinRwinSent 5792
LimRwin 8365440
DupAcksOut 0
CurReasmQueue 0
MaxReasmQueue 0
CurAppRQueue 0
MaxAppRQueue 1152
X_rcv_ssthresh 14144
X_wnd_clamp 64087
X_dbg1 14208
X_dbg2 1152
X_dbg3 14144
X_dbg4 0
CurRwinRcvd 3749888
MaxRwinRcvd 3749888
MinRwinRcvd 3747840
LocalAddressType 1
LocalAddress 131.225.2.22
LocalPort 22
RemAddress 131.225.82.152
RemPort 52973
X_RcvRTT 405000
..................................................................
wenji
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question
2008-04-15 0:48 ` John Heffner
2008-04-15 8:25 ` Ilpo Järvinen
2008-04-15 15:45 ` Wenji Wu
@ 2008-04-15 16:39 ` Wenji Wu
2008-04-15 17:01 ` John Heffner
2 siblings, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-15 16:39 UTC (permalink / raw)
To: John Heffner; +Cc: Ilpo Järvinen, Netdev
My fault, resent.
> Still the mystery remains as to why *both* are going so slowly. You
> mentioned you're using a web100 kernel. What are the final values of
> all the variables for the connections (grab with readall)?
kernel 2.6.24
"echo 1 > /proc/sys/net/ipv4/tcp_no_metric_save"
..............................................................................................
With SACK on, throughput: 179Mbps
Connection 4 (198.2.1.2 56648 131.225.2.16 5001)
State 1
SACKEnabled 3
TimestampsEnabled 1
NagleEnabled 1
ECNEnabled 0
SndWinScale 11
RcvWinScale 7
ActiveOpen 1
MSSRcvd 0
WinScaleRcvd 11
WinScaleSent 7
PktsOut 154770
DataPktsOut 154770
DataBytesOut 226294264
PktsIn 149398
DataPktsIn 0
DataBytesIn 0
SndUna 930060039
SndNxt 930060039
SndMax 930060039
ThruBytesAcked 224092186
SndISS 705967853
RcvNxt 4282199280
ThruBytesReceived 0
RecvISS 4282199280
StartTimeSec 1208277286
StartTimeUsec 813964
Duration 13984145
SndLimTransSender 3
SndLimBytesSender 7208
SndLimTimeSender 3107
SndLimTransCwnd 2
SndLimBytesCwnd 226287056
SndLimTimeCwnd 10003734
SndLimTransRwin 0
SndLimBytesRwin 0
SndLimTimeRwin 0
SlowStart 0
CongAvoid 0
CongestionSignals 2
OtherReductions 19402
X_OtherReductionsCV 0
X_OtherReductionsCM 19402
CongestionOverCount 13
CurCwnd 4344
MaxCwnd 102808
CurSsthresh 94894680
LimCwnd 4294965848
MaxSsthresh 94894680
MinSsthresh 7240
FastRetran 2
Timeouts 0
SubsequentTimeouts 0
CurTimeoutCount 0
AbruptTimeouts 0
PktsRetrans 7
BytesRetrans 10136
DupAcksIn 41940
SACKsRcvd 118692
SACKBlocksRcvd 189919
PreCongSumCwnd 91224
PreCongSumRTT 6
PostCongSumRTT 7
PostCongCountRTT 2
ECERcvd 0
SendStall 0
QuenchRcvd 0
RetranThresh 30
NonRecovDA 0
AckAfterFR 0
DSACKDups 0
SampleRTT 3
SmoothedRTT 3
RTTVar 50
MaxRTT 4
MinRTT 2
SumRTT 142655
CountRTT 43932
CurRTO 203
MaxRTO 204
MinRTO 203
CurMSS 1448
MaxMSS 1448
MinMSS 524
X_Sndbuf 206976
X_Rcvbuf 87380
CurRetxQueue 0
MaxRetxQueue 0
CurAppWQueue 130320
MaxAppWQueue 237472
CurRwinSent 5888
MaxRwinSent 5888
MinRwinSent 5840
LimRwin 0
DupAcksOut 0
CurReasmQueue 0
MaxReasmQueue 0
CurAppRQueue 0
MaxAppRQueue 0
X_rcv_ssthresh 5840
X_wnd_clamp 64087
X_dbg1 5888
X_dbg2 536
X_dbg3 5840
X_dbg4 0
CurRwinRcvd 3137536
MaxRwinRcvd 3137536
MinRwinRcvd 17896
LocalAddressType 1
LocalAddress 198.2.1.2
LocalPort 56648
RemAddress 131.225.2.16
RemPort 5001
X_RcvRTT 0
[root@gw004 ipv4]#
..................................................................
With SACK off:
Throughput: 258Mbps
Connection 5 (198.2.1.2 43578 131.225.2.16 5001)
State 1
SACKEnabled 0
TimestampsEnabled 1
NagleEnabled 1
ECNEnabled 0
SndWinScale 11
RcvWinScale 7
ActiveOpen 1
MSSRcvd 0
WinScaleRcvd 11
WinScaleSent 7
PktsOut 223011
DataPktsOut 223011
DataBytesOut 326318584
PktsIn 216404
DataPktsIn 0
DataBytesIn 0
SndUna 4002973902
SndNxt 4002973902
SndMax 4002973902
ThruBytesAcked 322904090
SndISS 3680069812
RcvNxt 2942495629
ThruBytesReceived 0
RecvISS 2942495629
StartTimeSec 1208277475
StartTimeUsec 779859
Duration 18149747
SndLimTransSender 4
SndLimBytesSender 10456
SndLimTimeSender 3787
SndLimTransCwnd 3
SndLimBytesCwnd 326308128
SndLimTimeCwnd 10006059
SndLimTransRwin 0
SndLimBytesRwin 0
SndLimTimeRwin 0
SlowStart 0
CongAvoid 0
CongestionSignals 3
OtherReductions 13166
X_OtherReductionsCV 0
X_OtherReductionsCM 13166
CongestionOverCount 37
CurCwnd 10136
MaxCwnd 173760
CurSsthresh 94894680
LimCwnd 4294965848
MaxSsthresh 94894680
MinSsthresh 46336
FastRetran 3
Timeouts 0
SubsequentTimeouts 0
CurTimeoutCount 0
AbruptTimeouts 0
PktsRetrans 7
BytesRetrans 10136
DupAcksIn 59484
SACKsRcvd 0
SACKBlocksRcvd 0
PreCongSumCwnd 286704
PreCongSumRTT 12
PostCongSumRTT 11
PostCongCountRTT 3
ECERcvd 0
SendStall 0
QuenchRcvd 0
RetranThresh 23
NonRecovDA 0
AckAfterFR 0
DSACKDups 0
SampleRTT 4
SmoothedRTT 4
RTTVar 50
MaxRTT 6
MinRTT 2
SumRTT 159332
CountRTT 48291
CurRTO 204
MaxRTO 204
MinRTO 203
CurMSS 1448
MaxMSS 1448
MinMSS 524
X_Sndbuf 451584
X_Rcvbuf 87380
CurRetxQueue 0
MaxRetxQueue 0
CurAppWQueue 373584
MaxAppWQueue 454672
CurRwinSent 5888
MaxRwinSent 5888
MinRwinSent 5840
LimRwin 0
DupAcksOut 0
CurReasmQueue 0
MaxReasmQueue 0
CurAppRQueue 0
MaxAppRQueue 0
X_rcv_ssthresh 5840
X_wnd_clamp 64087
X_dbg1 5888
X_dbg2 536
X_dbg3 5840
X_dbg4 0
CurRwinRcvd 3137536
MaxRwinRcvd 3137536
MinRwinRcvd 17896
LocalAddressType 1
LocalAddress 198.2.1.2
LocalPort 43578
RemAddress 131.225.2.16
RemPort 5001
X_RcvRTT 0
[root@gw004 ipv4]#
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question
2008-04-15 16:39 ` Wenji Wu
@ 2008-04-15 17:01 ` John Heffner
2008-04-15 17:08 ` Ilpo Järvinen
2008-04-15 17:55 ` Wenji Wu
0 siblings, 2 replies; 56+ messages in thread
From: John Heffner @ 2008-04-15 17:01 UTC (permalink / raw)
To: Wenji Wu; +Cc: Ilpo Järvinen, Netdev
On Tue, Apr 15, 2008 at 9:39 AM, Wenji Wu <wenji@fnal.gov> wrote:
> SlowStart 0
> CongAvoid 0
> CongestionSignals 3
> OtherReductions 13166
> X_OtherReductionsCV 0
> X_OtherReductionsCM 13166
> CongestionOverCount 37
> CurCwnd 10136
>
> MaxCwnd 173760
> CurSsthresh 94894680
> LimCwnd 4294965848
> MaxSsthresh 94894680
> MinSsthresh 46336
We can see that in both cases you are getting throttled by
tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why
it's reaching this code - I would have thought that the high
tp->reordering would prevent this. Ilpo, do you have any insights?
It's not all that surprising that packets_in_flight is a higher value
with newreno than sack, which would explain the higher window with
newreno.
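(For reference, the moderation step being discussed is essentially the
following; this paraphrases the 2.6.24-era net/ipv4/tcp_input.c rather
than quoting it verbatim:)

    /* Clamp cwnd to the in-flight estimate plus a small burst allowance.
     * With tcp_max_burst() fixed at 3, an ACK processed in the disorder
     * state re-clamps cwnd every time, so persistent reordering keeps
     * pulling cwnd back down.
     */
    static inline void tcp_moderate_cwnd(struct tcp_sock *tp)
    {
            tp->snd_cwnd = min(tp->snd_cwnd,
                               tcp_packets_in_flight(tp) + tcp_max_burst(tp));
            tp->snd_cwnd_stamp = tcp_time_stamp;
    }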
Wenji, the web100 kernel has a sysctl - WAD_MaxBurst. I suspect it
may make a significant difference if you set this to a large value.
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question
2008-04-15 17:01 ` John Heffner
@ 2008-04-15 17:08 ` Ilpo Järvinen
2008-04-15 17:23 ` John Heffner
2008-04-15 17:55 ` Wenji Wu
1 sibling, 1 reply; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-15 17:08 UTC (permalink / raw)
To: John Heffner; +Cc: Wenji Wu, Netdev
On Tue, 15 Apr 2008, John Heffner wrote:
> On Tue, Apr 15, 2008 at 9:39 AM, Wenji Wu <wenji@fnal.gov> wrote:
> > SlowStart 0
> > CongAvoid 0
> > CongestionSignals 3
> > OtherReductions 13166
> > X_OtherReductionsCV 0
> > X_OtherReductionsCM 13166
> > CongestionOverCount 37
> > CurCwnd 10136
> >
> > MaxCwnd 173760
> > CurSsthresh 94894680
> > LimCwnd 4294965848
> > MaxSsthresh 94894680
> > MinSsthresh 46336
>
>
> We can see that in both cases you are getting throttled by
> tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why
> it's reaching this code - I would have thought that the high
> tp->reordering would prevent this. Ilpo, do you have any insights?
What makes you think so? It's called from tcp_try_to_open as anyone can
read from the source, basically when our state is CA_Disorder (some very
small portion might happen in ca_recovery besides that).
--
i.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question
2008-04-15 17:08 ` Ilpo Järvinen
@ 2008-04-15 17:23 ` John Heffner
2008-04-15 18:00 ` Matt Mathis
0 siblings, 1 reply; 56+ messages in thread
From: John Heffner @ 2008-04-15 17:23 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: Wenji Wu, Netdev
On Tue, Apr 15, 2008 at 10:08 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Tue, 15 Apr 2008, John Heffner wrote:
> > We can see that in both cases you are getting throttled by
> > tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why
> > it's reaching this code - I would have thought that the high
> > tp->reordering would prevent this. Ilpo, do you have any insights?
>
> What makes you think so? It's called from tcp_try_to_open as anyone can
> read from the source, basically when our state is CA_Disorder (some very
> small portion might happen in ca_recovery besides that).
This is what X_OtherReductionsCM instruments, and that was the only
thing holding back cwnd.
I just looked at the source, and indeed it will be called on every ack
when we are in the disorder state. Limiting cwnd to
packets_in_flight() + 3 here is going to prevent cwnd from growing
when the reordering is greater than 3. Making max_burst at least
tp->reordering should help some, though I'm not sure it's the right
thing to do.
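(A hypothetical illustration with made-up numbers: suppose cwnd is 120 and
reordering is holding 20 segments out of order, so packets_in_flight()
reads 100. Every ACK processed in the disorder state then clamps cwnd to
100 + 3 = 103, so by the time the hole fills, cwnd has been pulled down
even though nothing was lost. With max_burst raised to a tp->reordering
of 20, the clamp would be 100 + 20 = 120 and cwnd would survive intact.)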
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: RE: A Linux TCP SACK Question
2008-04-15 17:01 ` John Heffner
2008-04-15 17:08 ` Ilpo Järvinen
@ 2008-04-15 17:55 ` Wenji Wu
1 sibling, 0 replies; 56+ messages in thread
From: Wenji Wu @ 2008-04-15 17:55 UTC (permalink / raw)
To: 'John Heffner'; +Cc: 'Ilpo Järvinen', 'Netdev'
>We can see that in both cases you are getting throttled by
>tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why
>it's reaching this code - I would have thought that the high
>tp->reordering would prevent this. Ilpo, do you have any insights?
>It's not all that surprising that packets_in_flight is a higher value
>with newreno than sack, which would explain the higher window with
>newreno.
>Wenji, the web100 kernel has a sysctl - WAD_MaxBurst. I suspect it
>may make a significant difference if you set this to a large value.
It is surprising! When I increase WAD_MaxBurst (patched with Web100) from 3
to 20, the throughput in both cases (SACK on/off) saturates the 1Gbps
link!
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question
2008-04-15 17:23 ` John Heffner
@ 2008-04-15 18:00 ` Matt Mathis
0 siblings, 0 replies; 56+ messages in thread
From: Matt Mathis @ 2008-04-15 18:00 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: John Heffner, Wenji Wu, Netdev
In some future kernel release, I would consider changing it to limit cwnd to
be less than packets_in_flight() + reorder + 3(?). If the network is
reordering packets, then it has to accept bursts, otherwise TCP can never open
the window. The +3 (or some other constant) is still needed because TCP has
to send extra packets at the point where the window changes.
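(In code terms, Matt's suggestion would amount to something like the
following; a hypothetical sketch against the tcp_max_burst() helper from
John's patch, not an actual submission:)

    /* Hypothetical: allow a burst of the current reordering estimate
     * plus a small constant, rather than either one alone.
     */
    static __inline__ __u32 tcp_max_burst(const struct tcp_sock *tp)
    {
            return tp->reordering + 3;
    }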
As an alternative, you could write a research paper on how the network could
do LIFO packet scheduling so the reordering serves as a congestion signal to
the stacks. I bet it would have some really interesting properties. Oh wait,
April 1st was 2 weeks ago.
Thanks,
--MM--
On Tue, 15 Apr 2008, John Heffner wrote:
> On Tue, Apr 15, 2008 at 10:08 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
>> On Tue, 15 Apr 2008, John Heffner wrote:
>> > We can see that in both cases you are getting throttled by
>> > tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why
>> > it's reaching this code - I would have thought that the high
>> > tp->reordering would prevent this. Ilpo, do you have any insights?
>>
>> What makes you think so? It's called from tcp_try_to_open as anyone can
>> read from the source, basically when our state is CA_Disorder (some very
>> small portion might happen in ca_recovery besides that).
>
> This is what X_OtherReductionsCM instruments, and that was the only
> thing holding back cwnd.
>
> I just looked at the source, and indeed it will be called on every ack
> when we are in the disorder state. Limiting cwnd to
> packets_in_flight() + 3 here is going to prevent cwnd from growing
> when the reordering is greater than 3. Making max_burst at least
> tp->reordering should help some, though I'm not sure it's the right
> thing to do.
>
> -John
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question
2008-04-15 8:25 ` Ilpo Järvinen
@ 2008-04-15 18:01 ` Wenji Wu
2008-04-15 22:40 ` John Heffner
0 siblings, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-15 18:01 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: John Heffner, Netdev
> No, this is not right. The old_ack happens only if left edge
> backtracks, in which case we obviously should discard as it's stale
> information (except SACK may reveal something not yet known which is
> why sacktag is called there). This same applies regardless of SACK (no
> tagging of course).
Yes, I misrepresented myself in the last email. What I meant is the left-edge backtrack case, as you have pointed out.
>
> I think you might have found a bug, though it won't affect you; it
> actually makes that check easier to pass:
>
> Questionable thing is that || in tcp_may_raise_cwnd (might not be
> intentional)...
>
> But in your case, during initial slow-start that condition in
> tcp_may_raise_cwnd will always be true (if your metrics are cleared as
> they should be). Because: (...not important || 1) && 1, because cwnd <
> ssthresh. After that, when you don't have ECE nor are in recovery,
> tcp_may_raise_cwnd results in this: (1 || ...not calculated) && 1, so it
> should always allow the increment in your case except when in recovery,
> which hardly makes up for the difference you're seeing...
You are right, I just printed out the return value of tcp_may_raise_cwnd(). It is always 1!
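(The check in question looks roughly like this; paraphrased from the
2.6.24-era net/ipv4/tcp_input.c, with the debated || marked:)

    /* May this ACK raise cwnd? True unless ECE was seen while already
     * above ssthresh, or we are in the CWR/Recovery states.
     */
    static inline int tcp_may_raise_cwnd(const struct sock *sk, const int flag)
    {
            const struct tcp_sock *tp = tcp_sk(sk);
            return (!(flag & FLAG_ECE) ||                /* the || in question */
                    tp->snd_cwnd < tp->snd_ssthresh) &&
                   !((1 << inet_csk(sk)->icsk_ca_state) &
                     (TCPF_CA_Recovery | TCPF_CA_CWR));
    }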
> This would only make a difference if any of those SACK blocks were new.
> If they're not, DATA_SACKED_ACKED won't be set in flag.
>
> > > Not quite sure, just a guess.
>
> You seem to be missing the third case, which I tried to point out
> earlier: the case where the left edge remains the same. I think it makes
> a huge difference here (I'll analyse the non-recovery case):
>
> NewReno always goes to fastretrans_alert, to the default branch, and
> because it's is_dupack, it increments sacked_out through
> tcp_add_reno_sack. Effectively packets_in_flight is reduced by one and
> TCP is able to send a new segment out.
>
> Now with SACK there are two cases:
>
> SACK and newly discovered SACK info (for simplicity, let's assume just
> one newly discovered sacked segment). Sacktag marks that segment and
> increments sacked_out, effectively making packets_in_flight equal to the
> NewReno case. It goes to fastretrans_alert and makes all the same
> maneuvers as NewReno (except if enough SACK blocks have arrived to
> trigger recovery while NewReno would not have enough dupACKs collected;
> I doubt that this makes the difference though, I'll need no-metrics logs
> to verify the number of recoveries, to confirm that they're quite few).
>
> SACK and no new SACK info. Sacktag won't find anything to mark, thus
> sacked_out remains the same. It goes to fastretrans_alert because
> ca_state is CA_Disorder. But now we did lose one segment compared with
> NewReno, because we didn't increment sacked_out, so packets_in_flight
> stays where it was. Thus we cannot send a new data segment out and we
> fall behind NewReno.
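(For reference, the accounting Ilpo walks through rests on these helpers,
paraphrased from the 2.6-era include/net/tcp.h:)

    /* Segments estimated to have left the network, out of order or lost. */
    static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
    {
            return tp->sacked_out + tp->lost_out;
    }

    /* The estimate cwnd is compared against before sending new data. */
    static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp)
    {
            return tp->packets_out - tcp_left_out(tp) + tp->retrans_out;
    }

A NewReno dupACK bumps sacked_out, lowering the in-flight estimate by one
and clocking out a new segment; a SACK ACK carrying no new information
leaves sacked_out untouched, which is exactly the one-segment difference
described above.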
I agree with you. Thanks, you gave me a good class on the Linux ACK/SACK implementation.
> > I had considered this, but it would seem that tcp_may_raise_cwnd() in
> > this case *should* return true, right?
>
> Yes, it seems. Though I think that it's unintentional. I'd say that that
> || should be && but I might be wrong.
Yes, it is all true!
wenji
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question
2008-04-15 18:01 ` Wenji Wu
@ 2008-04-15 22:40 ` John Heffner
2008-04-16 8:27 ` David Miller
2008-04-16 14:46 ` Wenji Wu
0 siblings, 2 replies; 56+ messages in thread
From: John Heffner @ 2008-04-15 22:40 UTC (permalink / raw)
To: Wenji Wu; +Cc: Ilpo Järvinen, Netdev
Wenji, can you try this out? Patch against net-2.6.26.
Thanks,
-John
From 4cb2a9fd1d497b02bfdd06f71b499d441ca10aee Mon Sep 17 00:00:00 2001
From: John Heffner <johnwheffner@gmail.com>
Date: Tue, 15 Apr 2008 15:26:39 -0700
Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.
This change is necessary to allow cwnd to grow during persistent
reordering. Cwnd moderation is applied when in the disorder state
and an ack that fills the hole comes in. If the hole was greater
than 3 packets, but less than tp->reordering, cwnd will shrink when
it should not have.
Signed-off-by: John Heffner <jheffner@napa.(none)>
---
include/net/tcp.h | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2c14edf..633147c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -787,11 +787,14 @@ extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst);
/* Slow start with delack produces 3 packets of burst, so that
- * it is safe "de facto".
+ * it is safe "de facto". This will be the default - same as
+ * the default reordering threshold - but if reordering increases,
+ * we must be able to allow cwnd to burst at least this much in order
+ * to not pull it back when holes are filled.
*/
static __inline__ __u32 tcp_max_burst(const struct tcp_sock *tp)
{
- return 3;
+ return tp->reordering;
}
/* Returns end sequence number of the receiver's advertised window */
--
1.5.2.5
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: A Linux TCP SACK Question
2008-04-15 22:40 ` John Heffner
@ 2008-04-16 8:27 ` David Miller
2008-04-16 9:21 ` Ilpo Järvinen
2008-04-16 14:46 ` Wenji Wu
1 sibling, 1 reply; 56+ messages in thread
From: David Miller @ 2008-04-16 8:27 UTC (permalink / raw)
To: johnwheffner; +Cc: wenji, ilpo.jarvinen, netdev
From: "John Heffner" <johnwheffner@gmail.com>
Date: Tue, 15 Apr 2008 15:40:05 -0700
> Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.
>
> This change is necessary to allow cwnd to grow during persistent
> reordering. Cwnd moderation is applied when in the disorder state
> and an ack that fills the hole comes in. If the hole was greater
> than 3 packets, but less than tp->reordering, cwnd will shrink when
> it should not have.
>
> Signed-off-by: John Heffner <jheffner@napa.(none)>
I think this patch is correct, or at least more correct than what
this code is doing right now.
Any objections to my adding this to net-2.6.26?
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: A Linux TCP SACK Question
2008-04-16 8:27 ` David Miller
@ 2008-04-16 9:21 ` Ilpo Järvinen
2008-04-16 9:35 ` David Miller
2008-04-16 14:40 ` A Linux TCP SACK Question John Heffner
0 siblings, 2 replies; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-16 9:21 UTC (permalink / raw)
To: David Miller; +Cc: johnwheffner, wenji, Netdev
On Wed, 16 Apr 2008, David Miller wrote:
> From: "John Heffner" <johnwheffner@gmail.com>
> Date: Tue, 15 Apr 2008 15:40:05 -0700
>
> > Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.
> >
> > This change is necessary to allow cwnd to grow during persistent
> > reordering. Cwnd moderation is applied when in the disorder state
> > and an ack that fills the hole comes in. If the hole was greater
> > than 3 packets, but less than tp->reordering, cwnd will shrink when
> > it should not have.
> >
> > Signed-off-by: John Heffner <jheffner@napa.(none)>
>
> I think this patch is correct, or at least more correct than what
> this code is doing right now.
>
> Any objections to my adding this to net-2.6.26?
I don't have objections.
But I want to note that tp->reordering does not consider the situation on
that specific ACK, because its value might originate from a number of
segments, and even RTTs, back. I think it could be possible to find a more
appropriate value for max_burst locally to an ACK. ...Though it might be a
bit of an over-engineered solution. For SACK we calculate a similar metric
anyway in tcp_clean_rtx_queue to find out if tp->reordering needs to be
updated at a cumulative ACK, and for NewReno min(tp->sacked_out,
tp->reordering) + 3 could perhaps be used (I'm not sure if these would be
foolproof in recovery though).
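(As a purely hypothetical sketch of that per-ACK bound, not kernel code:)

    /* Hypothetical per-ACK burst bound for NewReno, per the note above. */
    max_burst = min(tp->sacked_out, tp->reordering) + 3;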
--
i.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: A Linux TCP SACK Question
2008-04-16 9:21 ` Ilpo Järvinen
@ 2008-04-16 9:35 ` David Miller
2008-04-16 14:50 ` Wenji Wu
2008-08-27 14:38 ` about Linux adaptively adjusting ssthresh Wenji Wu
2008-04-16 14:40 ` A Linux TCP SACK Question John Heffner
1 sibling, 2 replies; 56+ messages in thread
From: David Miller @ 2008-04-16 9:35 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: johnwheffner, wenji, netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 16 Apr 2008 12:21:38 +0300 (EEST)
> On Wed, 16 Apr 2008, David Miller wrote:
>
> > From: "John Heffner" <johnwheffner@gmail.com>
> > Date: Tue, 15 Apr 2008 15:40:05 -0700
> >
> > > Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.
...
> > Any objections to my adding this to net-2.6.26?
>
> I don't have objections.
>
> But I want to note that tp->reordering does not consider the situation on
> that specific ACK, because its value might originate from a number of
> segments, and even RTTs, back. I think it could be possible to find a more
> appropriate value for max_burst locally to an ACK. ...Though it might be a
> bit of an over-engineered solution. For SACK we calculate a similar metric
> anyway in tcp_clean_rtx_queue to find out if tp->reordering needs to be
> updated at a cumulative ACK, and for NewReno min(tp->sacked_out,
> tp->reordering) + 3 could perhaps be used (I'm not sure if these would be
> foolproof in recovery though).
Right, we can tweak this thing further later.
*beep* *beep*
I've added John's patch to net-2.6.26
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: A Linux TCP SACK Question
2008-04-16 9:21 ` Ilpo Järvinen
2008-04-16 9:35 ` David Miller
@ 2008-04-16 14:40 ` John Heffner
2008-04-16 15:03 ` Ilpo Järvinen
1 sibling, 1 reply; 56+ messages in thread
From: John Heffner @ 2008-04-16 14:40 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: David Miller, wenji, Netdev
On Wed, Apr 16, 2008 at 2:21 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
>
> On Wed, 16 Apr 2008, David Miller wrote:
>
> > From: "John Heffner" <johnwheffner@gmail.com>
> > Date: Tue, 15 Apr 2008 15:40:05 -0700
> >
> > > Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.
> > >
> > > This change is necessary to allow cwnd to grow during persistent
> > > reordering. Cwnd moderation is applied when in the disorder state
> > > and an ack that fills the hole comes in. If the hole was greater
> > > than 3 packets, but less than tp->reordering, cwnd will shrink when
> > > it should not have.
> > >
> > > Signed-off-by: John Heffner <jheffner@napa.(none)>
> >
> > I think this patch is correct, or at least more correct than what
> > this code is doing right now.
> >
> > Any objections to my adding this to net-2.6.26?
>
> I don't have objections.
>
> But I want to note that tp->reordering does not consider the situation on
> that specific ACK, because its value might originate from a number of
> segments, and even RTTs, back. I think it could be possible to find a more
> appropriate value for max_burst locally to an ACK. ...Though it might be a
> bit of an over-engineered solution. For SACK we calculate a similar metric
> anyway in tcp_clean_rtx_queue to find out if tp->reordering needs to be
> updated at a cumulative ACK, and for NewReno min(tp->sacked_out,
> tp->reordering) + 3 could perhaps be used (I'm not sure if these would be
> foolproof in recovery though).
Reordering is generally a random process resulting from a packet
traversing parallel queues. (In the case of netem, the random process
is explicitly defined by simulation.) As reordering is created by
packets sitting in queues, these queues *should* be able to absorb a
burst of at least the reordering size. That's at least my
justification for using the reordering threshold as max_burst, along
with the fact that it should prevent cwnd from getting clamped.
Anyway, max_burst isn't a standard. TCP makes no guarantees that it
won't burst a full window. If anything, I actually think that in most
cases we'd be better off without it. It's harmful to high-BDP flows
because it pulls down cwnd, which has a long-term effect in response
to a short-term event.
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: RE: A Linux TCP SACK Question
2008-04-15 22:40 ` John Heffner
2008-04-16 8:27 ` David Miller
@ 2008-04-16 14:46 ` Wenji Wu
1 sibling, 0 replies; 56+ messages in thread
From: Wenji Wu @ 2008-04-16 14:46 UTC (permalink / raw)
To: 'John Heffner'; +Cc: 'Ilpo Järvinen', 'Netdev'
>Wenji, can you try this out? Patch against net-2.6.26.
I just try with the new patch. It works, saturating the 1Gbps link.
The experiment works as:
Sender --- Router --- Receiver
Iperf is sending from the sender to the receiver. In between there is an
emulated router which runs netem. The emulated router has two interfaces,
both with netem configured. One interface emulates the forward path and the
other for the reverse path. Both netem interfaces are configured with 1.5ms
delay and 0.15ms variance. No packet drops. Kernel 2.6.25-rc9 patched with
the file you provided
Thanks,
wenji
^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: A Linux TCP SACK Question
2008-04-16 9:35 ` David Miller
@ 2008-04-16 14:50 ` Wenji Wu
2008-04-18 6:52 ` David Miller
2008-08-27 14:38 ` about Linux adaptively adjusting ssthresh Wenji Wu
1 sibling, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-04-16 14:50 UTC (permalink / raw)
To: 'David Miller', ilpo.jarvinen; +Cc: johnwheffner, netdev
>Right, we can tweak this thing further later.
>*beep* *beep*
>I've added John's patch to net-2.6.26
I just tried with John's patch. It works, saturating the 1Gbps in my test.
Without the patch, the throughput is around 180Mbps with SACK On, 250Mbps
with SACK off.
The same test environment described in my previous emails.
wenji
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: A Linux TCP SACK Question
2008-04-16 14:40 ` A Linux TCP SACK Question John Heffner
@ 2008-04-16 15:03 ` Ilpo Järvinen
0 siblings, 0 replies; 56+ messages in thread
From: Ilpo Järvinen @ 2008-04-16 15:03 UTC (permalink / raw)
To: John Heffner; +Cc: David Miller, wenji, Netdev
On Wed, 16 Apr 2008, John Heffner wrote:
> On Wed, Apr 16, 2008 at 2:21 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > But I want to note that tp->reordering does not consider the situation on
> > that specific ACK, because its value might originate from a number of
> > segments, and even RTTs, back. I think it could be possible to find a more
> > appropriate value for max_burst locally to an ACK. ...Though it might be a
> > bit of an over-engineered solution. For SACK we calculate a similar metric
> > anyway in tcp_clean_rtx_queue to find out if tp->reordering needs to be
> > updated at a cumulative ACK, and for NewReno min(tp->sacked_out,
> > tp->reordering) + 3 could perhaps be used (I'm not sure if these would be
> > foolproof in recovery though).
>
> Reordering is generally a random process resulting from a packet
> traversing parallel queues. (In the case of netem, the random process
> is explicitly defined by simulation.) As reordering is created by
> packets sitting in queues, these queues *should* be able to absorb a
> burst of at least the reordering size. That's at least my
> justification for using the reordering threshold as max_burst, along
> with the fact that it should prevent cwnd from getting clamped.
Sure, but combined with other phenomena such as ACK compression (and an
appropriate ACK pattern & prior TCP state), one might end up generating
much larger bursts than just tp->reordering. Though it's probably not any
worse than what ACK compression can already cause, e.g. after a spurious
RTO. And one is quite guaranteed to run out of something else before
things get too nasty.
--
i.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: A Linux TCP SACK Question
2008-04-16 14:50 ` Wenji Wu
@ 2008-04-18 6:52 ` David Miller
0 siblings, 0 replies; 56+ messages in thread
From: David Miller @ 2008-04-18 6:52 UTC (permalink / raw)
To: wenji; +Cc: ilpo.jarvinen, johnwheffner, netdev
From: Wenji Wu <wenji@fnal.gov>
Date: Wed, 16 Apr 2008 09:50:19 -0500
> >I've added John's patch to net-2.6.26
>
> I just tried with John's patch. It works, saturating the 1Gbps in my test.
>
> Without the patch, the throughput is around 180Mbps with SACK On, 250Mbps
> with SACK off.
>
> The same test environment described in my previous emails.
After this patch cooks for a couple more days I'll submit it
to -stable.
Thanks for your report and all of your testing Wenji.
Thanks John for the patch.
^ permalink raw reply [flat|nested] 56+ messages in thread
* about Linux adaptively adjusting ssthresh
2008-04-16 9:35 ` David Miller
2008-04-16 14:50 ` Wenji Wu
@ 2008-08-27 14:38 ` Wenji Wu
2008-08-27 22:48 ` John Heffner
1 sibling, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-08-27 14:38 UTC (permalink / raw)
To: 'David Miller', ilpo.jarvinen; +Cc: johnwheffner, netdev
Hi, all,
Could anybody help me out with Linux adaptively adjusting ssthresh? Thanks
in advance.
I understand that the latest Linux is able to adaptively adjust ssthresh to
avoid retransmission. Could anybody tell me which algorithms have been
implemented for the adaptive ssthresh adjustment?
Thanks,
wenji
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: about Linux adaptively adjusting ssthresh
2008-08-27 14:38 ` about Linux adaptively adjusting ssthresh Wenji Wu
@ 2008-08-27 22:48 ` John Heffner
2008-08-28 0:53 ` Wenji Wu
0 siblings, 1 reply; 56+ messages in thread
From: John Heffner @ 2008-08-27 22:48 UTC (permalink / raw)
To: wenji; +Cc: David Miller, ilpo.jarvinen, netdev
On Wed, Aug 27, 2008 at 7:38 AM, Wenji Wu <wenji@fnal.gov> wrote:
>
> Hi, all,
>
> Could anybody help me out with Linux adaptively adjusting ssthresh? Thanks
> in advance.
>
> I understand that the latest Linux is able to adaptively adjust ssthresh to
> avoid retransmission. Could anybody tell me which algorithms have been
> implemented for the adaptive ssthresh adjust?
A little more detail would be helpful. Are you referring to caching
ssthresh between connections, or something going on during a
connection? Various congestion control modules use ssthresh
differently, so a comprehensive answer would be difficult.
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: about Linux adaptively adjusting ssthresh
2008-08-27 22:48 ` John Heffner
@ 2008-08-28 0:53 ` Wenji Wu
2008-08-28 6:34 ` Ilpo Järvinen
0 siblings, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-08-28 0:53 UTC (permalink / raw)
To: John Heffner; +Cc: David Miller, ilpo.jarvinen, netdev
> A little more detail would be helpful. Are you referring to caching
> ssthresh between connections, or something going on during a
> connection? Various congestion control modules use ssthresh
> differently, so a comprehensive answer would be difficult.
Thanks John, I am referring to adaptive ssthresh adjustment during a connection.
thanks,
wenji
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: about Linux adaptively adjusting ssthresh
2008-08-28 0:53 ` Wenji Wu
@ 2008-08-28 6:34 ` Ilpo Järvinen
2008-08-28 14:20 ` about Linux adaptively adjusting dupthresh Wenji Wu
0 siblings, 1 reply; 56+ messages in thread
From: Ilpo Järvinen @ 2008-08-28 6:34 UTC (permalink / raw)
To: Wenji Wu; +Cc: John Heffner, David Miller, Netdev
On Wed, 27 Aug 2008, Wenji Wu wrote:
>
> > A little more detail would be helpful. Are you referring to caching
> > ssthresh between connections, or something going on during a
> > connection? Various congestion control modules use ssthresh
> > differently, so a comprehensive answer would be difficult.
>
>
> Thanks John, I am referring to the adaptive ssthresh adjusting during a
> connection.
???
Every now and then (once we detect some losses) snd_ssthresh is set to
half the flight size, as given by, well, you know those standards that say
something about it :-). So I (like John) seem to somewhat miss the point
of your question here.
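(The stock rule Ilpo alludes to, paraphrasing the 2.6-era
net/ipv4/tcp_cong.c; note Linux halves cwnd rather than flight size, and
individual congestion control modules may override this hook:)

    /* Standard halving on loss detection. */
    u32 tcp_reno_ssthresh(struct sock *sk)
    {
            const struct tcp_sock *tp = tcp_sk(sk);
            return max(tp->snd_cwnd >> 1U, 2U);
    }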
Or did you perhaps mean rcv_ssthresh (which I would never refer to with a
plain "ssthresh")?
--
i.
^ permalink raw reply [flat|nested] 56+ messages in thread
* about Linux adaptively adjusting dupthresh
2008-08-28 6:34 ` Ilpo Järvinen
@ 2008-08-28 14:20 ` Wenji Wu
2008-08-28 18:53 ` Ilpo Järvinen
0 siblings, 1 reply; 56+ messages in thread
From: Wenji Wu @ 2008-08-28 14:20 UTC (permalink / raw)
To: 'Ilpo Järvinen'
Cc: 'John Heffner', 'David Miller', 'Netdev'
Sorry, I made a mistake in the last post; what I meant was "algorithms
that adaptively adjust the TCP reordering threshold, dupthresh".
I understand that "Eifel algorithm" or "DSACK TCP" will adaptively adjust
dupthresh to deal with packet reordering. Are there any other
reordering-tolerant algorithms implemented in Linux?
Thanks,
wenji
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: about Linux adaptively adjusting dupthresh
2008-08-28 14:20 ` about Linux adaptivly adjusting dupthresh Wenji Wu
@ 2008-08-28 18:53 ` Ilpo Järvinen
2008-08-28 19:30 ` Wenji Wu
0 siblings, 1 reply; 56+ messages in thread
From: Ilpo Järvinen @ 2008-08-28 18:53 UTC (permalink / raw)
To: Wenji Wu; +Cc: 'John Heffner', 'David Miller', 'Netdev'
On Thu, 28 Aug 2008, Wenji Wu wrote:
> Sorry, I made a mistake in the last post; what I meant was "algorithms
> that adaptively adjust the TCP reordering threshold, dupthresh".
Ah, that makes much more sense. :-)
> I understand that "Eifel algorithm" or "DSACK TCP" will adaptively adjust
> dupthresh to deal with packet reordering. Are there any other
> reordering-tolerant algorithms implemented in Linux?
First about adaptive dupthresh:
In addition to DSACK, we use cumulative ACKs of never-retransmitted blocks
to increase the dupthresh (see tcp_clean_rtx_queue). Then there's some
NewReno logic for when dupacks > packets_out, though I've never fully
figured out whether the + tp->packets_out there does the correct thing
beyond the simplest case (see tcp_check_reno_reordering).
I don't think that Eifel adjusts dupthresh, though it can remove the
retransmission ambiguity problem and thus lets us use the
never-retransmitted-block-acked detection more often.
Also, there's some added logic for the small-window case to reduce
dupthresh temporarily (down to 3, or whatever the default is, at the
smallest) if the window is not large enough to generate the incremented
dupthresh's worth of dupACKs (see tcp_time_to_recover).
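(A paraphrased skeleton of that update path, based on the 2.6-era
tcp_update_reordering(), with bookkeeping details trimmed:)

    /* Raise the estimate when a cumulative ACK or DSACK shows it was too
     * low, clamp it, and fall back from FACK to plain SACK accounting,
     * since FACK assumes no reordering.
     */
    if (metric > tp->reordering) {
            tp->reordering = min(TCP_MAX_REORDERING, metric);
            tcp_disable_fack(tp);
    }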
Again, I'm not too sure what you mean by "reordering tolerant", but here
are some things that may be related:
FACK -> RFC 3517 auto-fallback when reordering is detected (basically,
holes are only counted toward the more-than-dupthresh check with FACK).
I guess Eifel-like timestamp checking belongs in this category too (in
tcp_try_undo_partial).
If a latency spike + reordering occurs, SACK FRTO might help, but I think
it depends on the scenario.
--
i.
^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: about Linux adaptively adjusting dupthresh
2008-08-28 18:53 ` Ilpo Järvinen
@ 2008-08-28 19:30 ` Wenji Wu
0 siblings, 0 replies; 56+ messages in thread
From: Wenji Wu @ 2008-08-28 19:30 UTC (permalink / raw)
To: 'Ilpo Järvinen'
Cc: 'John Heffner', 'David Miller', 'Netdev'
Thanks,
^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads: [~2008-08-28 19:30 UTC | newest]
Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-04 4:54 A Linux TCP SACK Question Wenji Wu
2008-04-04 16:27 ` John Heffner
2008-04-04 17:49 ` Wenji Wu
2008-04-04 18:07 ` John Heffner
2008-04-04 20:00 ` Ilpo Järvinen
2008-04-04 20:07 ` Wenji Wu
2008-04-04 21:15 ` Wenji Wu
2008-04-04 21:33 ` Ilpo Järvinen
2008-04-04 21:39 ` Ilpo Järvinen
2008-04-04 22:14 ` Wenji Wu
2008-04-05 17:42 ` Ilpo Järvinen
2008-04-05 21:17 ` Sangtae Ha
2008-04-06 20:27 ` Wenji Wu
2008-04-06 22:43 ` Sangtae Ha
2008-04-07 14:56 ` Wenji Wu
2008-04-08 6:36 ` Ilpo Järvinen
2008-04-08 12:33 ` Wenji Wu
2008-04-08 13:45 ` Ilpo Järvinen
2008-04-08 14:30 ` Wenji Wu
2008-04-08 14:59 ` Ilpo Järvinen
2008-04-08 15:27 ` Wenji Wu
2008-04-08 17:26 ` Ilpo Järvinen
2008-04-14 22:47 ` Wenji Wu
2008-04-15 0:48 ` John Heffner
2008-04-15 8:25 ` Ilpo Järvinen
2008-04-15 18:01 ` Wenji Wu
2008-04-15 22:40 ` John Heffner
2008-04-16 8:27 ` David Miller
2008-04-16 9:21 ` Ilpo Järvinen
2008-04-16 9:35 ` David Miller
2008-04-16 14:50 ` Wenji Wu
2008-04-18 6:52 ` David Miller
2008-08-27 14:38 ` about Linux adaptively adjusting ssthresh Wenji Wu
2008-08-27 22:48 ` John Heffner
2008-08-28 0:53 ` Wenji Wu
2008-08-28 6:34 ` Ilpo Järvinen
2008-08-28 14:20 ` about Linux adaptively adjusting dupthresh Wenji Wu
2008-08-28 18:53 ` Ilpo Järvinen
2008-08-28 19:30 ` Wenji Wu
2008-04-16 14:40 ` A Linux TCP SACK Question John Heffner
2008-04-16 15:03 ` Ilpo Järvinen
2008-04-16 14:46 ` Wenji Wu
2008-04-15 15:45 ` Wenji Wu
2008-04-15 16:39 ` Wenji Wu
2008-04-15 17:01 ` John Heffner
2008-04-15 17:08 ` Ilpo Järvinen
2008-04-15 17:23 ` John Heffner
2008-04-15 18:00 ` Matt Mathis
2008-04-15 17:55 ` Wenji Wu
2008-04-08 15:57 ` John Heffner
2008-04-08 14:07 ` John Heffner
2008-04-14 16:10 ` Wenji Wu
2008-04-14 16:48 ` Ilpo Järvinen
2008-04-14 22:07 ` Wenji Wu
2008-04-15 8:23 ` Ilpo Järvinen
2008-04-04 21:40 ` Wenji Wu