* A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-04  4:54 UTC
To: netdev

Hi,

Could anybody help me out with Linux TCP SACK? Thanks in advance.

I run iperf to send traffic from a sender to a receiver, and add packet
reordering in both the forward and reverse directions. I found that when I
turn the SACK/DSACK option off, the throughput is better than with
SACK/DSACK on. How can this happen? Has anybody encountered this phenomenon
before?

thanks,

wenji
* Re: A Linux TCP SACK Question
From: John Heffner @ 2008-04-04 16:27 UTC
To: Wenji Wu; +Cc: netdev

Unless you're sending very fast, where the computational overhead of
processing SACK blocks is slowing you down, this is not expected behavior.
Do you have more detail?  What is the window size, and how much reordering?
Full binary tcpdumps are very useful in diagnosing this type of problem.

  -John

On Thu, Apr 3, 2008 at 9:54 PM, Wenji Wu <wenji@fnal.gov> wrote:
> Hi, could anybody help me out with Linux TCP SACK? Thanks in advance.
>
> I run iperf to send traffic from a sender to a receiver, and add packet
> reordering in both the forward and reverse directions. I found that when
> I turn the SACK/DSACK option off, the throughput is better than with
> SACK/DSACK on. How can this happen? Has anybody encountered this
> phenomenon before?
* RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-04 17:49 UTC
To: 'John Heffner'; +Cc: netdev

Hi, John,

Thanks. I just sat down with Richard Clarson and reproduced the phenomenon.

The experiment setup is:

Sender --- Router --- Receiver

Iperf is sending from the sender to the receiver. In between there is an
emulated router which runs netem. The emulated router has two interfaces,
both with netem configured: one interface emulates the forward path, the
other the reverse path. Both netem interfaces are configured with 1.5ms
delay and 0.15ms variance. No packet drops.

Every system runs Linux 2.6.24.

When SACK is on, the throughput is around 180 Mbps.
When SACK is off, the throughput is around 260 Mbps.

I am sure it is not due to the computational overhead of processing SACK
blocks. All of these systems are multi-core platforms with 2 GHz+ CPUs. I
ran top to verify; the CPUs are idle most of the time.

I was thinking that the reordered ACKs/SACKs might cause confusion in the
sender, and the sender will unnecessarily reduce either the CWND or the
TCP_REORDERING threshold. I might need to take a serious look at the SACK
implementation.

I will send out the tcpdump files soon.

Thanks,

wenji
* Re: A Linux TCP SACK Question
From: John Heffner @ 2008-04-04 18:07 UTC
To: wenji; +Cc: netdev

On Fri, Apr 4, 2008 at 10:49 AM, Wenji Wu <wenji@fnal.gov> wrote:
> I was thinking that the reordered ACKs/SACKs might cause confusion in
> the sender, and the sender will unnecessarily reduce either the CWND or
> the TCP_REORDERING threshold. I might need to take a serious look at
> the SACK implementation.

It sounds very likely that you're encountering a bug or thinko in the sack
code.

This actually brings to mind an old topic -- NCR (RFC 4653). There was
some discussion of implementing this, which I think is simpler and more
robust than Linux's current reordering threshold calculation.

  -John
* RE: A Linux TCP SACK Question 2008-04-04 17:49 ` Wenji Wu 2008-04-04 18:07 ` John Heffner @ 2008-04-04 20:00 ` Ilpo Järvinen 2008-04-04 20:07 ` Wenji Wu 2008-04-04 21:15 ` Wenji Wu 1 sibling, 2 replies; 56+ messages in thread From: Ilpo Järvinen @ 2008-04-04 20:00 UTC (permalink / raw) To: Wenji Wu; +Cc: 'John Heffner', Netdev On Fri, 4 Apr 2008, Wenji Wu wrote: > Every system runs Linux 2.6.24. You should have reported kernel version right from the beginning. It may have a huge effect... ;-) > When sack is on, the throughput is around 180Mbps > When sack is off, the throughput is around 260Mbps Not a surprise, once some reordering is detected, SACK TCP switches away from FACK to something that's not what you'd expect (in 2.6.24), you should try 2.6.25-rcs first in which the non-FACK is very close to RFC3517. > I was thinking that if the reordered ACKs/SACKs cause confusion in the > sender, and sender will unnecessarily reduce either the CWND or the > TCP_REORDERING threshold. I might need to take a serious look at the > SACK implementation. I'd suggest that you don't waste too much effort for 2.6.24. ...Most of it is recoded/updated since then. -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-04 20:07 UTC
To: Ilpo Järvinen; +Cc: 'John Heffner', Netdev

> On Fri, 4 Apr 2008, Wenji Wu wrote:
>
> > Every system runs Linux 2.6.24.
>
> You should have reported the kernel version right from the beginning.
> It may have a huge effect... ;-)
>
> Not a surprise: once some reordering is detected, SACK TCP switches
> away from FACK to something that's not what you'd expect (in 2.6.24).
> You should try the 2.6.25-rcs first, in which the non-FACK behavior is
> very close to RFC 3517.
>
> I'd suggest that you don't waste too much effort on 2.6.24. ...Most of
> it has been recoded/updated since then.

Thanks, I will try it on the latest version and report the results.

wenji
* RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-04 21:15 UTC
To: 'Ilpo Järvinen'; +Cc: 'John Heffner', 'Netdev'

> I'd suggest that you don't waste too much effort on 2.6.24. ...Most of
> it has been recoded/updated since then.

Hi, Ilpo,

I just tried it on 2.6.25-rc8. The result is still the same: the
throughput with SACK on is less than with SACK off.

wenji
* RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-04 21:33 UTC
To: Wenji Wu; +Cc: 'John Heffner', 'Netdev'

On Fri, 4 Apr 2008, Wenji Wu wrote:

> > I'd suggest that you don't waste too much effort on 2.6.24. ...Most
> > of it has been recoded/updated since then.
>
> I just tried it on 2.6.25-rc8. The result is still the same: the
> throughput with SACK on is less than with SACK off.

Hmm, can you also try whether playing around with the FRTO setting makes
some difference (the tcp_frto sysctl)?

--
 i.
* RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-04 21:39 UTC
To: Wenji Wu; +Cc: 'John Heffner', 'Netdev'

On Sat, 5 Apr 2008, Ilpo Järvinen wrote:

> On Fri, 4 Apr 2008, Wenji Wu wrote:
>
> > I just tried it on 2.6.25-rc8. The result is still the same: the
> > throughput with SACK on is less than with SACK off.
>
> Hmm, can you also try whether playing around with the FRTO setting
> makes some difference (the tcp_frto sysctl)?

...Assuming it wasn't disabled already. If you find that there's a
significant difference, you could also try SACK + basic FRTO (set the
tcp_frto sysctl to 1).

--
 i.
* RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-04 22:14 UTC
To: 'Ilpo Järvinen'; +Cc: 'John Heffner', 'Netdev'

> ...Assuming it wasn't disabled already. If you find that there's a
> significant difference, you could also try SACK + basic FRTO (set the
> tcp_frto sysctl to 1).

No, still the same. I tried tcp_frto with 0, 1, and 2.

SACK on is worse than SACK off.

wenji
* RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-05 17:42 UTC
To: Wenji Wu; +Cc: 'John Heffner', 'Netdev'

On Fri, 4 Apr 2008, Wenji Wu wrote:

> > ...Assuming it wasn't disabled already. If you find that there's a
> > significant difference, you could also try SACK + basic FRTO (set
> > the tcp_frto sysctl to 1).
>
> No, still the same. I tried tcp_frto with 0, 1, and 2.
>
> SACK on is worse than SACK off.

No easy solution then; we'll have to take a look at the tcpdumps.

--
 i.
* Re: A Linux TCP SACK Question
From: Sangtae Ha @ 2008-04-05 21:17 UTC
To: wenji; +Cc: Ilpo Järvinen, John Heffner, Netdev

Can you run the attached script and then run your testing again?

I think it might be a problem of your dual cores balancing the interrupts
on your testing NIC. As we do a lot of things with SACK, cache misses
etc. might affect your performance.

In the default setting, I disabled TCP segmentation offload and set SMP
affinity to CPU 0. Please change "INF" to your interface name and let us
know the results.

Sangtae

On Fri, Apr 4, 2008 at 6:14 PM, Wenji Wu <wenji@fnal.gov> wrote:
> No, still the same. I tried tcp_frto with 0, 1, and 2.
>
> SACK on is worse than SACK off.

[-- Attachment #2: tuning.sh --]
* Re: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-06 20:27 UTC
To: Sangtae Ha; +Cc: Ilpo Järvinen, John Heffner, Netdev

> Can you run the attached script and then run your testing again?
>
> I think it might be a problem of your dual cores balancing the
> interrupts on your testing NIC. As we do a lot of things with SACK,
> cache misses etc. might affect your performance.
>
> In the default setting, I disabled TCP segmentation offload and set SMP
> affinity to CPU 0. Please change "INF" to your interface name and let
> us know the results.

I bound the network interrupts and iperf both to CPU 0, and CPU 0 is idle
most of the time. The results are still the same.

At this throughput level, the SACK processing won't take much CPU.

It is not the interrupt/CPU affinity that causes the difference.

I am believing that it is the ACK reordering that causes the confusion in
the sender, which leads the sender to unnecessarily reduce the CWND or
the REORDERING threshold.

wenji
* Re: A Linux TCP SACK Question
From: Sangtae Ha @ 2008-04-06 22:43 UTC
To: Wenji Wu; +Cc: Ilpo Järvinen, John Heffner, Netdev

When our 40 students did the same lab experiment comparing TCP-SACK and
TCP-NewReno, they came up with similar results. The settings were
identical to your setting (one Linux sender, one Linux receiver, and one
netem machine in between). When we introduced some loss using netem,
TCP-SACK showed a bit better performance, while they had similar
throughput in most cases.

I don't think reorderings happened frequently in your directly connected
networking scenario. Please post your tcpdump file to clear out all
doubts.

Sangtae

On 4/6/08, Wenji Wu <wenji@fnal.gov> wrote:
> I am believing that it is the ACK reordering that causes the confusion
> in the sender, which leads the sender to unnecessarily reduce the CWND
> or the REORDERING threshold.

--
Sangtae Ha, http://www4.ncsu.edu/~sha2
Ph.D. Student, Department of Computer Science,
North Carolina State University, USA
* RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-07 14:56 UTC
To: 'Sangtae Ha'; +Cc: 'Ilpo Järvinen', 'John Heffner', 'Netdev'

> I don't think reorderings happened frequently in your directly
> connected networking scenario. Please post your tcpdump file to clear
> out all doubts.

https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/

Two tcpdump files: one with SACK on, the other with SACK off. The test
configuration is described in my previous emails.

Best,

wenji
* RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-08 6:36 UTC
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

On Mon, 7 Apr 2008, Wenji Wu wrote:

> https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/
>
> Two tcpdump files: one with SACK on, the other with SACK off. The test
> configuration is described in my previous emails.

NewReno never retransmitted anything in them (except at the very end of
the transfer). Probably something related to how tp->reordering behaves,
I suppose...

ijjarvin@pointhope:~/linux/debug$ /usr/sbin/tcpdump -n -r nosack | grep "4888[35] >" | cut -d ' ' -f 7- | cut -d ':' -f 1 | awk '{if ($1 < old) {print $1}; old=$1;}'
reading from file nosack, link-type EN10MB (Ethernet)
1
641080641
ijjarvin@pointhope:~/linux/debug$
ijjarvin@pointhope:~/linux/debug$ /usr/sbin/tcpdump -n -r sack | grep "4888[35] >" | cut -d ' ' -f 7- | cut -d ':' -f 1 | awk '{if ($1 < old) {print $1}; old=$1;}'
reading from file sack, link-type EN10MB (Ethernet)
1
7265
10161
141929
175233
196953
446558881
3542223511
ijjarvin@pointhope:~/linux/debug$

This is probably far-fetched, but could you tell us how you make sure
that the earlier connection's metrics are not affecting the latter
connection?

I.e., that the discovered reordering is not transferred across the flows
(in a CBI-like manner) and thus NewReno has an unfair advantage?

--
 i.
* Re: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-08 12:33 UTC
To: Ilpo Järvinen; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

> NewReno never retransmitted anything in them (except at the very end of
> the transfer). Probably something related to how tp->reordering
> behaves, I suppose...

Yes, the adaptive tp->reordering will play a role here.

> This is probably far-fetched, but could you tell us how you make sure
> that the earlier connection's metrics are not affecting the latter
> connection?
>
> I.e., that the discovered reordering is not transferred across the
> flows (in a CBI-like manner) and thus NewReno has an unfair advantage?

You can reverse the order of the tests with the SACK option on/off. The
results are still the same.

Also, according to the source code, tp->reordering is initialized to
"/proc/sys/net/ipv4/tcp_reordering" (default 3) when a new connection is
established. After that, tp->reordering is controlled by the adaptive
algorithm.

wenji
* Re: RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-08 13:45 UTC
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

On Tue, 8 Apr 2008, Wenji Wu wrote:

> Yes, the adaptive tp->reordering will play a role here.

...What is not clear to me is why NewReno does not go to recovery at
least once near the beginning, or at least why that won't result in a
retransmission.

Which kernel version does this dump come from? 2.6.24 NewReno is crippled
with TSO, as was recently discovered; i.e., it won't mark lost super-skbs
at the head and thus won't retransmit them. Also the 2.6.25-rcs are still
broken (though they'll transmit too much; I'll not go into detail here).
DaveM now has the fix for the 2.6.25-rcs in net-2.6.

> You can reverse the order of the tests with the SACK option on/off. The
> results are still the same.

Ok. I just wanted to make sure so that we don't end up tracing some test
setup issue :-).

> Also, according to the source code, tp->reordering is initialized to
> "/proc/sys/net/ipv4/tcp_reordering" (default 3) when a new connection
> is established.

In addition, in tcp_init_metrics():

	if (dst_metric(dst, RTAX_REORDERING) &&
	    tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
		tcp_disable_fack(tp);
		tp->reordering = dst_metric(dst, RTAX_REORDERING);
	}

> After that, tp->reordering is controlled by the adaptive algorithm.

Yes; however, the algorithm will be vastly different in the two cases.
The NewReno stuff is in tcp_check_reno_reordering() and in one other
place in tcp_try_undo_partial(), but the latter only happens in recovery,
I think. SACK, on the other hand, has a number of call sites for
tcp_update_reordering(); check for yourself.

This might be due to my change which made tcp_check_reno_reordering() be
called earlier than it used to be (to remove a transition state during
which sacked_out contained stale info, including some already
cumulatively ACKed segments). I was quite unsure whether I could safely
do that.

It's not clear to me how your test could cause sacked_out >
packets_out - 1 to occur, though, which is necessary for
tcp_update_reordering() to get called with NewReno. The ACK reordering
should just make the number of duplicate ACKs smaller, because part of
them get discarded as old ones, as a newer cumulative ACK often arrives a
bit "ahead" of its time, making the remaining smaller-sequenced ACKs very
close to no-ops. ...Though I haven't yet done the awk magic to prove that
it won't happen in the non-SACK dump.

--
 i.
* Re: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-08 14:30 UTC
To: Ilpo Järvinen; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

> > Yes, the adaptive tp->reordering will play a role here.
>
> ...What is not clear to me is why NewReno does not go to recovery at
> least once near the beginning, or at least why that won't result in a
> retransmission.

This problem cost me two weeks to debug!

With 3 DupACKs, tcp_ack() calls tcp_fastretrans_alert(), which in turn
calls tcp_xmit_retransmit_queue(). Within tcp_xmit_retransmit_queue(),
there are lines of code that would cause the problem above:

.....................................................................

	/* Forward retransmissions are possible only during Recovery. */
1999	if (icsk->icsk_ca_state != TCP_CA_Recovery)
2000		return;
2001
2002	/* No forward retransmissions in Reno are possible. */
2003	if (tcp_is_reno(tp))
2004		return;

.....................................................................

If you look at "tcp_is_reno", you would see that with SACK off, Reno does
not do the retransmit; it will return!!!

I really do not understand why these two lines of code exist there!!!

Also, this code is still in 2.6.25.

> Which kernel version does this dump come from? 2.6.24 NewReno is
> crippled with TSO, as was recently discovered; i.e., it won't mark lost
> super-skbs at the head and thus won't retransmit them. Also the
> 2.6.25-rcs are still broken (though they'll transmit too much; I'll not
> go into detail here). DaveM now has the fix for the 2.6.25-rcs in
> net-2.6.

The dumped file is from 2.6.24. 2.6.25's is similar.

> > You can reverse the order of the tests with the SACK option on/off.
> > The results are still the same.
>
> Ok. I just wanted to make sure so that we don't end up tracing some
> test setup issue :-).
>
> > Also, according to the source code, tp->reordering is initialized to
> > "/proc/sys/net/ipv4/tcp_reordering" (default 3) when a new connection
> > is established.
>
> In addition, in tcp_init_metrics():
>
>	if (dst_metric(dst, RTAX_REORDERING) &&
>	    tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
>		tcp_disable_fack(tp);
>		tp->reordering = dst_metric(dst, RTAX_REORDERING);
>	}

Good to know this, thanks.

wenji
* Re: RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-08 14:59 UTC
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

On Tue, 8 Apr 2008, Wenji Wu wrote:

> With 3 DupACKs, tcp_ack() calls tcp_fastretrans_alert(), which in turn
> calls tcp_xmit_retransmit_queue().

Yeah. It should.

> Within tcp_xmit_retransmit_queue(), there are lines of code that would
> cause the problem above:
>
>	/* Forward retransmissions are possible only during Recovery. */
> 1999	if (icsk->icsk_ca_state != TCP_CA_Recovery)
> 2000		return;
> 2001
> 2002	/* No forward retransmissions in Reno are possible. */
> 2003	if (tcp_is_reno(tp))
> 2004		return;
>
> If you look at "tcp_is_reno", you would see that with SACK off, Reno
> does not do the retransmit; it will return!!!

Your analysis is missing something important here: there are two loops
there :-). One, for retransmitting assumed-lost segments, is above those
lines you quoted! The other, below, is for segments not marked lost,
similar to what is specified by RFC 3517's Rule 3 for NextSeg(), which
definitely won't apply for NewReno nor should be executed.

> I really do not understand why these two lines of code exist there!!!
>
> Also, this code is still in 2.6.25.

Sure, but there's nothing wrong with them! 2.6.24 is just currently
broken if you have TSO + NewReno, because it won't do the correct lost
marking, which is a necessary preparation step for the loop above that.
Too bad, as I only figured that out one or two days ago, so there's no
fix available yet :-).

> The dumped file is from 2.6.24. 2.6.25's is similar.

It's a bit hard for me to believe, considering what the last weeks of
debugging have revealed about its internals. Have you checked it from the
dumps or from the overall results? A similarity in the latter could be
due to other factors related to the differences in reordering detection
between NewReno and SACK.

> > In addition, in tcp_init_metrics():
> >
> >	if (dst_metric(dst, RTAX_REORDERING) &&
> >	    tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
> >		tcp_disable_fack(tp);
> >		tp->reordering = dst_metric(dst, RTAX_REORDERING);
> >	}
>
> Good to know this, thanks.

...There might be some bug which causes it to get skipped under some
circumstances, though (which I haven't yet remembered to fix). I don't
remember too well anymore; probably some goto caused skipping most of
what's in there.

--
 i.
* Re: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-08 15:27 UTC
To: Ilpo Järvinen; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

> It's a bit hard for me to believe, considering what the last weeks of
> debugging have revealed about its internals. Have you checked it from
> the dumps or from the overall results? A similarity in the latter could
> be due to other factors related to the differences in reordering
> detection between NewReno and SACK.
>
> ...There might be some bug which causes it to get skipped under some
> circumstances, though (which I haven't yet remembered to fix). I don't
> remember too well anymore; probably some goto caused skipping most of
> what's in there.

I will get back to you later and post the tcpdump file for 2.6.25.

wenji
* Re: RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-08 17:26 UTC
To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev'

On Tue, 8 Apr 2008, Wenji Wu wrote:

> I will get back to you later and post the tcpdump file for 2.6.25.

Please, if possible, use a kernel version that includes my TCP fixes
applied today; i.e., at least DaveM's net-2.6 already has them. I didn't
check whether Linus has pulled them in yet.

--
 i.
* RE: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-14 22:47 UTC
To: 'Ilpo Järvinen'; +Cc: 'Netdev'

Hi, Ilpo,

Could the throughput difference with SACK on/off be due to the following
code in tcp_ack()?

3120	if (tcp_ack_is_dubious(sk, flag)) {
3121		/* Advance CWND, if state allows this. */
3122		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
3123		    tcp_may_raise_cwnd(sk, flag))
3124			tcp_cong_avoid(sk, ack, prior_in_flight, 0);
3125		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out, flag);
3126	} else {
3127		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
3128			tcp_cong_avoid(sk, ack, prior_in_flight, 1);
3129	}

In my tests there are actually no packet drops, just severe packet
reordering in both the forward and reverse paths. With good
tcp_reordering auto-tuning, there are few retransmissions.

(1) With the SACK option off, the reordered ACKs will not cause much harm
to the throughput. As you have pointed out in your email, "The ACK
reordering should just make the number of duplicate ACKs smaller, because
part of them get discarded as old ones, as a newer cumulative ACK often
arrives a bit 'ahead' of its time, making the remaining smaller-sequenced
ACKs very close to no-ops."

If there is any ACK advancement, tcp_cong_avoid() will be called.

(2) With the SACK option on: if the ACKs do not advance the left edge of
the window, those ACKs will go to "old_ack" in tcp_ack() -- not much
processing except SACK-tagging the corresponding packets in the
retransmission queue. tcp_cong_avoid() will not be called.

However, if the ACKs advance the left edge of the window and these ACKs
include SACK options, tcp_ack_is_dubious(sk, flag) will be true. Then the
call to tcp_cong_avoid() needs to satisfy the if-condition at line 3122,
which is stricter than the if-condition at line 3127.

So the congestion window with SACK on would be smaller than with SACK
off.

If you run tcptrace and xplot on the files I posted, you will see that
lots of ACKs advance the left edge of the window and include SACK blocks.

Not quite sure, just a guess.

wenji
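For context, the test that gates the two branches quoted above looks
roughly like the following in kernels of that era. This is a paraphrased
sketch of net/ipv4/tcp_input.c rather than a verbatim quote; the flag and
state names are the real ones, but the exact code should be verified
against the tree in use:

	/* Sketch of the gate for the two branches above: an ACK is
	 * "dubious" if it is a duplicate, if it carries a congestion
	 * alert (e.g. new SACK or ECE information), or if it arrives
	 * while the connection is not in the CA_Open state.
	 */
	static inline int tcp_ack_is_dubious(const struct sock *sk, const int flag)
	{
		return (!(flag & FLAG_NOT_DUP) || (flag & FLAG_CA_ALERT) ||
			inet_csk(sk)->icsk_ca_state != TCP_CA_Open);
	}

With heavy ACK reordering, the CA_Disorder state and freshly arriving
SACK blocks both make this test fire, which is why the SACK flow keeps
taking the stricter line-3122 path in the discussion above.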
* Re: RE: A Linux TCP SACK Question
From: John Heffner @ 2008-04-15 0:48 UTC
To: wenji; +Cc: Ilpo Järvinen, Netdev

On Mon, Apr 14, 2008 at 3:47 PM, Wenji Wu <wenji@fnal.gov> wrote:
> However, if the ACKs advance the left edge of the window and these ACKs
> include SACK options, tcp_ack_is_dubious(sk, flag) will be true. Then
> the call to tcp_cong_avoid() needs to satisfy the if-condition at line
> 3122, which is stricter than the if-condition at line 3127.
>
> So the congestion window with SACK on would be smaller than with SACK
> off.
>
> Not quite sure, just a guess.

I had considered this, but it would seem that tcp_may_raise_cwnd() in
this case *should* return true, right?

Still the mystery remains as to why *both* are going so slowly. You
mentioned you're using a web100 kernel. What are the final values of all
the variables for the connections (grab with readall)?

Thanks,
  -John
* Re: RE: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-15 8:25 UTC
To: John Heffner, wenji; +Cc: Netdev

On Mon, 14 Apr 2008, John Heffner wrote:

> On Mon, Apr 14, 2008 at 3:47 PM, Wenji Wu <wenji@fnal.gov> wrote:
> >
> > Could the throughput difference with SACK on/off be due to the
> > following code in tcp_ack()?
> >
> > 3120	if (tcp_ack_is_dubious(sk, flag)) {
> > 3121		/* Advance CWND, if state allows this. */
> > 3122		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
> > 3123		    tcp_may_raise_cwnd(sk, flag))
> > 3124			tcp_cong_avoid(sk, ack, prior_in_flight, 0);
> > 3125		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out, flag);
> > 3126	} else {
> > 3127		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
> > 3128			tcp_cong_avoid(sk, ack, prior_in_flight, 1);
> > 3129	}
> >
> > In my tests there are actually no packet drops, just severe packet
> > reordering in both the forward and reverse paths. With good
> > tcp_reordering auto-tuning, there are few retransmissions.
> >
> > (1) With the SACK option off, the reordered ACKs will not cause much
> > harm to the throughput. As you have pointed out in your email, "The
> > ACK reordering should just make the number of duplicate ACKs smaller,
> > because part of them get discarded as old ones, as a newer cumulative
> > ACK often arrives a bit 'ahead' of its time, making the remaining
> > smaller-sequenced ACKs very close to no-ops."

...Please note that these are considered old ACKs, so we do goto old_ack,
which is equal for both SACK and NewReno. ...So it won't make any
difference between them.

> > If there is any ACK advancement, tcp_cong_avoid() will be called.

The NewReno case analysis is not exactly what you assume: if there was at
least one duplicate ACK already, the ca_state will be CA_Disorder for
NewReno, which makes ack_is_dubious true. You probably assumed it goes to
the other branch directly?

> > (2) With the SACK option on: if the ACKs do not advance the left edge
> > of the window, those ACKs will go to "old_ack" in tcp_ack() -- not
> > much processing except SACK-tagging the corresponding packets in the
> > retransmission queue. tcp_cong_avoid() will not be called.

No, this is not right. The old_ack case happens only if the left edge
backtracks, in which case we obviously should discard, as it's stale
information (except that the SACK may reveal something not yet known,
which is why sacktag is called there). The same applies regardless of
SACK (no tagging, of course).

...Hmm, there's one questionable part here in the code (I doubt it makes
any difference here, though). If new SACK info is discovered, we don't
retransmit but send new data (if the window allows), even when in
recovery, where TCP should retransmit first.

> > However, if the ACKs advance the left edge of the window and these
> > ACKs include SACK options, tcp_ack_is_dubious(sk, flag) will be true.
> > Then the call to tcp_cong_avoid() needs to satisfy the if-condition
> > at line 3122, which is stricter than the if-condition at line 3127.
> >
> > So the congestion window with SACK on would be smaller than with SACK
> > off.

I think you might have found a bug, though it won't affect you; it
actually makes that check easier to pass. The questionable thing is the
|| in tcp_may_raise_cwnd() (it might not be intentional)...

But in your case, during the initial slow start, the condition in
tcp_may_raise_cwnd() will always be true (if your metrics are cleared as
they should be). Because: (...not important || 1) && 1, since cwnd <
ssthresh. After that, when you don't have ECE nor are in recovery,
tcp_may_raise_cwnd() evaluates to: (1 || ...not calculated) && 1, so it
should always allow the increment in your case, except when in recovery,
which hardly makes up for the difference you're seeing...

> > If you run tcptrace and xplot on the files I posted, you will see
> > that lots of ACKs advance the left edge of the window and include
> > SACK blocks.

This only makes a difference if any of those SACK blocks are new. If
they're not, DATA_SACKED_ACKED won't be set in flag.

> > Not quite sure, just a guess.

You seem to be missing the third case, which I tried to point out
earlier: the case where the left edge remains the same. I think it makes
a huge difference here (I'll analyse the non-recovery case here):

NewReno always goes to fastretrans_alert(), to the default branch, and
because it's is_dupack, it increments sacked_out through
tcp_add_reno_sack(). Effectively, packets_in_flight is reduced by one and
TCP is able to send a new segment out.

Now with SACK there are two cases:

SACK with newly discovered SACK info (for simplicity, let's assume just
one newly discovered SACKed segment): sacktag marks that segment and
increments sacked_out, effectively making packets_in_flight equal to the
NewReno case. It goes to fastretrans_alert() and makes all the same
maneuvers as NewReno (except if enough SACK blocks have arrived to
trigger recovery while NewReno would not have enough dupACKs collected; I
doubt that this makes the difference, though. I'll need logs without
metrics to verify that the number of recoveries is quite small).

SACK with no new SACK info: sacktag won't find anything to mark, so
sacked_out remains the same. It goes to fastretrans_alert() because
ca_state is CA_Disorder. But now we did lose one segment compared with
NewReno, because we didn't increment sacked_out, making packets_in_flight
stay at the amount it was before. Thus we cannot send a new data segment
out, and we fall behind NewReno.

> I had considered this, but it would seem that tcp_may_raise_cwnd() in
> this case *should* return true, right?

Yes, it seems so. Though I think that it's unintentional; I'd say that
that || should be &&, but I might be wrong.

> Still the mystery remains as to why *both* are going so slowly. You
> mentioned you're using a web100 kernel. What are the final values of
> all the variables for the connections (grab with readall)?

...I think that, due to reordering, one will lose part of the cwnd
increments because of old ACKs, as they won't allow you to add more
segments to the network; at some point the lossage will be large enough
to stall the growth of the cwnd (if in congestion avoidance with the
small increment). With slow start it seems not that self-evident that
such a level exists, though it might.

--
 i.
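For readers following the evaluation above, the two helpers at the center
of the analysis look roughly like this in kernels of that era. These are
paraphrased sketches of include/net/tcp.h and net/ipv4/tcp_input.c,
reconstructed from memory rather than quoted verbatim; verify against the
exact tree in use:

	/* The check with the questionable ||: raise cwnd only if there
	 * is no ECE (or we are still in slow start), and we are in
	 * neither the Recovery nor the CWR state.
	 */
	static inline int tcp_may_raise_cwnd(const struct sock *sk, const int flag)
	{
		const struct tcp_sock *tp = tcp_sk(sk);
		return (!(flag & FLAG_ECE) || tp->snd_cwnd < tp->snd_ssthresh) &&
			!((1 << inet_csk(sk)->icsk_ca_state) &
			  (TCPF_CA_Recovery | TCPF_CA_CWR));
	}

	/* Why incrementing sacked_out lets another segment out: SACKed
	 * and lost segments are subtracted from the in-flight estimate,
	 * while retransmitted ones are added back.
	 */
	static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp)
	{
		return tp->packets_out - (tp->sacked_out + tp->lost_out) +
			tp->retrans_out;
	}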
* Re: RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-15 18:01 UTC
To: Ilpo Järvinen; +Cc: John Heffner, Netdev

> No, this is not right. The old_ack case happens only if the left edge
> backtracks, in which case we obviously should discard, as it's stale
> information (except that the SACK may reveal something not yet known,
> which is why sacktag is called there). The same applies regardless of
> SACK (no tagging, of course).

Yes, I mis-stated myself in the last email. What I meant is the left-edge
backtrack case, as you have pointed out.

> I think you might have found a bug, though it won't affect you; it
> actually makes that check easier to pass. The questionable thing is the
> || in tcp_may_raise_cwnd() (it might not be intentional)...
>
> But in your case, during the initial slow start, the condition in
> tcp_may_raise_cwnd() will always be true (if your metrics are cleared
> as they should be). Because: (...not important || 1) && 1, since cwnd <
> ssthresh. After that, when you don't have ECE nor are in recovery,
> tcp_may_raise_cwnd() evaluates to: (1 || ...not calculated) && 1, so it
> should always allow the increment in your case, except when in
> recovery, which hardly makes up for the difference you're seeing...

You are right. I just printed out the return value of
tcp_may_raise_cwnd(). It is always one!

> This only makes a difference if any of those SACK blocks are new. If
> they're not, DATA_SACKED_ACKED won't be set in flag.
>
> You seem to be missing the third case, which I tried to point out
> earlier: the case where the left edge remains the same. I think it
> makes a huge difference here (I'll analyse the non-recovery case here):
>
> NewReno always goes to fastretrans_alert(), to the default branch, and
> because it's is_dupack, it increments sacked_out through
> tcp_add_reno_sack(). Effectively, packets_in_flight is reduced by one
> and TCP is able to send a new segment out.
>
> Now with SACK there are two cases:
>
> SACK with newly discovered SACK info (for simplicity, let's assume just
> one newly discovered SACKed segment): sacktag marks that segment and
> increments sacked_out, effectively making packets_in_flight equal to
> the NewReno case. It goes to fastretrans_alert() and makes all the same
> maneuvers as NewReno.
>
> SACK with no new SACK info: sacktag won't find anything to mark, so
> sacked_out remains the same. It goes to fastretrans_alert() because
> ca_state is CA_Disorder. But now we did lose one segment compared with
> NewReno, because we didn't increment sacked_out, making
> packets_in_flight stay at the amount it was before. Thus we cannot send
> a new data segment out, and we fall behind NewReno.

Agree with you. Thanks. You have given me a good class on the Linux
ACK/SACK implementation. Thank you.

> > I had considered this, but it would seem that tcp_may_raise_cwnd() in
> > this case *should* return true, right?
>
> Yes, it seems so. Though I think that it's unintentional; I'd say that
> that || should be &&, but I might be wrong.

Yes, it is always true!

wenji
* Re: RE: A Linux TCP SACK Question
From: John Heffner @ 2008-04-15 22:40 UTC
To: Wenji Wu; +Cc: Ilpo Järvinen, Netdev

Wenji, can you try this out?  Patch against net-2.6.26.

Thanks,
  -John

[-- Attachment #2: 0001-Increase-the-max_burst-threshold-from-3-to-tp-reord.patch --]

From 4cb2a9fd1d497b02bfdd06f71b499d441ca10aee Mon Sep 17 00:00:00 2001
From: John Heffner <johnwheffner@gmail.com>
Date: Tue, 15 Apr 2008 15:26:39 -0700
Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.

This change is necessary to allow cwnd to grow during persistent
reordering.  Cwnd moderation is applied when in the disorder state
and an ack that fills the hole comes in.  If the hole was greater
than 3 packets, but less than tp->reordering, cwnd will shrink when
it should not have.

Signed-off-by: John Heffner <jheffner@napa.(none)>
---
 include/net/tcp.h |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2c14edf..633147c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -787,11 +787,14 @@ extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
 extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst);
 
 /* Slow start with delack produces 3 packets of burst, so that
- * it is safe "de facto".
+ * it is safe "de facto". This will be the default - same as
+ * the default reordering threshold - but if reordering increases,
+ * we must be able to allow cwnd to burst at least this much in order
+ * to not pull it back when holes are filled.
  */
 static __inline__ __u32 tcp_max_burst(const struct tcp_sock *tp)
 {
-	return 3;
+	return tp->reordering;
 }
 
 /* Returns end sequence number of the receiver's advertised window */
--
1.5.2.5
* Re: A Linux TCP SACK Question
From: David Miller @ 2008-04-16 8:27 UTC
To: johnwheffner; +Cc: wenji, ilpo.jarvinen, netdev

From: "John Heffner" <johnwheffner@gmail.com>
Date: Tue, 15 Apr 2008 15:40:05 -0700

> Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.
>
> This change is necessary to allow cwnd to grow during persistent
> reordering.  Cwnd moderation is applied when in the disorder state
> and an ack that fills the hole comes in.  If the hole was greater
> than 3 packets, but less than tp->reordering, cwnd will shrink when
> it should not have.
>
> Signed-off-by: John Heffner <jheffner@napa.(none)>

I think this patch is correct, or at least more correct than what this
code is doing right now.

Any objections to my adding this to net-2.6.26?
* Re: A Linux TCP SACK Question
From: Ilpo Järvinen @ 2008-04-16 9:21 UTC
To: David Miller; +Cc: johnwheffner, wenji, Netdev

On Wed, 16 Apr 2008, David Miller wrote:

> From: "John Heffner" <johnwheffner@gmail.com>
> Date: Tue, 15 Apr 2008 15:40:05 -0700
>
> > Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering.
> >
> > This change is necessary to allow cwnd to grow during persistent
> > reordering.  Cwnd moderation is applied when in the disorder state
> > and an ack that fills the hole comes in.  If the hole was greater
> > than 3 packets, but less than tp->reordering, cwnd will shrink when
> > it should not have.
>
> I think this patch is correct, or at least more correct than what this
> code is doing right now.
>
> Any objections to my adding this to net-2.6.26?

I don't have objections.

But I want to note that tp->reordering does not consider the situation on
that specific ACK, because its value might originate a number of segments
and even RTTs back. I think it could be possible to find a more
appropriate value for max_burst locally to an ACK. ...Though it might be
a bit of an over-engineered solution. For SACK we calculate a similar
metric anyway in tcp_clean_rtx_queue() to find whether tp->reordering
needs to be updated at a cumulative ACK, and for NewReno
min(tp->sacked_out, tp->reordering) + 3 could perhaps be used (I'm not
sure whether these would be foolproof in recovery, though).

--
 i.
* Re: A Linux TCP SACK Question
From: David Miller @ 2008-04-16 9:35 UTC
To: ilpo.jarvinen; +Cc: johnwheffner, wenji, netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 16 Apr 2008 12:21:38 +0300 (EEST)

> But I want to note that tp->reordering does not consider the situation
> on that specific ACK, because its value might originate a number of
> segments and even RTTs back. I think it could be possible to find a
> more appropriate value for max_burst locally to an ACK. ...Though it
> might be a bit of an over-engineered solution. For SACK we calculate a
> similar metric anyway in tcp_clean_rtx_queue() to find whether
> tp->reordering needs to be updated at a cumulative ACK, and for NewReno
> min(tp->sacked_out, tp->reordering) + 3 could perhaps be used (I'm not
> sure whether these would be foolproof in recovery, though).

Right, we can tweak this thing further later.

*beep* *beep* I've added John's patch to net-2.6.26
* RE: A Linux TCP SACK Question
From: Wenji Wu @ 2008-04-16 14:50 UTC
To: 'David Miller', ilpo.jarvinen; +Cc: johnwheffner, netdev

> Right, we can tweak this thing further later.
>
> *beep* *beep* I've added John's patch to net-2.6.26

I just tried it with John's patch. It works, saturating the 1 Gbps link
in my test.

Without the patch, the throughput is around 180 Mbps with SACK on and
250 Mbps with SACK off.

The same test environment is described in my previous emails.

wenji
* Re: A Linux TCP SACK Question
From: David Miller @ 2008-04-18 6:52 UTC
To: wenji; +Cc: ilpo.jarvinen, johnwheffner, netdev

From: Wenji Wu <wenji@fnal.gov>
Date: Wed, 16 Apr 2008 09:50:19 -0500

> > I've added John's patch to net-2.6.26
>
> I just tried it with John's patch. It works, saturating the 1 Gbps
> link in my test.
>
> Without the patch, the throughput is around 180 Mbps with SACK on and
> 250 Mbps with SACK off.
>
> The same test environment is described in my previous emails.

After this patch cooks for a couple more days I'll submit it to -stable.

Thanks for your report and all of your testing, Wenji.

Thanks, John, for the patch.
* about Linux adaptively adjusting ssthresh
From: Wenji Wu @ 2008-08-27 14:38 UTC
To: 'David Miller', ilpo.jarvinen; +Cc: johnwheffner, netdev

Hi, all,

Could anybody help me out with Linux adaptively adjusting ssthresh?
Thanks in advance.

I understand that the latest Linux is able to adaptively adjust ssthresh
to avoid retransmission. Could anybody tell me which algorithms have been
implemented for the adaptive ssthresh adjustment?

Thanks,

wenji
* Re: about Linux adaptively adjusting ssthresh
From: John Heffner @ 2008-08-27 22:48 UTC
To: wenji; +Cc: David Miller, ilpo.jarvinen, netdev

On Wed, Aug 27, 2008 at 7:38 AM, Wenji Wu <wenji@fnal.gov> wrote:
> I understand that the latest Linux is able to adaptively adjust
> ssthresh to avoid retransmission. Could anybody tell me which
> algorithms have been implemented for the adaptive ssthresh adjustment?

A little more detail would be helpful. Are you referring to caching
ssthresh between connections, or to something going on during a
connection? Various congestion control modules use ssthresh differently,
so a comprehensive answer would be difficult.

  -John
* Re: about Linux adaptively adjusting ssthresh
From: Wenji Wu @ 2008-08-28 0:53 UTC
To: John Heffner; +Cc: David Miller, ilpo.jarvinen, netdev

> A little more detail would be helpful. Are you referring to caching
> ssthresh between connections, or to something going on during a
> connection? Various congestion control modules use ssthresh
> differently, so a comprehensive answer would be difficult.

Thanks, John. I am referring to the adaptive ssthresh adjustment during a
connection.

thanks,

wenji
* Re: about Linux adaptively adjusting ssthresh
From: Ilpo Järvinen @ 2008-08-28 6:34 UTC
To: Wenji Wu; +Cc: John Heffner, David Miller, Netdev

On Wed, 27 Aug 2008, Wenji Wu wrote:

> Thanks, John. I am referring to the adaptive ssthresh adjustment during
> a connection.

??? Every now and then (once we detect some losses), snd_ssthresh is set
to a halved flight size, as given by, well, you know, those standards
that say something about it :-). So I (like John) seem to somewhat miss
the point of your question here.

Or did you perhaps refer to rcv_ssthresh (which I wouldn't ever call
plain "ssthresh")?

--
 i.
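For concreteness, the standard halving referred to here is the Reno
ssthresh hook. In kernels of that era it looks roughly like the following
paraphrased sketch of net/ipv4/tcp_cong.c; individual congestion control
modules (BIC, CUBIC, etc.) can override this hook with their own rule:

	/* Sketch of the standard "halve on loss" rule: when a loss is
	 * detected, ssthresh drops to half the current window, but never
	 * below two segments.
	 */
	u32 tcp_reno_ssthresh(struct sock *sk)
	{
		const struct tcp_sock *tp = tcp_sk(sk);

		return max(tp->snd_cwnd >> 1U, 2U);
	}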
* about Linux adaptively adjusting dupthresh 2008-08-28 6:34 ` Ilpo Järvinen @ 2008-08-28 14:20 ` Wenji Wu 2008-08-28 18:53 ` Ilpo Järvinen 0 siblings, 1 reply; 56+ messages in thread From: Wenji Wu @ 2008-08-28 14:20 UTC (permalink / raw) To: 'Ilpo Järvinen' Cc: 'John Heffner', 'David Miller', 'Netdev' Sorry, I made a mistake in the last post; what I meant was "algorithms that adaptively adjust the TCP reordering threshold dupthresh". I understand that the "Eifel algorithm" or "DSACK TCP" will adaptively adjust dupthresh to deal with packet reordering. Are there any other reordering-tolerant algorithms implemented in Linux? Thanks, wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: about Linux adaptively adjusting dupthresh 2008-08-28 14:20 ` about Linux adaptively adjusting dupthresh Wenji Wu @ 2008-08-28 18:53 ` Ilpo Järvinen 2008-08-28 19:30 ` Wenji Wu 0 siblings, 1 reply; 56+ messages in thread From: Ilpo Järvinen @ 2008-08-28 18:53 UTC (permalink / raw) To: Wenji Wu; +Cc: 'John Heffner', 'David Miller', 'Netdev' On Thu, 28 Aug 2008, Wenji Wu wrote: > Sorry, I made a mistake in the last post; what I meant was "algorithms > that adaptively adjust the TCP reordering threshold dupthresh". Ah, that makes much more sense. :-) > I understand that the "Eifel algorithm" or "DSACK TCP" will adaptively adjust > dupthresh to deal with packet reordering. Are there any other > reordering-tolerant algorithms implemented in Linux? First, about adaptive dupthresh: In addition to DSACK, we use cumulative ACKs of never-retransmitted blocks to increase the dupthresh (see tcp_clean_rtx_queue). Then there's some newreno thing when dupacks > packets_out, but I've never fully figured out whether it does the correct thing with the + tp->packets_out beyond the simplest case (see tcp_check_reno_reordering). I don't think that Eifel adjusts dupthresh, though it can remove the ambiguity problem and thus we can use the never-retransmitted-block-acked detection more often. Also, there's some added logic for the small-window case to reduce dupthresh temporarily (at the smallest to 3, or whatever the default is) if the window is not large enough to generate the incremented number of duplicate ACKs (see tcp_time_to_recover). Again, I'm not too sure what you mean by "reordering tolerant", but here are some things that may be related: FACK -> RFC3517 auto-fallback if reordering is detected (basically, holes are only counted with FACK in the more-than-dupthresh check). I guess Eifel-like timestamp checking belongs to this category (in tcp_try_undo_partial). If a latency spike + reordering occurs, SACK FRTO might help, but I think it depends on the scenario. -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
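For readers following the function names above: the common entry point for most of these dupthresh increases is tcp_update_reordering() in net/ipv4/tcp_input.c. A simplified sketch from memory (the MIB accounting and several details are trimmed, so treat it as illustrative only), showing the ceiling and the FACK fallback Ilpo mentions:

    static void tcp_update_reordering(struct sock *sk, const int metric,
                                      const int ts)
    {
            struct tcp_sock *tp = tcp_sk(sk);

            if (metric > tp->reordering) {
                    /* raise dupthresh, capped at TCP_MAX_REORDERING (127) */
                    tp->reordering = min(TCP_MAX_REORDERING, metric);

                    /* FACK assumes SACKs arrive in order, so it is
                     * disabled once reordering has been observed */
                    tcp_disable_fack(tp);
            }
    }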
* RE: about Linux adaptively adjusting dupthresh 2008-08-28 18:53 ` Ilpo Järvinen @ 2008-08-28 19:30 ` Wenji Wu 0 siblings, 0 replies; 56+ messages in thread From: Wenji Wu @ 2008-08-28 19:30 UTC (permalink / raw) To: 'Ilpo Järvinen' Cc: 'John Heffner', 'David Miller', 'Netdev' Thanks, ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: A Linux TCP SACK Question 2008-04-16 9:21 ` Ilpo Järvinen 2008-04-16 9:35 ` David Miller @ 2008-04-16 14:40 ` John Heffner 2008-04-16 15:03 ` Ilpo Järvinen 1 sibling, 1 reply; 56+ messages in thread From: John Heffner @ 2008-04-16 14:40 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: David Miller, wenji, Netdev On Wed, Apr 16, 2008 at 2:21 AM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > > On Wed, 16 Apr 2008, David Miller wrote: > > > From: "John Heffner" <johnwheffner@gmail.com> > > Date: Tue, 15 Apr 2008 15:40:05 -0700 > > > > > Subject: [PATCH] Increase the max_burst threshold from 3 to tp->reordering. > > > > > > This change is necessary to allow cwnd to grow during persistent > > > reordering. Cwnd moderation is applied when in the disorder state > > > and an ack that fills the hole comes in. If the hole was greater > > > than 3 packets, but less than tp->reordering, cwnd will shrink when > > > it should not have. > > > > > > Signed-off-by: John Heffner <jheffner@napa.(none)> > > > > I think this patch is correct, or at least more correct than what > > this code is doing right now. > > > > Any objections to my adding this to net-2.6.26? > > I don't have objections. > > But I want to note that tp->reordering does not consider the situation on > that specific ACK, because its value might originate from a number of segments > and even RTTs back. I think it could be possible to find a more > appropriate value for max_burst locally to an ACK. ...Though it might be a > bit of an over-engineered solution. For SACK we calculate a similar metric anyway > in tcp_clean_rtx_queue to find out if tp->reordering needs to be updated at a > cumulative ACK, and for NewReno min(tp->sacked_out, tp->reordering) + 3 > could perhaps be used (I'm not sure if these would be foolproof in > recovery though). Reordering is generally a random process resulting from a packet traversing parallel queues. (In the case of netem, the random process is explicitly defined by simulation.) As reordering is created by packets sitting in queues, these queues *should* be able to absorb a burst of at least the reordering size. That's at least my justification for using the reordering threshold as max_burst, along with the fact that it should prevent cwnd from getting clamped. Anyway, max_burst isn't a standard. TCP makes no guarantees that it won't burst a full window. If anything, I actually think that in most cases we'd be better off without it. It's harmful to high-BDP flows because it pulls down cwnd, which has a long-term effect in response to a short-term event. -John ^ permalink raw reply [flat|nested] 56+ messages in thread
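As best as the patch can be reconstructed here (a sketch of the change, not the verified commit text), it turns the hard-coded burst allowance in include/net/tcp.h into the adaptive reordering metric:

    /* include/net/tcp.h: burst allowance used by cwnd moderation */
    static inline __u32 tcp_max_burst(const struct tcp_sock *tp)
    {
            return tp->reordering;  /* previously a constant 3 */
    }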
* Re: A Linux TCP SACK Question 2008-04-16 14:40 ` A Linux TCP SACK Question John Heffner @ 2008-04-16 15:03 ` Ilpo Järvinen 0 siblings, 0 replies; 56+ messages in thread From: Ilpo Järvinen @ 2008-04-16 15:03 UTC (permalink / raw) To: John Heffner; +Cc: David Miller, wenji, Netdev On Wed, 16 Apr 2008, John Heffner wrote: > On Wed, Apr 16, 2008 at 2:21 AM, Ilpo Järvinen > <ilpo.jarvinen@helsinki.fi> wrote: > > > > But I want to note that tp->reordering does not consider the situation on > > that specific ACK, because its value might originate from a number of segments > > and even RTTs back. I think it could be possible to find a more > > appropriate value for max_burst locally to an ACK. ...Though it might be a > > bit of an over-engineered solution. For SACK we calculate a similar metric anyway > > in tcp_clean_rtx_queue to find out if tp->reordering needs to be updated at a > > cumulative ACK, and for NewReno min(tp->sacked_out, tp->reordering) + 3 > > could perhaps be used (I'm not sure if these would be foolproof in > > recovery though). > > Reordering is generally a random process resulting from a packet > traversing parallel queues. (In the case of netem, the random process > is explicitly defined by simulation.) As reordering is created by > packets sitting in queues, these queues *should* be able to absorb a > burst of at least the reordering size. That's at least my > justification for using the reordering threshold as max_burst, along > with the fact that it should prevent cwnd from getting clamped. Sure, but combined with other phenomena such as ACK compression (and an appropriate ACK pattern & preceding TCP state), one might end up generating much larger bursts than just tp->reordering. Though it's probably not any worse than what ACK compression can already cause, e.g. after a spurious RTO. And one is quite guaranteed to run out of something else too before things get too nasty. -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: RE: A Linux TCP SACK Question 2008-04-15 22:40 ` John Heffner 2008-04-16 8:27 ` David Miller @ 2008-04-16 14:46 ` Wenji Wu 1 sibling, 0 replies; 56+ messages in thread From: Wenji Wu @ 2008-04-16 14:46 UTC (permalink / raw) To: 'John Heffner'; +Cc: 'Ilpo Järvinen', 'Netdev' >Wenji, can you try this out? Patch against net-2.6.26. I just tried the new patch. It works, saturating the 1Gbps link. The experiment works as: Sender --- Router --- Receiver Iperf is sending from the sender to the receiver. In between there is an emulated router which runs netem. The emulated router has two interfaces, both with netem configured. One interface emulates the forward path and the other the reverse path. Both netem interfaces are configured with 1.5ms delay and 0.15ms variance. No packet drops. Kernel: 2.6.25-rc9 patched with the file you provided. Thanks, wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-15 0:48 ` John Heffner 2008-04-15 8:25 ` Ilpo Järvinen @ 2008-04-15 15:45 ` Wenji Wu 2008-04-15 16:39 ` Wenji Wu 2 siblings, 0 replies; 56+ messages in thread From: Wenji Wu @ 2008-04-15 15:45 UTC (permalink / raw) To: John Heffner; +Cc: Ilpo Järvinen, Netdev > Still the mystery remains as to why *both* are going so slowly. You > mentioned you're using a web100 kernel. What are the final values of > all the variables for the connections (grab with readall)? Kernel 2.6.24, "echo 1 > /proc/sys/net/ipv4/tcp_no_metrics_save" With SACK off: Throughput 256Mbps Connection 6 (198.2.1.2 38054 131.225.2.16 5001) State 1 SACKEnabled 0 TimestampsEnabled 1 NagleEnabled 1 ECNEnabled 0 SndWinScale 11 RcvWinScale 7 ActiveOpen 1 MSSRcvd 0 WinScaleRcvd 11 WinScaleSent 7 PktsOut 221715 DataPktsOut 221715 DataBytesOut 324429992 PktsIn 215245 DataPktsIn 0 DataBytesIn 0 SndUna 2784091744 SndNxt 2784091744 SndMax 2784091744 ThruBytesAcked 321011738 SndISS 2463080006 RcvNxt 1309516114 ThruBytesReceived 0 RecvISS 1309516114 StartTimeSec 1208273537 StartTimeUsec 293029 Duration 14594853 SndLimTransSender 6 SndLimBytesSender 23960 SndLimTimeSender 4137 SndLimTransCwnd 5 SndLimBytesCwnd 324406032 SndLimTimeCwnd 10046308 SndLimTransRwin 0 SndLimBytesRwin 0 SndLimTimeRwin 0 SlowStart 0 CongAvoid 0 CongestionSignals 4 OtherReductions 13167 X_OtherReductionsCV 0 X_OtherReductionsCM 13167 CongestionOverCount 54 CurCwnd 4344 MaxCwnd 173760 CurSsthresh 94894680 LimCwnd 4294965848 MaxSsthresh 94894680 MinSsthresh 4344 FastRetran 4 Timeouts 0 SubsequentTimeouts 0 CurTimeoutCount 0 AbruptTimeouts 0 PktsRetrans 17 BytesRetrans 24616 DupAcksIn 59556 SACKsRcvd 0 SACKBlocksRcvd 0 PreCongSumCwnd 375032 PreCongSumRTT 12 PostCongSumRTT 15 PostCongCountRTT 4 ECERcvd 0 SendStall 0 QuenchRcvd 0 RetranThresh 29 NonRecovDA 0 AckAfterFR 0 DSACKDups 0 SampleRTT 3 SmoothedRTT 3 RTTVar 50 MaxRTT 46 MinRTT 2 SumRTT 158191 CountRTT 47830 CurRTO 203 MaxRTO 237 MinRTO 203 CurMSS 1448 MaxMSS 1448 MinMSS 524 X_Sndbuf 1919232 X_Rcvbuf 87380 CurRetxQueue 0 MaxRetxQueue 0 CurAppWQueue 1786832 MaxAppWQueue 1886744 CurRwinSent 5888 MaxRwinSent 5888 MinRwinSent 5840 LimRwin 0 DupAcksOut 0 CurReasmQueue 0 MaxReasmQueue 0 CurAppRQueue 0 MaxAppRQueue 0 X_rcv_ssthresh 5840 X_wnd_clamp 64087 X_dbg1 5888 X_dbg2 536 X_dbg3 5840 X_dbg4 0 CurRwinRcvd 3137536 MaxRwinRcvd 3137536 MinRwinRcvd 17896 LocalAddressType 1 LocalAddress 198.2.1.2 LocalPort 38054 RemAddress 131.225.2.16 RemPort 5001 X_RcvRTT 0 ...............................................................
With SACK On Throughput: 178Mbps Connection 3 (131.225.2.22 22 131.225.82.152 52973) State 5 SACKEnabled 3 TimestampsEnabled 1 NagleEnabled 0 ECNEnabled 0 SndWinScale 11 RcvWinScale 7 ActiveOpen 0 MSSRcvd 0 WinScaleRcvd 11 WinScaleSent 7 PktsOut 230 DataPktsOut 230 DataBytesOut 25783 PktsIn 353 DataPktsIn 164 DataBytesIn 11120 SndUna 2809669838 SndNxt 2809669838 SndMax 2809669838 ThruBytesAcked 18423 SndISS 2809651415 RcvNxt 2817947310 ThruBytesReceived 11120 RecvISS 2817936190 StartTimeSec 1208271915 StartTimeUsec 71844 Duration 2362591841 SndLimTransSender 6 SndLimBytesSender 25783 SndLimTimeSender 2273927770 SndLimTransCwnd 5 SndLimBytesCwnd 0 SndLimTimeCwnd 1047 SndLimTransRwin 0 SndLimBytesRwin 0 SndLimTimeRwin 0 SlowStart 0 CongAvoid 0 CongestionSignals 0 OtherReductions 0 X_OtherReductionsCV 0 X_OtherReductionsCM 0 CongestionOverCount 0 CurCwnd 5792 MaxCwnd 13032 CurSsthresh 4294966376 LimCwnd 4294965848 MaxSsthresh 0 MinSsthresh 4294967295 FastRetran 0 Timeouts 0 SubsequentTimeouts 0 CurTimeoutCount 0 AbruptTimeouts 0 PktsRetrans 0 BytesRetrans 0 DupAcksIn 0 SACKsRcvd 0 SACKBlocksRcvd 0 PreCongSumCwnd 0 PreCongSumRTT 0 PostCongSumRTT 0 PostCongCountRTT 0 ECERcvd 0 SendStall 0 QuenchRcvd 0 RetranThresh 3 NonRecovDA 0 AckAfterFR 0 DSACKDups 0 SampleRTT 0 SmoothedRTT 3 RTTVar 50 MaxRTT 40 MinRTT 0 SumRTT 1269 CountRTT 221 CurRTO 203 MaxRTO 234 MinRTO 201 CurMSS 1448 MaxMSS 1448 MinMSS 1428 X_Sndbuf 16384 X_Rcvbuf 87380 CurRetxQueue 0 MaxRetxQueue 0 CurAppWQueue 0 MaxAppWQueue 0 CurRwinSent 14208 MaxRwinSent 14208 MinRwinSent 5792 LimRwin 8365440 DupAcksOut 0 CurReasmQueue 0 MaxReasmQueue 0 CurAppRQueue 0 MaxAppRQueue 1152 X_rcv_ssthresh 14144 X_wnd_clamp 64087 X_dbg1 14208 X_dbg2 1152 X_dbg3 14144 X_dbg4 0 CurRwinRcvd 3749888 MaxRwinRcvd 3749888 MinRwinRcvd 3747840 LocalAddressType 1 LocalAddress 131.225.2.22 LocalPort 22 RemAddress 131.225.82.152 RemPort 52973 X_RcvRTT 405000 .................................................................. wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-15 0:48 ` John Heffner 2008-04-15 8:25 ` Ilpo Järvinen 2008-04-15 15:45 ` Wenji Wu @ 2008-04-15 16:39 ` Wenji Wu 2008-04-15 17:01 ` John Heffner 2 siblings, 1 reply; 56+ messages in thread From: Wenji Wu @ 2008-04-15 16:39 UTC (permalink / raw) To: John Heffner; +Cc: Ilpo Järvinen, Netdev My fault, resent. > Still the mystery remains as to why *both* are going so slowly. You > mentioned you're using a web100 kernel. What are the final values of > all the variables for the connections (grab with readall)? kernel 2.6.24 "echo 1 > /proc/sys/net/ipv4/tcp_no_metrics_save" .............................................................................................. With SACK On, throughput: 179Mbps Connection 4 (198.2.1.2 56648 131.225.2.16 5001) State 1 SACKEnabled 3 TimestampsEnabled 1 NagleEnabled 1 ECNEnabled 0 SndWinScale 11 RcvWinScale 7 ActiveOpen 1 MSSRcvd 0 WinScaleRcvd 11 WinScaleSent 7 PktsOut 154770 DataPktsOut 154770 DataBytesOut 226294264 PktsIn 149398 DataPktsIn 0 DataBytesIn 0 SndUna 930060039 SndNxt 930060039 SndMax 930060039 ThruBytesAcked 224092186 SndISS 705967853 RcvNxt 4282199280 ThruBytesReceived 0 RecvISS 4282199280 StartTimeSec 1208277286 StartTimeUsec 813964 Duration 13984145 SndLimTransSender 3 SndLimBytesSender 7208 SndLimTimeSender 3107 SndLimTransCwnd 2 SndLimBytesCwnd 226287056 SndLimTimeCwnd 10003734 SndLimTransRwin 0 SndLimBytesRwin 0 SndLimTimeRwin 0 SlowStart 0 CongAvoid 0 CongestionSignals 2 OtherReductions 19402 X_OtherReductionsCV 0 X_OtherReductionsCM 19402 CongestionOverCount 13 CurCwnd 4344 MaxCwnd 102808 CurSsthresh 94894680 LimCwnd 4294965848 MaxSsthresh 94894680 MinSsthresh 7240 FastRetran 2 Timeouts 0 SubsequentTimeouts 0 CurTimeoutCount 0 AbruptTimeouts 0 PktsRetrans 7 BytesRetrans 10136 DupAcksIn 41940 SACKsRcvd 118692 SACKBlocksRcvd 189919 PreCongSumCwnd 91224 PreCongSumRTT 6 PostCongSumRTT 7 PostCongCountRTT 2 ECERcvd 0 SendStall 0 QuenchRcvd 0 RetranThresh 30 NonRecovDA 0 AckAfterFR 0 DSACKDups 0 SampleRTT 3 SmoothedRTT 3 RTTVar 50 MaxRTT 4 MinRTT 2 SumRTT 142655 CountRTT 43932 CurRTO 203 MaxRTO 204 MinRTO 203 CurMSS 1448 MaxMSS 1448 MinMSS 524 X_Sndbuf 206976 X_Rcvbuf 87380 CurRetxQueue 0 MaxRetxQueue 0 CurAppWQueue 130320 MaxAppWQueue 237472 CurRwinSent 5888 MaxRwinSent 5888 MinRwinSent 5840 LimRwin 0 DupAcksOut 0 CurReasmQueue 0 MaxReasmQueue 0 CurAppRQueue 0 MaxAppRQueue 0 X_rcv_ssthresh 5840 X_wnd_clamp 64087 X_dbg1 5888 X_dbg2 536 X_dbg3 5840 X_dbg4 0 CurRwinRcvd 3137536 MaxRwinRcvd 3137536 MinRwinRcvd 17896 LocalAddressType 1 LocalAddress 198.2.1.2 LocalPort 56648 RemAddress 131.225.2.16 RemPort 5001 X_RcvRTT 0 [root@gw004 ipv4]# ..................................................................
With SACK Off: Throughput: 258Mbps Connection 5 (198.2.1.2 43578 131.225.2.16 5001) State 1 SACKEnabled 0 TimestampsEnabled 1 NagleEnabled 1 ECNEnabled 0 SndWinScale 11 RcvWinScale 7 ActiveOpen 1 MSSRcvd 0 WinScaleRcvd 11 WinScaleSent 7 PktsOut 223011 DataPktsOut 223011 DataBytesOut 326318584 PktsIn 216404 DataPktsIn 0 DataBytesIn 0 SndUna 4002973902 SndNxt 4002973902 SndMax 4002973902 ThruBytesAcked 322904090 SndISS 3680069812 RcvNxt 2942495629 ThruBytesReceived 0 RecvISS 2942495629 StartTimeSec 1208277475 StartTimeUsec 779859 Duration 18149747 SndLimTransSender 4 SndLimBytesSender 10456 SndLimTimeSender 3787 SndLimTransCwnd 3 SndLimBytesCwnd 326308128 SndLimTimeCwnd 10006059 SndLimTransRwin 0 SndLimBytesRwin 0 SndLimTimeRwin 0 SlowStart 0 CongAvoid 0 CongestionSignals 3 OtherReductions 13166 X_OtherReductionsCV 0 X_OtherReductionsCM 13166 CongestionOverCount 37 CurCwnd 10136 MaxCwnd 173760 CurSsthresh 94894680 LimCwnd 4294965848 MaxSsthresh 94894680 MinSsthresh 46336 FastRetran 3 Timeouts 0 SubsequentTimeouts 0 CurTimeoutCount 0 AbruptTimeouts 0 PktsRetrans 7 BytesRetrans 10136 DupAcksIn 59484 SACKsRcvd 0 SACKBlocksRcvd 0 PreCongSumCwnd 286704 PreCongSumRTT 12 PostCongSumRTT 11 PostCongCountRTT 3 ECERcvd 0 SendStall 0 QuenchRcvd 0 RetranThresh 23 NonRecovDA 0 AckAfterFR 0 DSACKDups 0 SampleRTT 4 SmoothedRTT 4 RTTVar 50 MaxRTT 6 MinRTT 2 SumRTT 159332 CountRTT 48291 CurRTO 204 MaxRTO 204 MinRTO 203 CurMSS 1448 MaxMSS 1448 MinMSS 524 X_Sndbuf 451584 X_Rcvbuf 87380 CurRetxQueue 0 MaxRetxQueue 0 CurAppWQueue 373584 MaxAppWQueue 454672 CurRwinSent 5888 MaxRwinSent 5888 MinRwinSent 5840 LimRwin 0 DupAcksOut 0 CurReasmQueue 0 MaxReasmQueue 0 CurAppRQueue 0 MaxAppRQueue 0 X_rcv_ssthresh 5840 X_wnd_clamp 64087 X_dbg1 5888 X_dbg2 536 X_dbg3 5840 X_dbg4 0 CurRwinRcvd 3137536 MaxRwinRcvd 3137536 MinRwinRcvd 17896 LocalAddressType 1 LocalAddress 198.2.1.2 LocalPort 43578 RemAddress 131.225.2.16 RemPort 5001 X_RcvRTT 0 [root@gw004 ipv4]# ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-15 16:39 ` Wenji Wu @ 2008-04-15 17:01 ` John Heffner 2008-04-15 17:08 ` Ilpo Järvinen 2008-04-15 17:55 ` Wenji Wu 0 siblings, 2 replies; 56+ messages in thread From: John Heffner @ 2008-04-15 17:01 UTC (permalink / raw) To: Wenji Wu; +Cc: Ilpo Järvinen, Netdev On Tue, Apr 15, 2008 at 9:39 AM, Wenji Wu <wenji@fnal.gov> wrote: > SlowStart 0 > CongAvoid 0 > CongestionSignals 3 > OtherReductions 13166 > X_OtherReductionsCV 0 > X_OtherReductionsCM 13166 > CongestionOverCount 37 > CurCwnd 10136 > > MaxCwnd 173760 > CurSsthresh 94894680 > LimCwnd 4294965848 > MaxSsthresh 94894680 > MinSsthresh 46336 We can see that in both cases you are getting throttled by tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why it's reaching this code - I would have thought that the high tp->reordering would prevent this. Ilpo, do you have any insights? It's not all that surprising that packets_in_flight is a higher value with newreno than sack, which would explain the higher window with newreno. Wenji, the web100 kernel has a sysctl - WAD_MaxBurst. I suspect it may make a significant difference if you set this to a large value. -John ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-15 17:01 ` John Heffner @ 2008-04-15 17:08 ` Ilpo Järvinen 2008-04-15 17:23 ` John Heffner 2008-04-15 17:55 ` Wenji Wu 1 sibling, 1 reply; 56+ messages in thread From: Ilpo Järvinen @ 2008-04-15 17:08 UTC (permalink / raw) To: John Heffner; +Cc: Wenji Wu, Netdev On Tue, 15 Apr 2008, John Heffner wrote: > On Tue, Apr 15, 2008 at 9:39 AM, Wenji Wu <wenji@fnal.gov> wrote: > > SlowStart 0 > > CongAvoid 0 > > CongestionSignals 3 > > OtherReductions 13166 > > X_OtherReductionsCV 0 > > X_OtherReductionsCM 13166 > > CongestionOverCount 37 > > CurCwnd 10136 > > > > MaxCwnd 173760 > > CurSsthresh 94894680 > > LimCwnd 4294965848 > > MaxSsthresh 94894680 > > MinSsthresh 46336 > > We can see that in both cases you are getting throttled by > tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why > it's reaching this code - I would have thought that the high > tp->reordering would prevent this. Ilpo, do you have any insights? What makes you think so? It's called from tcp_try_to_open, as anyone can read from the source, basically when our state is CA_Disorder (some very small portion might happen in CA_Recovery besides that). -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-15 17:08 ` Ilpo Järvinen @ 2008-04-15 17:23 ` John Heffner 2008-04-15 18:00 ` Matt Mathis 0 siblings, 1 reply; 56+ messages in thread From: John Heffner @ 2008-04-15 17:23 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: Wenji Wu, Netdev On Tue, Apr 15, 2008 at 10:08 AM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > On Tue, 15 Apr 2008, John Heffner wrote: > > We can see that in both cases you are getting throttled by > > tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why > > it's reaching this code - I would have thought that the high > > tp->reordering would prevent this. Ilpo, do you have any insights? > > What makes you think so? It's called from tcp_try_to_open, as anyone can > read from the source, basically when our state is CA_Disorder (some very > small portion might happen in CA_Recovery besides that). This is what X_OtherReductionsCM instruments, and that was the only thing holding back cwnd. I just looked at the source, and indeed it will be called on every ack when we are in the disorder state. Limiting cwnd to packets_in_flight() + 3 here is going to prevent cwnd from growing when the reordering is greater than 3. Making max_burst at least tp->reordering should help some, though I'm not sure it's the right thing to do. -John ^ permalink raw reply [flat|nested] 56+ messages in thread
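The clamp under discussion, approximately as it stood in 2.6.24's net/ipv4/tcp_input.c (from memory; treat as a sketch):

    /* Bound cwnd to what is actually in flight plus a small burst
     * allowance; applied on ACKs processed in the CA_Disorder state. */
    static void tcp_moderate_cwnd(struct tcp_sock *tp)
    {
            tp->snd_cwnd = min(tp->snd_cwnd,
                               tcp_packets_in_flight(tp) +
                               tcp_max_burst(tp));
            tp->snd_cwnd_stamp = tcp_time_stamp;
    }

With tcp_max_burst() fixed at 3, any hole wider than three packets drags snd_cwnd down toward the deflated in-flight count on every ACK, which matches the throttling visible in the web100 dumps above.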
* Re: RE: A Linux TCP SACK Question 2008-04-15 17:23 ` John Heffner @ 2008-04-15 18:00 ` Matt Mathis 0 siblings, 0 replies; 56+ messages in thread From: Matt Mathis @ 2008-04-15 18:00 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: John Heffner, Wenji Wu, Netdev In some future kernel release, I would consider changing it to limit cwnd to be less than packets_in_flight() + reorder + 3(?). If the network is reordering packets, then it has to accept bursts, otherwise TCP can never open the window. The +3 (or some other constant) is still needed because TCP has to send extra packets at the point where the window changes. As an alternative, you could write a research paper on how the network could do LIFO packet scheduling so the reordering serves as a congestion signal to the stacks. I bet it would have some really interesting properties. Oh wait, April 1st was 2 weeks ago. Thanks, --MM-- On Tue, 15 Apr 2008, John Heffner wrote: > On Tue, Apr 15, 2008 at 10:08 AM, Ilpo Järvinen > <ilpo.jarvinen@helsinki.fi> wrote: >> On Tue, 15 Apr 2008, John Heffner wrote: >> > We can see that in both cases you are getting throttled by >> > tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why >> > it's reaching this code - I would have thought that the high >> > tp->reordering would prevent this. Ilpo, do you have any insights? >> >> What makes you think so? It's called from tcp_try_to_open, as anyone can >> read from the source, basically when our state is CA_Disorder (some very >> small portion might happen in CA_Recovery besides that). > > This is what X_OtherReductionsCM instruments, and that was the only > thing holding back cwnd. > > I just looked at the source, and indeed it will be called on every ack > when we are in the disorder state. Limiting cwnd to > packets_in_flight() + 3 here is going to prevent cwnd from growing > when the reordering is greater than 3. Making max_burst at least > tp->reordering should help some, though I'm not sure it's the right > thing to do. > > -John > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: RE: A Linux TCP SACK Question 2008-04-15 17:01 ` John Heffner 2008-04-15 17:08 ` Ilpo Järvinen @ 2008-04-15 17:55 ` Wenji Wu 1 sibling, 0 replies; 56+ messages in thread From: Wenji Wu @ 2008-04-15 17:55 UTC (permalink / raw) To: 'John Heffner'; +Cc: 'Ilpo Järvinen', 'Netdev' >We can see that in both cases you are getting throttled by >tcp_moderate_cwnd (X_OtherReductionsCM). I'm not sure offhand why >it's reaching this code - I would have thought that the high >tp->reordering would prevent this. Ilpo, do you have any insights? >It's not all that surprising that packets_in_flight is a higher value >with newreno than sack, which would explain the higher window with >newreno. >Wenji, the web100 kernel has a sysctl - WAD_MaxBurst. I suspect it >may make a significant difference if you set this to a large value. It is surprising! When I increase WAD_MaxBurst (patched with Web100) from 3 to 20, the throughput in both cases (SACK on/off) saturates the 1Gbps link! ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-08 12:33 ` Wenji Wu 2008-04-08 13:45 ` Ilpo Järvinen @ 2008-04-08 15:57 ` John Heffner 1 sibling, 0 replies; 56+ messages in thread From: John Heffner @ 2008-04-08 15:57 UTC (permalink / raw) To: Wenji Wu; +Cc: Ilpo Järvinen, Sangtae Ha, Netdev On Tue, Apr 8, 2008 at 5:33 AM, Wenji Wu <wenji@fnal.gov> wrote: > > NewReno never retransmitted anything in them (except at the very end > > of > > the transfer). Probably something related to how tp->reordering behaves > > I suppose... > > Yes, the adaptive tp->reordering will play a role here. I remember several years ago when I first looked at chronic reordering with a high BDP, the problems I had were: 1) Only acks of new data can advance cwnd, and these only advance by the normal amount per ack, so cwnd grows very slowly. 2) Reordering caused slow start to exit early, before the reordering threshold had adapted. 3) The "undo" code didn't work well because of cwnd moderation. 4) There were bugs in the reordering calculation that caused the threshold to be pulled back. Some of these shouldn't matter to you because your RTT is low, but I thought it would be worth mentioning. I'm not sure what is keeping your cwnd from growing -- it always seems to be within a small range in both cases, which is not right unless there's a bottleneck at the sender. The fact that reno does a little better than sack seems like the less important problem. Also, what's the behavior when turning off reordering, in each or both directions? -John ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: A Linux TCP SACK Question 2008-04-08 6:36 ` Ilpo Järvinen 2008-04-08 12:33 ` Wenji Wu @ 2008-04-08 14:07 ` John Heffner 2008-04-14 16:10 ` Wenji Wu 2 siblings, 0 replies; 56+ messages in thread From: John Heffner @ 2008-04-08 14:07 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: Wenji Wu, Sangtae Ha, Netdev On Mon, Apr 7, 2008 at 11:36 PM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > > On Mon, 7 Apr 2008, Wenji Wu wrote: > > > >I don't think reorderings frequently happened in your directly > > >connected networking scenario. Please post your tcpdump file for > > >clearing out all doubts. > > > > https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/ > > > > Two tcpdump files: one with SACK on, the other with SACK off. The test > > configures described in my previous emails. > > NewReno never retransmitted anything in them (except at the very end of > the transfer). Probably something related to how tp->reordering behaves > I suppose... Yes, this looks very suspicious. Can we see this again with TSO off? -John ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: A Linux TCP SACK Question 2008-04-08 6:36 ` Ilpo Järvinen 2008-04-08 12:33 ` Wenji Wu 2008-04-08 14:07 ` John Heffner @ 2008-04-14 16:10 ` Wenji Wu 2008-04-14 16:48 ` Ilpo Järvinen 2 siblings, 1 reply; 56+ messages in thread From: Wenji Wu @ 2008-04-14 16:10 UTC (permalink / raw) To: 'Ilpo Järvinen' Cc: 'Sangtae Ha', 'John Heffner', 'Netdev' Hi, Ilpo, The latest results have been posted to: https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/ The kernel under test is: Linux-2.6.25-rc9. I have checked its changelog, which shows that your latest fix is included. In the tests, I vary the tcp_frto (0, 1, and 2) with SACK On/Off. The experiment works as: Sender --- Router --- Receiver Iperf is sending from the sender to the receiver. In between there is an emulated router which runs netem. The emulated router has two interfaces, both with netem configured. One interface emulates the forward path and the other the reverse path. Both netem interfaces are configured with 1.5ms delay and 0.15ms variance. No packet drops in the tests or the packet captures. All of these systems are multi-core platforms, with 2GHz+ CPUs. I ran top to verify; the CPUs are idle most of the time. wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: A Linux TCP SACK Question 2008-04-14 16:10 ` Wenji Wu @ 2008-04-14 16:48 ` Ilpo Järvinen 2008-04-14 22:07 ` Wenji Wu 0 siblings, 1 reply; 56+ messages in thread From: Ilpo Järvinen @ 2008-04-14 16:48 UTC (permalink / raw) To: Wenji Wu; +Cc: 'Sangtae Ha', 'John Heffner', 'Netdev' On Mon, 14 Apr 2008, Wenji Wu wrote: > The latest results have been posted to: > > https://plone3.fnal.gov/P0/WAN/Members/wenji/tcp_dump_files_sack/ > > The kernel under test is: Linux-2.6.25-rc9. I have checked its > changelog, which shows that your latest fix is included. Hmm, now there are even fewer retransmissions (barely some with the SACK in the end). I suppose the reordering detection is good enough to kill them. ...You could perhaps figure that out from MIBs if you would want to. > In the tests, I vary the tcp_frto (0, 1, and 2) with SACK On/Off. ...I should have said more clearly last time already that these are not significant with your workload. > The experiment works as: > > Sender --- Router --- Receiver > > Iperf is sending from the sender to the receiver. In between there is an > emulated router which runs netem. The emulated router has two interfaces, > both with netem configured. One interface emulates the forward path and the > other the reverse path. Both netem interfaces are configured with 1.5ms > delay and 0.15ms variance. No packet drops in the tests or the packet captures. ...How about this theory: Forward path reordering causes duplicate ACKs due to old segments. These are treated differently for NewReno and SACK: NewReno => Sends new data out (limited xmit; it's not limited to two segments in Linux as per the RFC; however, the RFC doesn't consider autotuning of DupThresh either). SACK => No new SACK block discovered. Packets in flight remain the same, and thus no new segment is sent. ...What do others think? I guess it should be visible with fwd path reordering alone, though the added distance with reverse path reordering might act as an amplifier, because NewReno benefits from shorter-RTT packets when a fwd path old segment arrives, while SACK loses its ability to increase outstanding data... ...A quick look into it with tcptrace's outstanding data plot: it seems that NewReno levels at ~100000 and SACK at ~68000. ...I think SACK just knows too much? :-/ > All of these systems are multi-core platforms, with 2GHz+ CPUs. I ran > top to verify; the CPUs are idle most of the time. Thanks for adding this for others. I agree with you that this is not a CPU horsepower issue. -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
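Ilpo's theory maps directly onto the in-flight accounting; roughly, from include/net/tcp.h of that era (a sketch from memory): for NewReno, sacked_out is simply a count of duplicate ACKs, so each dupACK produced by an old reordered segment lowers packets-in-flight and lets one new segment out (limited transmit), whereas for SACK a dupACK carrying no new SACK block leaves sacked_out, and therefore the in-flight estimate, unchanged.

    static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
    {
            /* SACKed (or, for NewReno, dupACK-inferred) plus marked lost */
            return tp->sacked_out + tp->lost_out;
    }

    static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp)
    {
            return tp->packets_out - tcp_left_out(tp) + tp->retrans_out;
    }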
* Re: RE: A Linux TCP SACK Question 2008-04-14 16:48 ` Ilpo Järvinen @ 2008-04-14 22:07 ` Wenji Wu 2008-04-15 8:23 ` Ilpo Järvinen 0 siblings, 1 reply; 56+ messages in thread From: Wenji Wu @ 2008-04-14 22:07 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: 'Netdev' > Hmm, now there are even fewer retransmissions (barely some with > the SACK in the end). > > I suppose the reordering detection is good enough to kill them. ...You > > could perhaps figure that out from MIBs if you would want to. > Yes, web100 shows that tcp_reordering can get as large as 127. I just reran the following experiments to show why there are so few retransmissions in my previous posts. (1) Flush the system routing cache by running "ip route flush cache" before running and tcpdumping the traffic. (2) Before running and tcpdumping the traffic, run a data transmission test to generate tcp_reordering in the routing cache. Do not flush the routing cache. Then run and tcpdump the traffic. Both experiments with SACK off. The results are posted to https://plone3.fnal.gov/P0/WAN/Members/wenji/adaptive_tcp_reordering/ So, the few retransmissions in my previous post really are caused by the routing cache. But flushing the cache has nothing to do with SACK on/off. Still, the throughput with SACK off is better than with SACK on. wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: RE: A Linux TCP SACK Question 2008-04-14 22:07 ` Wenji Wu @ 2008-04-15 8:23 ` Ilpo Järvinen 0 siblings, 0 replies; 56+ messages in thread From: Ilpo Järvinen @ 2008-04-15 8:23 UTC (permalink / raw) To: Wenji Wu; +Cc: 'Netdev' On Mon, 14 Apr 2008, Wenji Wu wrote: > > > Hmm, now there are even fewer retransmissions (barely some with > > the SACK in the end). > > > > I suppose the reordering detection is good enough to kill them. ...You > > > > could perhaps figure that out from MIBs if you would want to. > > > > Yes, web100 shows that tcp_reordering can get as large as 127. It should get large, though I suspect newreno's new value (tp->packets_out + addend) might have tp->packets_out too much in it. > I just reran the following experiments to show why there are so few > retransmissions in my previous posts. > > (1) Flush the system routing cache by running "ip route flush cache" > before running and tcpdumping the traffic. I didn't know that worked; the tcp_no_metrics_save sysctl seems to prevent saving them from a running TCP flow when the flow ends. > (2) Before running and tcpdumping the traffic, run a data transmission > test to generate tcp_reordering in the routing cache. Do not flush the > routing cache. Then run and tcpdump the traffic. > > Both experiments with SACK off. > > The results are posted to > https://plone3.fnal.gov/P0/WAN/Members/wenji/adaptive_tcp_reordering/ > > So, the few retransmissions in my previous post really are caused by the > routing cache. Yes. Remember, however, that the initial metrics also have an effect on the initial ssthresh, so one must be very careful not to cause unfairness through them if the metrics are not cleared. > But flushing the cache has nothing to do with SACK on/off. Still, the > throughput with SACK off is better than with SACK on. Yes, I think it alone would never explain it. Though a difference in initial ssthresh might have been the explanation for the different level where outstanding data settled in the logs without any retransmissions. -- i. ^ permalink raw reply [flat|nested] 56+ messages in thread
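The routing-cache interaction mentioned here lives in tcp_update_metrics() in net/ipv4/tcp_input.c; a heavily trimmed sketch from memory (guards such as dst_metric_locked() and all the RTT/cwnd/ssthresh updates are omitted, and the exact field access is unverified) of the two pieces relevant to this thread:

    static void tcp_update_metrics(struct sock *sk)
    {
            struct tcp_sock *tp = tcp_sk(sk);
            struct dst_entry *dst = __sk_dst_get(sk);

            if (sysctl_tcp_nometrics_save)
                    return;  /* tcp_no_metrics_save=1 disables all of this */

            /* ... RTT, cwnd and ssthresh metric updates elided ... */

            /* The learned reordering degree is saved in the route's
             * metrics, so a later flow to the same destination starts
             * with an already-raised dupthresh. */
            if (tp->reordering != sysctl_tcp_reordering)
                    dst->metrics[RTAX_REORDERING - 1] = tp->reordering;
    }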
* Re: RE: A Linux TCP SACK Question 2008-04-04 21:33 ` Ilpo Järvinen 2008-04-04 21:39 ` Ilpo Järvinen @ 2008-04-04 21:40 ` Wenji Wu 1 sibling, 0 replies; 56+ messages in thread From: Wenji Wu @ 2008-04-04 21:40 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: 'John Heffner', 'Netdev' > On Fri, 4 Apr 2008, Wenji Wu wrote: > > > > > >I'd suggest that you don't waste too much effort for 2.6.24. > ...Most of it > > >is recoded/updated since then. > > > > I just tried it on 2.6.25-rc8. The result is still the same: the throughput > > with SACK on is less than with SACK off. > > Hmm, can you also try whether playing around with the FRTO setting makes some > difference (tcp_frto sysctl)? Still the same; I just tried with FRTO and FACK. No difference: SACK on is worse than SACK off. wenji ^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2008-08-28 19:30 UTC | newest] Thread overview: 56+ messages 2008-04-04 4:54 A Linux TCP SACK Question Wenji Wu 2008-04-04 16:27 ` John Heffner 2008-04-04 17:49 ` Wenji Wu 2008-04-04 18:07 ` John Heffner 2008-04-04 20:00 ` Ilpo Järvinen 2008-04-04 20:07 ` Wenji Wu 2008-04-04 21:15 ` Wenji Wu 2008-04-04 21:33 ` Ilpo Järvinen 2008-04-04 21:39 ` Ilpo Järvinen 2008-04-04 22:14 ` Wenji Wu 2008-04-05 17:42 ` Ilpo Järvinen 2008-04-05 21:17 ` Sangtae Ha 2008-04-06 20:27 ` Wenji Wu 2008-04-06 22:43 ` Sangtae Ha 2008-04-07 14:56 ` Wenji Wu 2008-04-08 6:36 ` Ilpo Järvinen 2008-04-08 12:33 ` Wenji Wu 2008-04-08 13:45 ` Ilpo Järvinen 2008-04-08 14:30 ` Wenji Wu 2008-04-08 14:59 ` Ilpo Järvinen 2008-04-08 15:27 ` Wenji Wu 2008-04-08 17:26 ` Ilpo Järvinen 2008-04-14 22:47 ` Wenji Wu 2008-04-15 0:48 ` John Heffner 2008-04-15 8:25 ` Ilpo Järvinen 2008-04-15 18:01 ` Wenji Wu 2008-04-15 22:40 ` John Heffner 2008-04-16 8:27 ` David Miller 2008-04-16 9:21 ` Ilpo Järvinen 2008-04-16 9:35 ` David Miller 2008-04-16 14:50 ` Wenji Wu 2008-04-18 6:52 ` David Miller 2008-08-27 14:38 ` about Linux adaptively adjusting ssthresh Wenji Wu 2008-08-27 22:48 ` John Heffner 2008-08-28 0:53 ` Wenji Wu 2008-08-28 6:34 ` Ilpo Järvinen 2008-08-28 14:20 ` about Linux adaptively adjusting dupthresh Wenji Wu 2008-08-28 18:53 ` Ilpo Järvinen 2008-08-28 19:30 ` Wenji Wu 2008-04-16 14:40 ` A Linux TCP SACK Question John Heffner 2008-04-16 15:03 ` Ilpo Järvinen 2008-04-16 14:46 ` Wenji Wu 2008-04-15 15:45 ` Wenji Wu 2008-04-15 16:39 ` Wenji Wu 2008-04-15 17:01 ` John Heffner 2008-04-15 17:08 ` Ilpo Järvinen 2008-04-15 17:23 ` John Heffner 2008-04-15 18:00 ` Matt Mathis 2008-04-15 17:55 ` Wenji Wu 2008-04-08 15:57 ` John Heffner 2008-04-08 14:07 ` John Heffner 2008-04-14 16:10 ` Wenji Wu 2008-04-14 16:48 ` Ilpo Järvinen 2008-04-14 22:07 ` Wenji Wu 2008-04-15 8:23 ` Ilpo Järvinen 2008-04-04 21:40 ` Wenji Wu