* 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
From: Bill Fink @ 2007-05-07 0:37 UTC
To: Linux Network Developers
The initial TCP slow start on 2.6.20.7 cubic (and to a lesser
extent bic) seems to be way too slow. With an ~80 ms RTT, this
is what cubic delivers (thirty second test with one second interval
reporting and specifying a socket buffer size of 60 MB):
[root@lang2 ~]# netstat -s | grep -i retrans
0 segments retransmited
[root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
cubic
[root@lang2 ~]# nuttcp -T30 -i1 -w60m 192.168.89.15
6.8188 MB / 1.00 sec = 57.0365 Mbps
16.2097 MB / 1.00 sec = 135.9824 Mbps
25.4553 MB / 1.00 sec = 213.5420 Mbps
35.5127 MB / 1.00 sec = 297.9119 Mbps
43.0066 MB / 1.00 sec = 360.7770 Mbps
50.3210 MB / 1.00 sec = 422.1370 Mbps
59.0796 MB / 1.00 sec = 495.6124 Mbps
69.1284 MB / 1.00 sec = 579.9098 Mbps
76.6479 MB / 1.00 sec = 642.9130 Mbps
90.6189 MB / 1.00 sec = 760.2835 Mbps
109.4348 MB / 1.00 sec = 918.0361 Mbps
128.3105 MB / 1.00 sec = 1076.3813 Mbps
150.4932 MB / 1.00 sec = 1262.4686 Mbps
175.9229 MB / 1.00 sec = 1475.7965 Mbps
205.9412 MB / 1.00 sec = 1727.6150 Mbps
240.8130 MB / 1.00 sec = 2020.1504 Mbps
282.1790 MB / 1.00 sec = 2367.1644 Mbps
318.1841 MB / 1.00 sec = 2669.1349 Mbps
372.6814 MB / 1.00 sec = 3126.1687 Mbps
440.8411 MB / 1.00 sec = 3698.5200 Mbps
524.8633 MB / 1.00 sec = 4403.0220 Mbps
614.3542 MB / 1.00 sec = 5153.7367 Mbps
718.9917 MB / 1.00 sec = 6031.5386 Mbps
829.0474 MB / 1.00 sec = 6954.6438 Mbps
867.3289 MB / 1.00 sec = 7275.9510 Mbps
865.7759 MB / 1.00 sec = 7262.9813 Mbps
864.4795 MB / 1.00 sec = 7251.7071 Mbps
864.5425 MB / 1.00 sec = 7252.8519 Mbps
867.3372 MB / 1.00 sec = 7246.9232 Mbps
10773.6875 MB / 30.00 sec = 3012.3936 Mbps 38 %TX 25 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
0 segments retransmited
It takes 25 seconds for cubic TCP to reach its maximal rate.
Note that there were no TCP retransmissions (no congestion
experienced).
Now with bic (only 20 second test this time):
[root@lang2 ~]# echo bic > /proc/sys/net/ipv4/tcp_congestion_control
[root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
bic
[root@lang2 ~]# nuttcp -T20 -i1 -w60m 192.168.89.15
9.9548 MB / 1.00 sec = 83.1497 Mbps
47.2021 MB / 1.00 sec = 395.9762 Mbps
92.4304 MB / 1.00 sec = 775.3889 Mbps
134.3774 MB / 1.00 sec = 1127.2758 Mbps
194.3286 MB / 1.00 sec = 1630.1987 Mbps
280.0598 MB / 1.00 sec = 2349.3613 Mbps
404.3201 MB / 1.00 sec = 3391.8250 Mbps
559.1594 MB / 1.00 sec = 4690.6677 Mbps
792.7100 MB / 1.00 sec = 6650.0257 Mbps
857.2241 MB / 1.00 sec = 7190.6942 Mbps
852.6912 MB / 1.00 sec = 7153.3283 Mbps
852.6968 MB / 1.00 sec = 7153.2538 Mbps
851.3162 MB / 1.00 sec = 7141.7575 Mbps
851.4927 MB / 1.00 sec = 7143.0240 Mbps
850.8782 MB / 1.00 sec = 7137.8762 Mbps
852.7119 MB / 1.00 sec = 7153.2949 Mbps
852.3879 MB / 1.00 sec = 7150.2982 Mbps
850.2163 MB / 1.00 sec = 7132.5165 Mbps
849.8340 MB / 1.00 sec = 7129.0026 Mbps
11882.7500 MB / 20.00 sec = 4984.0068 Mbps 67 %TX 41 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
0 segments retransmited
bic does better but still takes 10 seconds to achieve its maximal
rate.
Surprisingly, venerable reno does the best (only a 10 second test):
[root@lang2 ~]# echo reno > /proc/sys/net/ipv4/tcp_congestion_control
[root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
reno
[root@lang2 ~]# nuttcp -T10 -i1 -w60m 192.168.89.15
69.9829 MB / 1.01 sec = 583.5822 Mbps
844.3870 MB / 1.00 sec = 7083.2808 Mbps
862.7568 MB / 1.00 sec = 7237.7342 Mbps
859.5725 MB / 1.00 sec = 7210.8981 Mbps
860.1365 MB / 1.00 sec = 7215.4487 Mbps
865.3940 MB / 1.00 sec = 7259.8434 Mbps
863.9678 MB / 1.00 sec = 7247.4942 Mbps
864.7493 MB / 1.00 sec = 7254.4634 Mbps
864.6660 MB / 1.00 sec = 7253.5183 Mbps
7816.9375 MB / 10.00 sec = 6554.4883 Mbps 90 %TX 53 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
0 segments retransmited
reno achieves its maximal rate in about 2 seconds. This is what I
would expect from the exponential increase during TCP's initial
slow start. To achieve 10 Gbps on an 80 ms RTT with 9000 byte
jumbo frame packets would require:
[root@lang2 ~]# bc -l
scale=10
10^10*0.080/9000/8
11111.1111111111
So 11111 packets would have to be in flight during one RTT.
It should take about log2(11111)+1 round trips to achieve 10 Gbps
(note that bc's l() function is the natural log, so log2(x) is l(x)/l(2)):
[root@lang2 ~]# bc -l
scale=10
l(11111)/l(2)+1
14.4397010470
And 15 round trips at 80 ms each give a total time of:
[root@lang2 ~]# bc -l
scale=10
15*0.080
1.200
So if there is no packet loss (and there was none), it should only
take about 1.2 seconds to achieve 10 Gbps. Only TCP reno is even in
this ballpark.
Now it's quite possible there's something basic I don't understand,
such as some /proc/sys/net/ipv4/tcp_* or /sys/module/tcp_*/parameters/*
parameter I've overlooked, in which case feel free to just refer me
to any suitable documentation.
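(The per-module tunables themselves can at least be enumerated, even
if their meaning isn't documented anywhere, with something like:
    ls /sys/module/tcp_cubic/parameters/ /sys/module/tcp_bic/parameters/
but that still leaves the question of what they actually do.)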
I also checked the Changelog for 2.6.20.{8,9,10,11} to see if there
might be any relevant recent bug fixes, but the only thing that seemed
even remotely related was the 2.6.20.11 bug fix for the tcp_mem setting.
Although this did affect me, I manually adjusted the tcp_mem settings
before running these tests.
[root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_mem
393216  524288  786432
The test setup was:
+-------+                 +-------+                 +-------+
|       |eth2         eth2|       |eth3         eth2|       |
| lang2 |-----10-GigE-----| lang1 |-----10-GigE-----| lang3 |
|       |                 |       |                 |       |
+-------+                 +-------+                 +-------+
192.168.88.14    192.168.88.13/192.168.89.13      192.168.89.15
All three systems are dual 2.8 GHz AMD Opteron Processor 254 systems
with 4 GB memory and all running the 2.6.20.7 kernel. All the NICs
are Myricom PCI-E 10-GigE NICs.
The 80 ms delay was introduced by applying netem to lang1's eth3
interface:
[root@lang1 ~]# tc qdisc add dev eth3 root netem delay 80ms limit 20000
[root@lang1 ~]# tc qdisc show
qdisc pfifo_fast 0: dev eth2 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc netem 8022: dev eth3 limit 20000 delay 80.0ms reorder 100%
Experimentation determined that netem running on lang1 could handle
about 8-8.5 Gbps without dropping packets.
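As a rough cross-check on the netem limit of 20000 packets: at
~8.5 Gbps with 9000 byte frames, only about 9450 packets are sitting
in the 80 ms delay line at any one time, so the configured limit
itself shouldn't be what caps netem here (presumably it's the
per-packet processing cost):
[root@lang2 ~]# bc -l
scale=10
8.5*10^9*0.080/9000/8
9444.4444444444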
8.5 Gbps UDP test:
[root@lang2 ~]# nuttcp -u -Ri8.5g -w20m 192.168.89.15
10136.4844 MB / 10.01 sec = 8497.8205 Mbps 100 %TX 56 %RX 0 / 1297470 drop/pkt 0.00 %loss
Increasing the rate to 9 Gbps would give some loss:
[root@lang2 ~]# nuttcp -u -Ri9g -w20m 192.168.89.15
10219.1719 MB / 10.01 sec = 8560.2455 Mbps 100 %TX 58 %RX 65500 / 1373554 drop/pkt 4.77 %loss
Based on this, a 60 MB TCP socket buffer size was specified during the
TCP tests to avoid overstressing the lang1 netem delay emulator (and
thus avoid dropping any packets).
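For reference, the 80 ms bandwidth-delay product at that ~8.5 Gbps
ceiling works out to roughly 85 MB, so the requested 60 MB window
should keep the flow safely below it (ignoring any doubling of the
requested buffer size that the kernel may apply):
[root@lang2 ~]# bc -l
scale=10
8.5*10^9*0.080/8
85000000.0000000000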
Simple ping through the lang1 netem delay emulator:
[root@lang2 ~]# ping -c 5 192.168.89.15
PING 192.168.89.15 (192.168.89.15) 56(84) bytes of data.
64 bytes from 192.168.89.15: icmp_seq=1 ttl=63 time=80.4 ms
64 bytes from 192.168.89.15: icmp_seq=2 ttl=63 time=82.1 ms
64 bytes from 192.168.89.15: icmp_seq=3 ttl=63 time=82.1 ms
64 bytes from 192.168.89.15: icmp_seq=4 ttl=63 time=82.1 ms
64 bytes from 192.168.89.15: icmp_seq=5 ttl=63 time=82.1 ms
--- 192.168.89.15 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4014ms
rtt min/avg/max/mdev = 80.453/81.804/82.173/0.722 ms
And a bidirectional traceroute (using the "nuttcp -xt" option):
[root@lang2 ~]# nuttcp -xt 192.168.89.15
traceroute to 192.168.89.15 (192.168.89.15), 30 hops max, 40 byte packets
1 192.168.88.13 (192.168.88.13) 0.141 ms 0.125 ms 0.125 ms
2 192.168.89.15 (192.168.89.15) 82.112 ms 82.039 ms 82.541 ms
traceroute to 192.168.88.14 (192.168.88.14), 30 hops max, 40 byte packets
1 192.168.89.13 (192.168.89.13) 81.101 ms 83.001 ms 82.999 ms
2 192.168.88.14 (192.168.88.14) 83.005 ms 82.985 ms 82.978 ms
So is this a real bug in cubic (and bic), or do I just not understand
something basic?
-Bill
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
From: SANGTAE HA @ 2007-05-08 9:41 UTC
To: Bill Fink; +Cc: Linux Network Developers
Hi Bill,
At this time, BIC and CUBIC use a less aggressive slow start than
other protocols, because we observed that the standard "slow start" is
somewhat aggressive and introduced a lot of packet losses. This may be
changed to the standard "slow start" in a later version of BIC and
CUBIC, but, for now, we are still using a modified slow start.
So, as you observed, this modified slow start behavior may be too slow
for 10G testing. You can alleviate this by changing BIC and CUBIC to
use the standard "slow start", loading these modules with
"initial_ssthresh=0".
Regards,
Sangtae
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
From: Bill Fink @ 2007-05-09 6:31 UTC
To: SANGTAE HA; +Cc: Linux Network Developers
Hi Sangtae,
On Tue, 8 May 2007, SANGTAE HA wrote:
> Hi Bill,
>
> At this time, BIC and CUBIC use a less aggressive slow start than
> other protocols, because we observed that the standard "slow start" is
> somewhat aggressive and introduced a lot of packet losses. This may be
> changed to the standard "slow start" in a later version of BIC and
> CUBIC, but, for now, we are still using a modified slow start.
"slow start" is somewhat of a misnomer. However, I'd argue in favor
of using the standard "slow start" for BIC and CUBIC as the default.
Is the rationale for using a less aggressive "slow start" to be gentler
to certain receivers, which possibly can't handle a rapidly increasing
initial burst of packets (and the resultant necessary allocation of
system resources)? Or is it related to encountering actual network
congestion during the initial "slow start" period, and how well that
is responded to?
> So, as you observed, this modified slow start behavior may be too slow
> for 10G testing. You can alleviate this by changing BIC and CUBIC to
> use the standard "slow start", loading these modules with
> "initial_ssthresh=0".
I saw the initial_ssthresh parameter, but didn't know what it did or
even what its units were. I saw the default value was 100 and tried
increasing it, but I didn't think to try setting it to 0.
[root@lang2 ~]# grep -r initial_ssthresh /usr/src/kernels/linux-2.6.20.7/Documentation/
[root@lang2 ~]#
It would be good to have some documentation for these bic and cubic
parameters similar to the documentation in ip-sysctl.txt for the
/proc/sys/net/ipv[46]/* variables (I know, I know, I should just
"use the source").
Is it expected that the cubic "slow start" is that much less aggressive
than the bic "slow start" (10 secs to max rate for bic in my test versus
25 secs to max rate for cubic)? This could be considered a performance
regression, since the default TCP congestion control was changed from
bic to cubic.
In any event, I'm now happy, as setting initial_ssthresh to 0 works
well for me.
[root@lang2 ~]# netstat -s | grep -i retrans
0 segments retransmited
[root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
cubic
[root@lang2 ~]# cat /sys/module/tcp_cubic/parameters/initial_ssthresh
0
[root@lang2 ~]# nuttcp -T10 -i1 -w60m 192.168.89.15
69.9829 MB / 1.00 sec = 584.2065 Mbps
843.1467 MB / 1.00 sec = 7072.9052 Mbps
844.3655 MB / 1.00 sec = 7082.6544 Mbps
842.2671 MB / 1.00 sec = 7065.7169 Mbps
839.9204 MB / 1.00 sec = 7045.8335 Mbps
840.1780 MB / 1.00 sec = 7048.3114 Mbps
834.1475 MB / 1.00 sec = 6997.4270 Mbps
835.5972 MB / 1.00 sec = 7009.3148 Mbps
835.8152 MB / 1.00 sec = 7011.7537 Mbps
830.9333 MB / 1.00 sec = 6969.9281 Mbps
7617.1875 MB / 10.01 sec = 6386.2622 Mbps 90 %TX 46 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
0 segments retransmited
-Thanks a lot!
-Bill
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
From: Bill Fink @ 2007-05-10 15:32 UTC
To: Bill Fink; +Cc: SANGTAE HA, Linux Network Developers
As a followup, I ran a somewhat interesting test. I increased the
requested socket buffer size to 100 MB, which is sufficient to
overstress the capabilities of the netem delay emulator (which can
handle up to about 8.5 Gbps). This causes some packet loss when
using the standard Reno aggressive "slow start".
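For reference, a fully open 100 MB window over the 80 ms RTT
corresponds to roughly 10.5 Gbps, well above what the netem delay
emulator can handle:
[root@lang2 ~]# bc -l
scale=10
100*1024*1024*8/0.080
10485760000.0000000000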
[root@lang2 ~]# netstat -s | grep -i retrans
0 segments retransmited
[root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
cubic
[root@lang2 ~]# echo 0 > /sys/module/tcp_cubic/parameters/initial_ssthresh
[root@lang2 ~]# cat /sys/module/tcp_cubic/parameters/initial_ssthresh
0
[root@lang2 ~]# nuttcp -T10 -i1 -w100m 192.168.89.15
69.9829 MB / 1.00 sec = 585.1895 Mbps
311.9521 MB / 1.00 sec = 2616.9019 Mbps
0.2332 MB / 1.00 sec = 1.9559 Mbps
37.9907 MB / 1.00 sec = 318.6912 Mbps
702.7856 MB / 1.00 sec = 5895.4640 Mbps
817.0142 MB / 1.00 sec = 6853.7006 Mbps
820.3125 MB / 1.00 sec = 6881.3626 Mbps
820.5625 MB / 1.00 sec = 6883.2601 Mbps
813.0125 MB / 1.00 sec = 6820.2678 Mbps
815.7756 MB / 1.00 sec = 6842.8867 Mbps
5253.2500 MB / 10.07 sec = 4378.0109 Mbps 72 %TX 35 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
464 segments retransmited
464 fast retransmits
Contrast that with the default behavior.
[root@lang2 ~]# echo 100 > /sys/module/tcp_cubic/parameters/initial_ssthresh
[root@lang2 ~]# cat /sys/module/tcp_cubic/parameters/initial_ssthresh
100
[root@lang2 ~]# nuttcp -T10 -i1 -w100m 192.168.89.15
6.8188 MB / 1.00 sec = 57.1670 Mbps
16.2097 MB / 1.00 sec = 135.9795 Mbps
25.4810 MB / 1.00 sec = 213.7525 Mbps
38.7256 MB / 1.00 sec = 324.8580 Mbps
49.7998 MB / 1.00 sec = 417.7565 Mbps
62.5745 MB / 1.00 sec = 524.9189 Mbps
78.6646 MB / 1.00 sec = 659.8947 Mbps
98.9673 MB / 1.00 sec = 830.2086 Mbps
124.3201 MB / 1.00 sec = 1038.7288 Mbps
156.1584 MB / 1.00 sec = 1309.9730 Mbps
775.2500 MB / 10.64 sec = 611.0181 Mbps 7 %TX 7 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
464 segments retransmited
464 fast retransmits
The standard Reno aggressive "slow start" gets much better overall
performance even in this case, because even though the default cubic
behavior manages to avoid the "congestion" event, its lack of
aggressiveness during the initial slow start period puts it at a
major disadvantage. It would take a long time for the tortoise
in this race to catch up with the hare.
It seems best to ramp up as quickly as possible to any congestion,
using the standard Reno aggressive "slow start" behavior, and then
let the power of cubic take over from there, getting the best of
both worlds.
For completeness, here's the same test with bic.
First with the standard Reno aggressive "slow start" behavior:
[root@lang2 ~]# netstat -s | grep -i retrans
464 segments retransmited
464 fast retransmits
[root@lang2 ~]# echo bic > /proc/sys/net/ipv4/tcp_congestion_control
[root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
bic
[root@lang2 ~]# echo 0 > /sys/module/tcp_bic/parameters/initial_ssthresh
[root@lang2 ~]# cat /sys/module/tcp_bic/parameters/initial_ssthresh
0
[root@lang2 ~]# nuttcp -T10 -i1 -w100m 192.168.89.15
69.9829 MB / 1.00 sec = 585.2770 Mbps
302.3921 MB / 1.00 sec = 2536.7045 Mbps
0.0000 MB / 1.00 sec = 0.0000 Mbps
0.7520 MB / 1.00 sec = 6.3079 Mbps
114.1570 MB / 1.00 sec = 957.5914 Mbps
792.9634 MB / 1.00 sec = 6651.5131 Mbps
845.9099 MB / 1.00 sec = 7096.4182 Mbps
865.0825 MB / 1.00 sec = 7257.1575 Mbps
890.4663 MB / 1.00 sec = 7470.0567 Mbps
911.5039 MB / 1.00 sec = 7646.3560 Mbps
4829.9375 MB / 10.05 sec = 4033.0191 Mbps 76 %TX 32 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
1093 segments retransmited
1093 fast retransmits
And then with the default bic behavior:
[root@lang2 ~]# echo 100 > /sys/module/tcp_bic/parameters/initial_ssthresh
[root@lang2 ~]# cat /sys/module/tcp_bic/parameters/initial_ssthresh
100
[root@lang2 ~]# nuttcp -T10 -i1 -w100m 192.168.89.15
9.9548 MB / 1.00 sec = 83.1028 Mbps
47.5439 MB / 1.00 sec = 398.8351 Mbps
107.6147 MB / 1.00 sec = 902.7506 Mbps
183.9038 MB / 1.00 sec = 1542.7124 Mbps
313.4875 MB / 1.00 sec = 2629.7689 Mbps
531.0012 MB / 1.00 sec = 4454.3032 Mbps
841.7866 MB / 1.00 sec = 7061.5098 Mbps
837.5867 MB / 1.00 sec = 7026.4041 Mbps
834.8889 MB / 1.00 sec = 7003.3667 Mbps
4539.6250 MB / 10.00 sec = 3806.5410 Mbps 50 %TX 34 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
1093 segments retransmited
1093 fast retransmits
bic actually does much better than cubic for this scenario, and only
loses out to the standard Reno aggressive "slow start" behavior by a
small amount. Of course in the case of no congestion, it loses out
by a much more significant margin.
This reinforces my belief that it's best to marry the standard Reno
aggressive initial "slow start" behavior with the better performance
of bic or cubic during the subsequent steady state portion of the
TCP session.
I can of course achieve that objective by setting initial_ssthresh
to 0, but perhaps that should be made the default behavior.
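(If the default doesn't change, one persistent per-system way to get
the same effect, assuming the usual module-init-tools setup with the
congestion control algorithms built as modules, would be option lines
in /etc/modprobe.conf:
    options tcp_cubic initial_ssthresh=0
    options tcp_bic initial_ssthresh=0
so the parameter is applied whenever the modules are loaded.)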
-Bill
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
From: SANGTAE HA @ 2007-05-10 17:31 UTC
To: Bill Fink; +Cc: Linux Network Developers
Hi Bill,
Thank you for your good work!
As you mentioned, we've been considering the problem of the lower
aggressiveness of BIC and CUBIC. We are testing BIC and CUBIC across
many bottleneck bandwidths (from 1 Mbps to 1 Gbps) and possibly up to
10 Gbps capacity.
One of the reasons we clamp the "slow start" is that with earlier
kernel versions we sometimes had a lot of blackouts when we increased
the rate exponentially. It was not because of the network but because
of local queues in the kernel.
Now the TCP stack is quite stable and looks like it can handle the
burst following slow start.
So your suggestion to make BIC and CUBIC use the standard slow start
makes sense to me. Actually, we've been considering adopting "limited
slow start" or some new derivative of "slow start". For any update to
these protocols, we've been running an extensive set of test
scenarios, and we've been comparing the results between "slow start",
"limited slow start" and our clamping for many cases. We will probably
update the results for the latest kernel.
Also, documentation for the congestion control parameters is a good
suggestion and not much work for us, so we will collaborate with the
kernel developers to make it available.
Recently, CUBIC was updated to v2.1 to enhance the scalability of the
protocol, and you might want to try it.
Thanks,
Sangtae
> > > > 0 segments retransmited
> > > >
> > > > bic does better but still takes 10 seconds to achieve its maximal
> > > > rate.
> > > >
> > > > Surprisingly venerable reno does the best (only a 10 second test):
> > > >
> > > > [root@lang2 ~]# echo reno > /proc/sys/net/ipv4/tcp_congestion_control
> > > > [root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
> > > > reno
> > > >
> > > > [root@lang2 ~]# nuttcp -T10 -i1 -w60m 192.168.89.15
> > > > 69.9829 MB / 1.01 sec = 583.5822 Mbps
> > > > 844.3870 MB / 1.00 sec = 7083.2808 Mbps
> > > > 862.7568 MB / 1.00 sec = 7237.7342 Mbps
> > > > 859.5725 MB / 1.00 sec = 7210.8981 Mbps
> > > > 860.1365 MB / 1.00 sec = 7215.4487 Mbps
> > > > 865.3940 MB / 1.00 sec = 7259.8434 Mbps
> > > > 863.9678 MB / 1.00 sec = 7247.4942 Mbps
> > > > 864.7493 MB / 1.00 sec = 7254.4634 Mbps
> > > > 864.6660 MB / 1.00 sec = 7253.5183 Mbps
> > > >
> > > > 7816.9375 MB / 10.00 sec = 6554.4883 Mbps 90 %TX 53 %RX
> > > >
> > > > [root@lang2 ~]# netstat -s | grep -i retrans
> > > > 0 segments retransmited
> > > >
> > > > reno achieves its maximal rate in about 2 seconds. This is what I
> > > > would expect from the exponential increase during TCP's initial
> > > > slow start. To achieve 10 Gbps on an 80 ms RTT with 9000 byte
> > > > jumbo frame packets would require:
> > > >
> > > > [root@lang2 ~]# bc -l
> > > > scale=10
> > > > 10^10*0.080/9000/8
> > > > 11111.1111111111
> > > >
> > > > So 11111 packets would have to be in flight during one RTT.
> > > > It should take log2(11111)+1 round trips to achieve 10 Gbps
> > > > (note bc's l() function is logE);
> > > >
> > > > [root@lang2 ~]# bc -l
> > > > scale=10
> > > > l(11111)/l(2)+1
> > > > 14.4397010470
> > > >
> > > > And 15 round trips at 80 ms each gives a total time of:
> > > >
> > > > [root@lang2 ~]# bc -l
> > > > scale=10
> > > > 15*0.080
> > > > 1.200
> > > >
> > > > So if there is no packet loss (which there wasn't), it should only
> > > > take about 1.2 seconds to achieve 10 Gbps. Only TCP reno is in
> > > > this ballpark range.
> > > >
> > > > Now it's quite possible there's something basic I don't understand,
> > > > such as some /proc/sys/net/ipv4/tcp_* or /sys/module/tcp_*/parameters/*
> > > > parameter I've overlooked, in which case feel free to just refer me
> > > > to any suitable documentation.
> > > >
> > > > I also checked the Changelog for 2.6.20.{8,9,10,11} to see if there
> > > > might be any relevant recent bug fixes, but the only thing that seemed
> > > > even remotely related was the 2.6.20.11 bug fix for the tcp_mem setting.
> > > > Although this did affect me, I manually adjusted the tcp_mem settings
> > > > before running these tests.
> > > >
> > > > [root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_mem 393216 524288 786432
> > > >
> > > > The test setup was:
> > > >
> > > > +-------+ +-------+ +-------+
> > > > | |eth2 eth2| |eth3 eth2| |
> > > > | lang2 |-----10-GigE-----| lang1 |-----10-GigE-----| lang3 |
> > > > | | | | | |
> > > > +-------+ +-------+ +-------+
> > > > 192.168.88.14 192.168.88.13/192.168.89.13 192.168.89.15
> > > >
> > > > All three systems are dual 2.8 GHz AMD Opteron Processor 254 systems
> > > > with 4 GB memory and all running the 2.6.20.7 kernel. All the NICs
> > > > are Myricom PCI-E 10-GigE NICs.
> > > >
> > > > The 80 ms delay was introduced by applying netem to lang1's eth3
> > > > interface:
> > > >
> > > > [root@lang1 ~]# tc qdisc add dev eth3 root netem delay 80ms limit 20000
> > > > [root@lang1 ~]# tc qdisc show
> > > > qdisc pfifo_fast 0: dev eth2 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> > > > qdisc netem 8022: dev eth3 limit 20000 delay 80.0ms reorder 100%
> > > >
> > > > Experimentation determined that netem running on lang1 could handle
> > > > about 8-8.5 Gbps without dropping packets.
> > > >
> > > > 8.5 Gbps UDP test:
> > > >
> > > > [root@lang2 ~]# nuttcp -u -Ri8.5g -w20m 192.168.89.15
> > > > 10136.4844 MB / 10.01 sec = 8497.8205 Mbps 100 %TX 56 %RX 0 / 1297470 drop/pkt 0.00 %loss
> > > >
> > > > Increasing the rate to 9 Gbps would give some loss:
> > > >
> > > > [root@lang2 ~]# nuttcp -u -Ri9g -w20m 192.168.89.15
> > > > 10219.1719 MB / 10.01 sec = 8560.2455 Mbps 100 %TX 58 %RX 65500 / 1373554 drop/pkt 4.77 %loss
> > > >
> > > > Based on this, the specification of a 60 MB TCP socket buffer size was
> > > > used during the TCP tests to avoid overstressing the lang1 netem delay
> > > > emulator (to avoid dropping any packets).
> > > >
> > > > Simple ping through the lang1 netem delay emulator:
> > > >
> > > > [root@lang2 ~]# ping -c 5 192.168.89.15
> > > > PING 192.168.89.15 (192.168.89.15) 56(84) bytes of data.
> > > > 64 bytes from 192.168.89.15: icmp_seq=1 ttl=63 time=80.4 ms
> > > > 64 bytes from 192.168.89.15: icmp_seq=2 ttl=63 time=82.1 ms
> > > > 64 bytes from 192.168.89.15: icmp_seq=3 ttl=63 time=82.1 ms
> > > > 64 bytes from 192.168.89.15: icmp_seq=4 ttl=63 time=82.1 ms
> > > > 64 bytes from 192.168.89.15: icmp_seq=5 ttl=63 time=82.1 ms
> > > >
> > > > --- 192.168.89.15 ping statistics ---
> > > > 5 packets transmitted, 5 received, 0% packet loss, time 4014ms
> > > > rtt min/avg/max/mdev = 80.453/81.804/82.173/0.722 ms
> > > >
> > > > And a bidirectional traceroute (using the "nuttcp -xt" option):
> > > >
> > > > [root@lang2 ~]# nuttcp -xt 192.168.89.15
> > > > traceroute to 192.168.89.15 (192.168.89.15), 30 hops max, 40 byte packets
> > > > 1 192.168.88.13 (192.168.88.13) 0.141 ms 0.125 ms 0.125 ms
> > > > 2 192.168.89.15 (192.168.89.15) 82.112 ms 82.039 ms 82.541 ms
> > > >
> > > > traceroute to 192.168.88.14 (192.168.88.14), 30 hops max, 40 byte packets
> > > > 1 192.168.89.13 (192.168.89.13) 81.101 ms 83.001 ms 82.999 ms
> > > > 2 192.168.88.14 (192.168.88.14) 83.005 ms 82.985 ms 82.978 ms
> > > >
> > > > So is this a real bug in cubic (and bic), or do I just not understand
> > > > something basic.
> > > >
> > > > -Bill
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-05-10 15:32 ` Bill Fink
2007-05-10 17:31 ` SANGTAE HA
@ 2007-05-10 18:39 ` rhee
2007-05-10 20:35 ` David Miller
1 sibling, 1 reply; 21+ messages in thread
From: rhee @ 2007-05-10 18:39 UTC (permalink / raw)
To: Bill Fink; +Cc: Bill Fink, SANGTAE HA, Linux Network Developers
Bill,
Could you test with the latest version of CUBIC? The version you tested
is not the latest one.
Injong
> As a followup, I ran a somewhat interesting test. I increased the
> requested socket buffer size to 100 MB, which is sufficient to
> overstress the capabilities of the netem delay emulator (which can
> handle up to about 8.5 Gbps). This causes some packet loss when
> using the standard Reno agressive "slow start".
>
> [root@lang2 ~]# netstat -s | grep -i retrans
> 0 segments retransmited
>
> [root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
> cubic
>
> [root@lang2 ~]# echo 0 > /sys/module/tcp_cubic/parameters/initial_ssthresh
> [root@lang2 ~]# cat /sys/module/tcp_cubic/parameters/initial_ssthresh
> 0
>
> [root@lang2 ~]# nuttcp -T10 -i1 -w100m 192.168.89.15
> 69.9829 MB / 1.00 sec = 585.1895 Mbps
> 311.9521 MB / 1.00 sec = 2616.9019 Mbps
> 0.2332 MB / 1.00 sec = 1.9559 Mbps
> 37.9907 MB / 1.00 sec = 318.6912 Mbps
> 702.7856 MB / 1.00 sec = 5895.4640 Mbps
> 817.0142 MB / 1.00 sec = 6853.7006 Mbps
> 820.3125 MB / 1.00 sec = 6881.3626 Mbps
> 820.5625 MB / 1.00 sec = 6883.2601 Mbps
> 813.0125 MB / 1.00 sec = 6820.2678 Mbps
> 815.7756 MB / 1.00 sec = 6842.8867 Mbps
>
> 5253.2500 MB / 10.07 sec = 4378.0109 Mbps 72 %TX 35 %RX
>
> [root@lang2 ~]# netstat -s | grep -i retrans
> 464 segments retransmited
> 464 fast retransmits
>
> Contrast that with the default behavior.
>
> [root@lang2 ~]# echo 100 >
> /sys/module/tcp_cubic/parameters/initial_ssthresh
> [root@lang2 ~]# cat /sys/module/tcp_cubic/parameters/initial_ssthresh
> 100
>
> [root@lang2 ~]# nuttcp -T10 -i1 -w100m 192.168.89.15
> 6.8188 MB / 1.00 sec = 57.1670 Mbps
> 16.2097 MB / 1.00 sec = 135.9795 Mbps
> 25.4810 MB / 1.00 sec = 213.7525 Mbps
> 38.7256 MB / 1.00 sec = 324.8580 Mbps
> 49.7998 MB / 1.00 sec = 417.7565 Mbps
> 62.5745 MB / 1.00 sec = 524.9189 Mbps
> 78.6646 MB / 1.00 sec = 659.8947 Mbps
> 98.9673 MB / 1.00 sec = 830.2086 Mbps
> 124.3201 MB / 1.00 sec = 1038.7288 Mbps
> 156.1584 MB / 1.00 sec = 1309.9730 Mbps
>
> 775.2500 MB / 10.64 sec = 611.0181 Mbps 7 %TX 7 %RX
>
> [root@lang2 ~]# netstat -s | grep -i retrans
> 464 segments retransmited
> 464 fast retransmits
>
> The standard Reno aggressive "slow start" gets much better overall
> performance even in this case, because even though the default cubic
> behavior manages to avoid the "congestion" event, its lack of
> aggressiveness during the initial slow start period puts it at a
> major disadvantage. It would take a long time for the tortoise
> in this race to catch up with the hare.
>
> It seems best to ramp up as quickly as possible to any congestion,
> using the standard Reno aggressive "slow start" behavior, and then
> let the power of cubic take over from there, getting the best of
> both worlds.
>
> For completeness here's the same test with bic.
>
> First with the standard Reno aggessive "slow start" behavior:
>
> [root@lang2 ~]# netstat -s | grep -i retrans
> 464 segments retransmited
> 464 fast retransmits
>
> [root@lang2 ~]# echo bic > /proc/sys/net/ipv4/tcp_congestion_control
> [root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
> bic
>
> [root@lang2 ~]# echo 0 > /sys/module/tcp_bic/parameters/initial_ssthresh
> [root@lang2 ~]# cat /sys/module/tcp_bic/parameters/initial_ssthresh
> 0
>
> [root@lang2 ~]# nuttcp -T10 -i1 -w100m 192.168.89.15
> 69.9829 MB / 1.00 sec = 585.2770 Mbps
> 302.3921 MB / 1.00 sec = 2536.7045 Mbps
> 0.0000 MB / 1.00 sec = 0.0000 Mbps
> 0.7520 MB / 1.00 sec = 6.3079 Mbps
> 114.1570 MB / 1.00 sec = 957.5914 Mbps
> 792.9634 MB / 1.00 sec = 6651.5131 Mbps
> 845.9099 MB / 1.00 sec = 7096.4182 Mbps
> 865.0825 MB / 1.00 sec = 7257.1575 Mbps
> 890.4663 MB / 1.00 sec = 7470.0567 Mbps
> 911.5039 MB / 1.00 sec = 7646.3560 Mbps
>
> 4829.9375 MB / 10.05 sec = 4033.0191 Mbps 76 %TX 32 %RX
>
> [root@lang2 ~]# netstat -s | grep -i retrans
> 1093 segments retransmited
> 1093 fast retransmits
>
> And then with the default bic behavior:
>
> [root@lang2 ~]# echo 100 > /sys/module/tcp_bic/parameters/initial_ssthresh
> [root@lang2 ~]# cat /sys/module/tcp_bic/parameters/initial_ssthresh
> 100
>
> [root@lang2 ~]# nuttcp -T10 -i1 -w100m 192.168.89.15
> 9.9548 MB / 1.00 sec = 83.1028 Mbps
> 47.5439 MB / 1.00 sec = 398.8351 Mbps
> 107.6147 MB / 1.00 sec = 902.7506 Mbps
> 183.9038 MB / 1.00 sec = 1542.7124 Mbps
> 313.4875 MB / 1.00 sec = 2629.7689 Mbps
> 531.0012 MB / 1.00 sec = 4454.3032 Mbps
> 841.7866 MB / 1.00 sec = 7061.5098 Mbps
> 837.5867 MB / 1.00 sec = 7026.4041 Mbps
> 834.8889 MB / 1.00 sec = 7003.3667 Mbps
>
> 4539.6250 MB / 10.00 sec = 3806.5410 Mbps 50 %TX 34 %RX
>
> [root@lang2 ~]# netstat -s | grep -i retrans
> 1093 segments retransmited
> 1093 fast retransmits
>
> bic actually does much better than cubic for this scenario, and only
> loses out to the standard Reno aggressive "slow start" behavior by a
> small amount. Of course in the case of no congestion, it loses out
> by a much more significant margin.
>
> This reinforces my belief that it's best to marry the standard Reno
> aggressive initial "slow start" behavior with the better performance
> of bic or cubic during the subsequent steady state portion of the
> TCP session.
>
> I can of course achieve that objective by setting initial_ssthresh
> to 0, but perhaps that should be made the default behavior.
>
> -Bill
>
>
>
> On Wed, 9 May 2007, I wrote:
>
>> Hi Sangtae,
>>
>> On Tue, 8 May 2007, SANGTAE HA wrote:
>>
>> > Hi Bill,
>> >
>> > At this time, BIC and CUBIC use a less aggressive slow start than
>> > other protocols. Because we observed "slow start" is somewhat
>> > aggressive and introduced a lot of packet losses. This may be changed
>> > to standard "slow start" in later version of BIC and CUBIC, but, at
>> > this time, we still using a modified slow start.
>>
>> "slow start" is somewhat of a misnomer. However, I'd argue in favor
>> of using the standard "slow start" for BIC and CUBIC as the default.
>> Is the rationale for using a less agressive "slow start" to be gentler
>> to certain receivers, which possibly can't handle a rapidly increasing
>> initial burst of packets (and the resultant necessary allocation of
>> system resources)? Or is it related to encountering actual network
>> congestion during the initial "slow start" period, and how well that
>> is responded to?
>>
>> > So, as you observed, this modified slow start behavior may slow for
>> > 10G testing. You can alleviate this for your 10G testing by changing
>> > BIC and CUBIC to use a standard "slow start" by loading these modules
>> > with "initial_ssthresh=0".
>>
>> I saw the initial_ssthresh parameter, but didn't know what it did or
>> even what its units were. I saw the default value was 100 and tried
>> increasing it, but I didn't think to try setting it to 0.
>>
>> [root@lang2 ~]# grep -r initial_ssthresh
>> /usr/src/kernels/linux-2.6.20.7/Documentation/
>> [root@lang2 ~]#
>>
>> It would be good to have some documentation for these bic and cubic
>> parameters similar to the documentation in ip-sysctl.txt for the
>> /proc/sys/net/ipv[46]/* variables (I know, I know, I should just
>> "use the source").
>>
>> Is it expected that the cubic "slow start" is that much less agressive
>> than the bic "slow start" (from 10 secs to max rate for bic in my test
>> to 25 secs to max rate for cubic). This could be considered a
>> performance
>> regression since the default TCP was changed from bic to cubic.
>>
>> In any event, I'm now happy as setting initial_ssthresh to 0 works
>> well for me.
>>
>> [root@lang2 ~]# netstat -s | grep -i retrans
>> 0 segments retransmited
>>
>> [root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
>> cubic
>>
>> [root@lang2 ~]# cat /sys/module/tcp_cubic/parameters/initial_ssthresh
>> 0
>>
>> [root@lang2 ~]# nuttcp -T10 -i1 -w60m 192.168.89.15
>> 69.9829 MB / 1.00 sec = 584.2065 Mbps
>> 843.1467 MB / 1.00 sec = 7072.9052 Mbps
>> 844.3655 MB / 1.00 sec = 7082.6544 Mbps
>> 842.2671 MB / 1.00 sec = 7065.7169 Mbps
>> 839.9204 MB / 1.00 sec = 7045.8335 Mbps
>> 840.1780 MB / 1.00 sec = 7048.3114 Mbps
>> 834.1475 MB / 1.00 sec = 6997.4270 Mbps
>> 835.5972 MB / 1.00 sec = 7009.3148 Mbps
>> 835.8152 MB / 1.00 sec = 7011.7537 Mbps
>> 830.9333 MB / 1.00 sec = 6969.9281 Mbps
>>
>> 7617.1875 MB / 10.01 sec = 6386.2622 Mbps 90 %TX 46 %RX
>>
>> [root@lang2 ~]# netstat -s | grep -i retrans
>> 0 segments retransmited
>>
>> -Thanks a lot!
>>
>> -Bill
>>
>>
>>
>> > Regards,
>> > Sangtae
>> >
>> >
>> > On 5/6/07, Bill Fink <billfink@mindspring.com> wrote:
>> > > The initial TCP slow start on 2.6.20.7 cubic (and to a lesser
>> > > extent bic) seems to be way too slow. With an ~80 ms RTT, this
>> > > is what cubic delivers (thirty second test with one second interval
>> > > reporting and specifying a socket buffer size of 60 MB):
>> > >
>> > > [root@lang2 ~]# netstat -s | grep -i retrans
>> > > 0 segments retransmited
>> > >
>> > > [root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
>> > > cubic
>> > >
>> > > [root@lang2 ~]# nuttcp -T30 -i1 -w60m 192.168.89.15
>> > > 6.8188 MB / 1.00 sec = 57.0365 Mbps
>> > > 16.2097 MB / 1.00 sec = 135.9824 Mbps
>> > > 25.4553 MB / 1.00 sec = 213.5420 Mbps
>> > > 35.5127 MB / 1.00 sec = 297.9119 Mbps
>> > > 43.0066 MB / 1.00 sec = 360.7770 Mbps
>> > > 50.3210 MB / 1.00 sec = 422.1370 Mbps
>> > > 59.0796 MB / 1.00 sec = 495.6124 Mbps
>> > > 69.1284 MB / 1.00 sec = 579.9098 Mbps
>> > > 76.6479 MB / 1.00 sec = 642.9130 Mbps
>> > > 90.6189 MB / 1.00 sec = 760.2835 Mbps
>> > > 109.4348 MB / 1.00 sec = 918.0361 Mbps
>> > > 128.3105 MB / 1.00 sec = 1076.3813 Mbps
>> > > 150.4932 MB / 1.00 sec = 1262.4686 Mbps
>> > > 175.9229 MB / 1.00 sec = 1475.7965 Mbps
>> > > 205.9412 MB / 1.00 sec = 1727.6150 Mbps
>> > > 240.8130 MB / 1.00 sec = 2020.1504 Mbps
>> > > 282.1790 MB / 1.00 sec = 2367.1644 Mbps
>> > > 318.1841 MB / 1.00 sec = 2669.1349 Mbps
>> > > 372.6814 MB / 1.00 sec = 3126.1687 Mbps
>> > > 440.8411 MB / 1.00 sec = 3698.5200 Mbps
>> > > 524.8633 MB / 1.00 sec = 4403.0220 Mbps
>> > > 614.3542 MB / 1.00 sec = 5153.7367 Mbps
>> > > 718.9917 MB / 1.00 sec = 6031.5386 Mbps
>> > > 829.0474 MB / 1.00 sec = 6954.6438 Mbps
>> > > 867.3289 MB / 1.00 sec = 7275.9510 Mbps
>> > > 865.7759 MB / 1.00 sec = 7262.9813 Mbps
>> > > 864.4795 MB / 1.00 sec = 7251.7071 Mbps
>> > > 864.5425 MB / 1.00 sec = 7252.8519 Mbps
>> > > 867.3372 MB / 1.00 sec = 7246.9232 Mbps
>> > >
>> > > 10773.6875 MB / 30.00 sec = 3012.3936 Mbps 38 %TX 25 %RX
>> > >
>> > > [root@lang2 ~]# netstat -s | grep -i retrans
>> > > 0 segments retransmited
>> > >
>> > > It takes 25 seconds for cubic TCP to reach its maximal rate.
>> > > Note that there were no TCP retransmissions (no congestion
>> > > experienced).
>> > >
>> > > Now with bic (only 20 second test this time):
>> > >
>> > > [root@lang2 ~]# echo bic > /proc/sys/net/ipv4/tcp_congestion_control
>> > > [root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
>> > > bic
>> > >
>> > > [root@lang2 ~]# nuttcp -T20 -i1 -w60m 192.168.89.15
>> > > 9.9548 MB / 1.00 sec = 83.1497 Mbps
>> > > 47.2021 MB / 1.00 sec = 395.9762 Mbps
>> > > 92.4304 MB / 1.00 sec = 775.3889 Mbps
>> > > 134.3774 MB / 1.00 sec = 1127.2758 Mbps
>> > > 194.3286 MB / 1.00 sec = 1630.1987 Mbps
>> > > 280.0598 MB / 1.00 sec = 2349.3613 Mbps
>> > > 404.3201 MB / 1.00 sec = 3391.8250 Mbps
>> > > 559.1594 MB / 1.00 sec = 4690.6677 Mbps
>> > > 792.7100 MB / 1.00 sec = 6650.0257 Mbps
>> > > 857.2241 MB / 1.00 sec = 7190.6942 Mbps
>> > > 852.6912 MB / 1.00 sec = 7153.3283 Mbps
>> > > 852.6968 MB / 1.00 sec = 7153.2538 Mbps
>> > > 851.3162 MB / 1.00 sec = 7141.7575 Mbps
>> > > 851.4927 MB / 1.00 sec = 7143.0240 Mbps
>> > > 850.8782 MB / 1.00 sec = 7137.8762 Mbps
>> > > 852.7119 MB / 1.00 sec = 7153.2949 Mbps
>> > > 852.3879 MB / 1.00 sec = 7150.2982 Mbps
>> > > 850.2163 MB / 1.00 sec = 7132.5165 Mbps
>> > > 849.8340 MB / 1.00 sec = 7129.0026 Mbps
>> > >
>> > > 11882.7500 MB / 20.00 sec = 4984.0068 Mbps 67 %TX 41 %RX
>> > >
>> > > [root@lang2 ~]# netstat -s | grep -i retrans
>> > > 0 segments retransmited
>> > >
>> > > bic does better but still takes 10 seconds to achieve its maximal
>> > > rate.
>> > >
>> > > Surprisingly venerable reno does the best (only a 10 second test):
>> > >
>> > > [root@lang2 ~]# echo reno >
>> /proc/sys/net/ipv4/tcp_congestion_control
>> > > [root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
>> > > reno
>> > >
>> > > [root@lang2 ~]# nuttcp -T10 -i1 -w60m 192.168.89.15
>> > > 69.9829 MB / 1.01 sec = 583.5822 Mbps
>> > > 844.3870 MB / 1.00 sec = 7083.2808 Mbps
>> > > 862.7568 MB / 1.00 sec = 7237.7342 Mbps
>> > > 859.5725 MB / 1.00 sec = 7210.8981 Mbps
>> > > 860.1365 MB / 1.00 sec = 7215.4487 Mbps
>> > > 865.3940 MB / 1.00 sec = 7259.8434 Mbps
>> > > 863.9678 MB / 1.00 sec = 7247.4942 Mbps
>> > > 864.7493 MB / 1.00 sec = 7254.4634 Mbps
>> > > 864.6660 MB / 1.00 sec = 7253.5183 Mbps
>> > >
>> > > 7816.9375 MB / 10.00 sec = 6554.4883 Mbps 90 %TX 53 %RX
>> > >
>> > > [root@lang2 ~]# netstat -s | grep -i retrans
>> > > 0 segments retransmited
>> > >
>> > > reno achieves its maximal rate in about 2 seconds. This is what I
>> > > would expect from the exponential increase during TCP's initial
>> > > slow start. To achieve 10 Gbps on an 80 ms RTT with 9000 byte
>> > > jumbo frame packets would require:
>> > >
>> > > [root@lang2 ~]# bc -l
>> > > scale=10
>> > > 10^10*0.080/9000/8
>> > > 11111.1111111111
>> > >
>> > > So 11111 packets would have to be in flight during one RTT.
>> > > It should take log2(11111)+1 round trips to achieve 10 Gbps
>> > > (note bc's l() function is logE);
>> > >
>> > > [root@lang2 ~]# bc -l
>> > > scale=10
>> > > l(11111)/l(2)+1
>> > > 14.4397010470
>> > >
>> > > And 15 round trips at 80 ms each gives a total time of:
>> > >
>> > > [root@lang2 ~]# bc -l
>> > > scale=10
>> > > 15*0.080
>> > > 1.200
>> > >
>> > > So if there is no packet loss (which there wasn't), it should only
>> > > take about 1.2 seconds to achieve 10 Gbps. Only TCP reno is in
>> > > this ballpark range.
>> > >
>> > > Now it's quite possible there's something basic I don't understand,
>> > > such as some /proc/sys/net/ipv4/tcp_* or
>> /sys/module/tcp_*/parameters/*
>> > > parameter I've overlooked, in which case feel free to just refer me
>> > > to any suitable documentation.
>> > >
>> > > I also checked the Changelog for 2.6.20.{8,9,10,11} to see if there
>> > > might be any relevant recent bug fixes, but the only thing that
>> seemed
>> > > even remotely related was the 2.6.20.11 bug fix for the tcp_mem
>> setting.
>> > > Although this did affect me, I manually adjusted the tcp_mem
>> settings
>> > > before running these tests.
>> > >
>> > > [root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_mem
>> 393216 524288 786432
>> > >
>> > > The test setup was:
>> > >
>> > > +-------+ +-------+
>> +-------+
>> > > | |eth2 eth2| |eth3 eth2|
>> |
>> > > | lang2 |-----10-GigE-----| lang1 |-----10-GigE-----| lang3
>> |
>> > > | | | | |
>> |
>> > > +-------+ +-------+
>> +-------+
>> > > 192.168.88.14 192.168.88.13/192.168.89.13
>> 192.168.89.15
>> > >
>> > > All three systems are dual 2.8 GHz AMD Opteron Processor 254 systems
>> > > with 4 GB memory and all running the 2.6.20.7 kernel. All the NICs
>> > > are Myricom PCI-E 10-GigE NICs.
>> > >
>> > > The 80 ms delay was introduced by applying netem to lang1's eth3
>> > > interface:
>> > >
>> > > [root@lang1 ~]# tc qdisc add dev eth3 root netem delay 80ms limit
>> 20000
>> > > [root@lang1 ~]# tc qdisc show
>> > > qdisc pfifo_fast 0: dev eth2 root bands 3 priomap 1 2 2 2 1 2 0 0 1
>> 1 1 1 1 1 1 1
>> > > qdisc netem 8022: dev eth3 limit 20000 delay 80.0ms reorder 100%
>> > >
>> > > Experimentation determined that netem running on lang1 could handle
>> > > about 8-8.5 Gbps without dropping packets.
>> > >
>> > > 8.5 Gbps UDP test:
>> > >
>> > > [root@lang2 ~]# nuttcp -u -Ri8.5g -w20m 192.168.89.15
>> > > 10136.4844 MB / 10.01 sec = 8497.8205 Mbps 100 %TX 56 %RX 0 /
>> 1297470 drop/pkt 0.00 %loss
>> > >
>> > > Increasing the rate to 9 Gbps would give some loss:
>> > >
>> > > [root@lang2 ~]# nuttcp -u -Ri9g -w20m 192.168.89.15
>> > > 10219.1719 MB / 10.01 sec = 8560.2455 Mbps 100 %TX 58 %RX 65500 /
>> 1373554 drop/pkt 4.77 %loss
>> > >
>> > > Based on this, the specification of a 60 MB TCP socket buffer size
>> was
>> > > used during the TCP tests to avoid overstressing the lang1 netem
>> delay
>> > > emulator (to avoid dropping any packets).
>> > >
>> > > Simple ping through the lang1 netem delay emulator:
>> > >
>> > > [root@lang2 ~]# ping -c 5 192.168.89.15
>> > > PING 192.168.89.15 (192.168.89.15) 56(84) bytes of data.
>> > > 64 bytes from 192.168.89.15: icmp_seq=1 ttl=63 time=80.4 ms
>> > > 64 bytes from 192.168.89.15: icmp_seq=2 ttl=63 time=82.1 ms
>> > > 64 bytes from 192.168.89.15: icmp_seq=3 ttl=63 time=82.1 ms
>> > > 64 bytes from 192.168.89.15: icmp_seq=4 ttl=63 time=82.1 ms
>> > > 64 bytes from 192.168.89.15: icmp_seq=5 ttl=63 time=82.1 ms
>> > >
>> > > --- 192.168.89.15 ping statistics ---
>> > > 5 packets transmitted, 5 received, 0% packet loss, time 4014ms
>> > > rtt min/avg/max/mdev = 80.453/81.804/82.173/0.722 ms
>> > >
>> > > And a bidirectional traceroute (using the "nuttcp -xt" option):
>> > >
>> > > [root@lang2 ~]# nuttcp -xt 192.168.89.15
>> > > traceroute to 192.168.89.15 (192.168.89.15), 30 hops max, 40 byte
>> packets
>> > > 1 192.168.88.13 (192.168.88.13) 0.141 ms 0.125 ms 0.125 ms
>> > > 2 192.168.89.15 (192.168.89.15) 82.112 ms 82.039 ms 82.541 ms
>> > >
>> > > traceroute to 192.168.88.14 (192.168.88.14), 30 hops max, 40 byte
>> packets
>> > > 1 192.168.89.13 (192.168.89.13) 81.101 ms 83.001 ms 82.999 ms
>> > > 2 192.168.88.14 (192.168.88.14) 83.005 ms 82.985 ms 82.978 ms
>> > >
>> > > So is this a real bug in cubic (and bic), or do I just not
>> understand
>> > > something basic.
>> > >
>> > > -Bill
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-05-10 18:39 ` rhee
@ 2007-05-10 20:35 ` David Miller
2007-05-10 20:45 ` Stephen Hemminger
0 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2007-05-10 20:35 UTC (permalink / raw)
To: rhee; +Cc: billfink, sangtae.ha, netdev
From: rhee@ncsu.edu
Date: Thu, 10 May 2007 14:39:25 -0400 (EDT)
>
> Bill,
> Could you test with the lastest version of CUBIC? this is not the latest
> version of it you tested.
Rhee-sangsang-nim, it might be a lot easier for people if you provide
a patch against the current tree for users to test instead of
constantly pointing them to your web site.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-05-10 20:35 ` David Miller
@ 2007-05-10 20:45 ` Stephen Hemminger
2007-05-10 20:57 ` Injong Rhee
0 siblings, 1 reply; 21+ messages in thread
From: Stephen Hemminger @ 2007-05-10 20:45 UTC (permalink / raw)
To: David Miller; +Cc: rhee, billfink, sangtae.ha, netdev
On Thu, 10 May 2007 13:35:22 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:
> From: rhee@ncsu.edu
> Date: Thu, 10 May 2007 14:39:25 -0400 (EDT)
>
> >
> > Bill,
> > Could you test with the lastest version of CUBIC? this is not the latest
> > version of it you tested.
>
> Rhee-sangsang-nim, it might be a lot easier for people if you provide
> a patch against the current tree for users to test instead of
> constantly pointing them to your web site.
> -
The 2.6.22 kernel should have the latest version, as far as I know.
There was a small patch from 2.6.21 that went in.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-05-10 20:45 ` Stephen Hemminger
@ 2007-05-10 20:57 ` Injong Rhee
2007-05-12 16:07 ` Bill Fink
0 siblings, 1 reply; 21+ messages in thread
From: Injong Rhee @ 2007-05-10 20:57 UTC (permalink / raw)
To: Stephen Hemminger, David Miller; +Cc: rhee, billfink, sangtae.ha, netdev
Oops. I thought Bill was using 2.6.20 instead of 2.6.22, which should contain
our latest update.
Regarding slow start behavior, the latest version should not change that, though.
I think it would be OK to change the slow start of bic and cubic to the
default slow start. But what we observed is that when the BDP is large,
doubling cwnd is really overkill. Consider increasing from
1024 to 2048 packets; maybe the right target is somewhere between them. That
potentially flushes a large number of packets into the network. This was the
original motivation to change slow start from the default into a more gentle
version. But I see the point that Bill is raising. We are working on
improving this behavior in our lab. We will get back to this topic in a
couple of weeks, after we finish our testing and produce a patch.
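To make that concrete, here is a rough back-of-the-envelope sketch (not the
kernel code; the ~11111-segment target is the figure Bill computed earlier for
10 Gbps at 80 ms with 9000-byte frames). It shows how quickly standard slow
start doubles up to a full 10-GigE window, and also how large the last few
per-RTT bursts become, which is the behavior we were trying to avoid:

target = 11111   # segments in flight for 10 Gbps at 80 ms RTT (9000-byte jumbo frames)
cwnd, rtt = 1, 0

while cwnd < target:
    nxt = min(cwnd * 2, target)
    print("RTT %2d: cwnd %5d -> %5d (+%d packets injected in one RTT)"
          % (rtt, cwnd, nxt, nxt - cwnd))
    cwnd, rtt = nxt, rtt + 1

print("reached %d segments after %d round trips (~%.2f sec at 80 ms RTT)"
      % (cwnd, rtt, rtt * 0.080))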
----- Original Message -----
From: "Stephen Hemminger" <shemminger@linux-foundation.org>
To: "David Miller" <davem@davemloft.net>
Cc: <rhee@ncsu.edu>; <billfink@mindspring.com>; <sangtae.ha@gmail.com>;
<netdev@vger.kernel.org>
Sent: Thursday, May 10, 2007 4:45 PM
Subject: Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
> On Thu, 10 May 2007 13:35:22 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
>
>> From: rhee@ncsu.edu
>> Date: Thu, 10 May 2007 14:39:25 -0400 (EDT)
>>
>> >
>> > Bill,
>> > Could you test with the lastest version of CUBIC? this is not the
>> > latest
>> > version of it you tested.
>>
>> Rhee-sangsang-nim, it might be a lot easier for people if you provide
>> a patch against the current tree for users to test instead of
>> constantly pointing them to your web site.
>> -
>
> The 2.6.22 version should have the latest version, that I know of.
> There was small patch from 2.6.21 that went in.
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-05-10 20:57 ` Injong Rhee
@ 2007-05-12 16:07 ` Bill Fink
2007-05-12 16:45 ` SANGTAE HA
0 siblings, 1 reply; 21+ messages in thread
From: Bill Fink @ 2007-05-12 16:07 UTC (permalink / raw)
To: Injong Rhee; +Cc: Stephen Hemminger, David Miller, rhee, sangtae.ha, netdev
On Thu, 10 May 2007, Injong Rhee wrote:
> Oops. I thought Bill was using 2.6.20 instead of 2.6.22 which should contain
> our latest update.
I am using 2.6.20.7.
> Regarding slow start behavior, the latest version should not change though.
> I think it would be ok to change the slow start of bic and cubic to the
> default slow start. But what we observed is that when BDP is large,
> increasing cwnd by two times is really an overkill. consider increasing from
> 1024 into 2048 packets..maybe the target is somewhere between them. We have
> potentially a large number of packets flushed into the network. That was the
> original motivation to change slow start from the default into a more gentle
> version. But I see the point that Bill is raising. We are working on
> improving this behavior in our lab. We will get back to this topic in a
> couple of weeks after we finish our testing and produce a patch.
Is it feasible to replace the version of cubic in 2.6.20.7 with the
new 2.1 version of cubic without changing the rest of the kernel, or
are there kernel changes/dependencies that would prevent that?
I've tried building and running a 2.6.21-git13 kernel, but am having
some difficulties. I will be away the rest of the weekend so won't be
able to get back to this until Monday.
-Bill
P.S. When getting into the 10 Gbps range, I'm not sure there's
any way to avoid the types of large increases during "slow start"
that you mention, if you want to achieve those kinds of data
rates.
> ----- Original Message -----
> From: "Stephen Hemminger" <shemminger@linux-foundation.org>
> To: "David Miller" <davem@davemloft.net>
> Cc: <rhee@ncsu.edu>; <billfink@mindspring.com>; <sangtae.ha@gmail.com>;
> <netdev@vger.kernel.org>
> Sent: Thursday, May 10, 2007 4:45 PM
> Subject: Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
>
>
> > On Thu, 10 May 2007 13:35:22 -0700 (PDT)
> > David Miller <davem@davemloft.net> wrote:
> >
> >> From: rhee@ncsu.edu
> >> Date: Thu, 10 May 2007 14:39:25 -0400 (EDT)
> >>
> >> >
> >> > Bill,
> >> > Could you test with the lastest version of CUBIC? this is not the
> >> > latest
> >> > version of it you tested.
> >>
> >> Rhee-sangsang-nim, it might be a lot easier for people if you provide
> >> a patch against the current tree for users to test instead of
> >> constantly pointing them to your web site.
> >> -
> >
> > The 2.6.22 version should have the latest version, that I know of.
> > There was small patch from 2.6.21 that went in.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-05-12 16:07 ` Bill Fink
@ 2007-05-12 16:45 ` SANGTAE HA
2007-05-16 6:44 ` Bill Fink
0 siblings, 1 reply; 21+ messages in thread
From: SANGTAE HA @ 2007-05-12 16:45 UTC (permalink / raw)
To: Bill Fink; +Cc: Injong Rhee, Stephen Hemminger, David Miller, rhee, netdev
[-- Attachment #1: Type: text/plain, Size: 3058 bytes --]
Hi Bill,
This is the small patch that has been applied to 2.6.22.
Also, there is "limited slow start", an experimental RFC (RFC 3742),
intended to limit this large increase during slow start.
But your kernel might not have it. Please check whether there is a sysctl
variable "tcp_max_ssthresh".
Thanks,
Sangtae
On 5/12/07, Bill Fink <billfink@mindspring.com> wrote:
> On Thu, 10 May 2007, Injong Rhee wrote:
>
> > Oops. I thought Bill was using 2.6.20 instead of 2.6.22 which should contain
> > our latest update.
>
> I am using 2.6.20.7.
>
> > Regarding slow start behavior, the latest version should not change though.
> > I think it would be ok to change the slow start of bic and cubic to the
> > default slow start. But what we observed is that when BDP is large,
> > increasing cwnd by two times is really an overkill. consider increasing from
> > 1024 into 2048 packets..maybe the target is somewhere between them. We have
> > potentially a large number of packets flushed into the network. That was the
> > original motivation to change slow start from the default into a more gentle
> > version. But I see the point that Bill is raising. We are working on
> > improving this behavior in our lab. We will get back to this topic in a
> > couple of weeks after we finish our testing and produce a patch.
>
> Is it feasible to replace the version of cubic in 2.6.20.7 with the
> new 2.1 version of cubic without changing the rest of the kernel, or
> are there kernel changes/dependencies that would prevent that?
>
> I've tried building and running a 2.6.21-git13 kernel, but am having
> some difficulties. I will be away the rest of the weekend so won't be
> able to get back to this until Monday.
>
> -Bill
>
> P.S. When getting into the the 10 Gbps range, I'm not sure there's
> any way to avoid the types of large increases during "slow start"
> that you mention, if you want to achieve those kinds of data
> rates.
>
>
>
> > ----- Original Message -----
> > From: "Stephen Hemminger" <shemminger@linux-foundation.org>
> > To: "David Miller" <davem@davemloft.net>
> > Cc: <rhee@ncsu.edu>; <billfink@mindspring.com>; <sangtae.ha@gmail.com>;
> > <netdev@vger.kernel.org>
> > Sent: Thursday, May 10, 2007 4:45 PM
> > Subject: Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
> >
> >
> > > On Thu, 10 May 2007 13:35:22 -0700 (PDT)
> > > David Miller <davem@davemloft.net> wrote:
> > >
> > >> From: rhee@ncsu.edu
> > >> Date: Thu, 10 May 2007 14:39:25 -0400 (EDT)
> > >>
> > >> >
> > >> > Bill,
> > >> > Could you test with the lastest version of CUBIC? this is not the
> > >> > latest
> > >> > version of it you tested.
> > >>
> > >> Rhee-sangsang-nim, it might be a lot easier for people if you provide
> > >> a patch against the current tree for users to test instead of
> > >> constantly pointing them to your web site.
> > >> -
> > >
> > > The 2.6.22 version should have the latest version, that I know of.
> > > There was small patch from 2.6.21 that went in.
>
[-- Attachment #2: tcp_cubic-2.6.20.3.patch --]
[-- Type: application/octet-stream, Size: 927 bytes --]
--- net/ipv4/tcp_cubic.c.old 2007-03-22 14:08:56.000000000 -0400
+++ net/ipv4/tcp_cubic.c 2007-03-22 23:45:41.000000000 -0400
@@ -1,5 +1,5 @@
/*
- * TCP CUBIC: Binary Increase Congestion control for TCP v2.0
+ * TCP CUBIC: Binary Increase Congestion control for TCP v2.1
*
* This is from the implementation of CUBIC TCP in
* Injong Rhee, Lisong Xu.
@@ -215,7 +215,9 @@ static inline void bictcp_update(struct
if (ca->delay_min > 0) {
/* max increment = Smax * rtt / 0.1 */
min_cnt = (cwnd * HZ * 8)/(10 * max_increment * ca->delay_min);
- if (ca->cnt < min_cnt)
+
+ /* use concave growth when the target is above the origin */
+ if (ca->cnt < min_cnt && t >= ca->bic_K)
ca->cnt = min_cnt;
}
@@ -401,4 +403,4 @@ module_exit(cubictcp_unregister);
MODULE_AUTHOR("Sangtae Ha, Stephen Hemminger");
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("CUBIC TCP");
-MODULE_VERSION("2.0");
+MODULE_VERSION("2.1");
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-05-12 16:45 ` SANGTAE HA
@ 2007-05-16 6:44 ` Bill Fink
2007-06-12 22:12 ` David Miller
0 siblings, 1 reply; 21+ messages in thread
From: Bill Fink @ 2007-05-16 6:44 UTC (permalink / raw)
To: SANGTAE HA; +Cc: Injong Rhee, Stephen Hemminger, David Miller, rhee, netdev
Hi Sangtae,
On Sat, 12 May 2007, Sangtae Ha wrote:
> Hi Bill,
>
> This is the small patch that has been applied to 2.6.22.
> Also, there is "limited slow start", which is an experimental RFC
> (RFC3742), to surmount this large increase during slow start.
> But, your kernel might not have this. Please check there is a sysctl
> variable "tcp_max_ssthresh".
I reviewed RFC3742. It seems to me to be problematic for 10-GigE and
faster nets, although it might be OK for GigE nets.
Take the case of a 10-GigE net connecting two sites with an 85 ms RTT
(a real-world case I am interested in, connecting an East coast and a
West coast site). That equates to:
gwiz% bc -l
scale=10
10^10*0.085/9000/8
11805.5555555555
up to 11806 9000-byte jumbo frame packets possibly in flight during
one RTT at full 10-GigE line rate. Using the formula from RFC3742:
log(max_ssthresh) + (cwnd - max_ssthresh)/(max_ssthresh/2)
[note I believe the formula should have a "+1"]
and the recommended value for max_ssthresh of 100 gives:
gwiz% bc -l
scale=10
l(100)/l(2)+(11806-100)/(100/2)
240.7638561902
That's 241 RTTs to get up to full 10-GigE line rate, or a total
period of 241*0.085 = 20.485 seconds. And if you were using
standard 1500-byte packets, the numbers would become up to 70834
packets in flight during one RTT, 1422 RTTs to achieve full 10-GigE
line rate, which results in a total period of 120.870 seconds.
Considering that international links can have even larger RTTs
and the future will bring 100-GigE links, it doesn't appear this
method scales well to extremely high performance, large RTT paths.
For 10-GigE, max_ssthresh would need to be scaled up to 1000.
Combined with using 9000-byte jumbo frames, this results in:
gwiz% bc -l
scale=10
l(1000)/l(2)+(11806-1000)/(1000/2)
31.5777842854
That's only 32 RTTs to achieve full 10-GigE line rate, or a total
period of 2.720 seconds (compared to a bare minimum of 14 RTTs
and 1.190 seconds). Again using standard 1500-byte packets would
take six times as long. And while a 1000/2=500 packet increase may
seem like a lot, it's only 4.2% of the available 10-GigE bandwidth,
which is the same percentage as a 100/2=50 packet increase on a
GigE network.
BTW there seems to be a calculation error in RFC3742. They give an
example of a congestion window of 83000 packets (with max_ssthresh
set to 100). If you do the calculation based on the formula given
(which I believe to be correct), the number of RTTs works out to
be 1665. If you drop the "/2" from the formula, only then do you
get 836 RTTs as indicated in the RFC example.
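For convenience, here is a small Python sketch of the same round-trip formula
(taking the log as base 2, as in my bc runs, and leaving out the possible
"+1"); it reproduces the numbers above, including the with/without "/2"
discrepancy in the RFC's 83000-packet example:

from math import log2

def lss_rtts(cwnd, max_ssthresh, halve=True):
    # RFC 3742 round trips: log2(max_ssthresh) + (cwnd - max_ssthresh)/(max_ssthresh/2)
    step = max_ssthresh / 2.0 if halve else float(max_ssthresh)
    return log2(max_ssthresh) + (cwnd - max_ssthresh) / step

print(lss_rtts(11806, 100))               # ~240.8 RTTs (~20.5 sec at 85 ms)
print(lss_rtts(11806, 1000))              # ~31.6 RTTs  (~2.7 sec at 85 ms)
print(lss_rtts(83000, 100))               # ~1665 RTTs with the "/2"
print(lss_rtts(83000, 100, halve=False))  # ~836 RTTs without it (the RFC's figure)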
My 2.6.20.7 kernel does not have a tcp_max_ssthresh variable, but
it is an undocumented variable in 2.6.21-git13, so I may do some
more tests if and when I can get that working.
I applied your tcp_cubic.c patch to my 2.6.20.7 kernel and ran some
additional tests. The results were basically the same, which is to
be expected since you indicated the change didn't affect the slow start
behavior.
I did run a couple of new tests, transferring a fixed 1 GB amount
of data, and reducing the amount of buffering to the netem delay
emulator so it could only sustain a data rate of about 1.6 Gbps.
First with the default cubic slow start behavior:
[root@lang2 ~]# cat /proc/version
Linux version 2.6.20.7-cubic21 (root@lang2.eiger.nasa.atd.net) (gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)) #2 SMP Tue May 15 03:14:33 EDT 2007
[root@lang2 ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
cubic
[root@lang2 ~]# cat /sys/module/tcp_cubic/parameters/initial_ssthresh
100
[root@lang2 ~]# netstat -s | grep -i retrans
22360 segments retransmited
17850 fast retransmits
4503 retransmits in slow start
4 sack retransmits failed
[root@lang2 ~]# nuttcp -n1g -i1 -w60m 192.168.89.15
6.8188 MB / 1.00 sec = 56.9466 Mbps
16.2097 MB / 1.00 sec = 135.9795 Mbps
25.4553 MB / 1.00 sec = 213.5377 Mbps
35.6750 MB / 1.00 sec = 299.2676 Mbps
43.9124 MB / 1.00 sec = 368.3687 Mbps
52.2266 MB / 1.00 sec = 438.1139 Mbps
62.1045 MB / 1.00 sec = 520.9765 Mbps
73.9136 MB / 1.00 sec = 620.0401 Mbps
87.7820 MB / 1.00 sec = 736.3775 Mbps
104.2480 MB / 1.00 sec = 874.5074 Mbps
117.8259 MB / 1.00 sec = 988.3441 Mbps
139.7266 MB / 1.00 sec = 1172.1969 Mbps
171.8384 MB / 1.00 sec = 1441.5107 Mbps
1024.0000 MB / 13.44 sec = 639.2926 Mbps 7 %TX 7 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
22360 segments retransmited
17850 fast retransmits
4503 retransmits in slow start
4 sack retransmits failed
It took 13.44 seconds to transfer 1 GB of data, never experiencing
any congestion.
Now with the standard aggressive Reno "slow start" behavior, which
experiences "congestion" at about 1.6 Gbps:
[root@lang2 ~]# echo 0 >> /sys/module/tcp_cubic/parameters/initial_ssthresh
[root@lang2 ~]# cat /sys/module/tcp_cubic/parameters/initial_ssthresh
0
[root@lang2 ~]# nuttcp -n1g -i1 -w60m 192.168.89.15
34.5728 MB / 1.00 sec = 288.7865 Mbps
108.0847 MB / 1.00 sec = 906.6994 Mbps
160.3540 MB / 1.00 sec = 1345.0124 Mbps
180.6226 MB / 1.00 sec = 1515.3385 Mbps
195.5276 MB / 1.00 sec = 1640.2125 Mbps
199.6750 MB / 1.00 sec = 1675.0192 Mbps
1024.0000 MB / 6.70 sec = 1282.1900 Mbps 17 %TX 31 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
25446 segments retransmited
20936 fast retransmits
4503 retransmits in slow start
4 sack retransmits failed
It only took 6.70 seconds to transfer 1 GB of data. Note all the
retransmits were fast retransmits.
And finally with the standard aggressive Reno "slow start" behavior,
with no congestion experienced (increased the amount of buffering to
the netem delay emulator):
[root@lang2 ~]# nuttcp -n1g -i1 -w60m 192.168.89.15
69.9829 MB / 1.01 sec = 583.0183 Mbps
837.8787 MB / 1.00 sec = 7028.6427 Mbps
1024.0000 MB / 2.14 sec = 4005.2066 Mbps 52 %TX 32 %RX
[root@lang2 ~]# netstat -s | grep -i retrans
25446 segments retransmited
20936 fast retransmits
4503 retransmits in slow start
4 sack retransmits failed
It then only took 2.14 seconds to transfer 1 GB of data.
That's all for now.
-Bill
> Thanks,
> Sangtae
>
>
> On 5/12/07, Bill Fink <billfink@mindspring.com> wrote:
> > On Thu, 10 May 2007, Injong Rhee wrote:
> >
> > > Oops. I thought Bill was using 2.6.20 instead of 2.6.22 which should contain
> > > our latest update.
> >
> > I am using 2.6.20.7.
> >
> > > Regarding slow start behavior, the latest version should not change though.
> > > I think it would be ok to change the slow start of bic and cubic to the
> > > default slow start. But what we observed is that when BDP is large,
> > > increasing cwnd by two times is really an overkill. consider increasing from
> > > 1024 into 2048 packets..maybe the target is somewhere between them. We have
> > > potentially a large number of packets flushed into the network. That was the
> > > original motivation to change slow start from the default into a more gentle
> > > version. But I see the point that Bill is raising. We are working on
> > > improving this behavior in our lab. We will get back to this topic in a
> > > couple of weeks after we finish our testing and produce a patch.
> >
> > Is it feasible to replace the version of cubic in 2.6.20.7 with the
> > new 2.1 version of cubic without changing the rest of the kernel, or
> > are there kernel changes/dependencies that would prevent that?
> >
> > I've tried building and running a 2.6.21-git13 kernel, but am having
> > some difficulties. I will be away the rest of the weekend so won't be
> > able to get back to this until Monday.
> >
> > -Bill
> >
> > P.S. When getting into the the 10 Gbps range, I'm not sure there's
> > any way to avoid the types of large increases during "slow start"
> > that you mention, if you want to achieve those kinds of data
> > rates.
> >
> >
> >
> > > ----- Original Message -----
> > > From: "Stephen Hemminger" <shemminger@linux-foundation.org>
> > > To: "David Miller" <davem@davemloft.net>
> > > Cc: <rhee@ncsu.edu>; <billfink@mindspring.com>; <sangtae.ha@gmail.com>;
> > > <netdev@vger.kernel.org>
> > > Sent: Thursday, May 10, 2007 4:45 PM
> > > Subject: Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
> > >
> > >
> > > > On Thu, 10 May 2007 13:35:22 -0700 (PDT)
> > > > David Miller <davem@davemloft.net> wrote:
> > > >
> > > >> From: rhee@ncsu.edu
> > > >> Date: Thu, 10 May 2007 14:39:25 -0400 (EDT)
> > > >>
> > > >> >
> > > >> > Bill,
> > > >> > Could you test with the lastest version of CUBIC? this is not the
> > > >> > latest
> > > >> > version of it you tested.
> > > >>
> > > >> Rhee-sangsang-nim, it might be a lot easier for people if you provide
> > > >> a patch against the current tree for users to test instead of
> > > >> constantly pointing them to your web site.
> > > >> -
> > > >
> > > > The 2.6.22 version should have the latest version, that I know of.
> > > > There was small patch from 2.6.21 that went in.
> >
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-05-16 6:44 ` Bill Fink
@ 2007-06-12 22:12 ` David Miller
2007-06-13 0:16 ` Stephen Hemminger
0 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2007-06-12 22:12 UTC (permalink / raw)
To: billfink; +Cc: sangtae.ha, rhee, shemminger, rhee, netdev
From: Bill Fink <billfink@mindspring.com>
Date: Wed, 16 May 2007 02:44:09 -0400
> [root@lang2 ~]# netstat -s | grep -i retrans
> 25446 segments retransmited
> 20936 fast retransmits
> 4503 retransmits in slow start
> 4 sack retransmits failed
>
> It then only took 2.14 seconds to transfer 1 GB of data.
>
> That's all for now.
Thanks for all of your testing and numbers Bill.
Injong et al., we have to do something about this; the issue
has been known and sitting around for weeks if not months.
How safely can we set the default initial_ssthresh to zero in
Cubic and BIC?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-06-12 22:12 ` David Miller
@ 2007-06-13 0:16 ` Stephen Hemminger
2007-06-13 3:38 ` Bill Fink
0 siblings, 1 reply; 21+ messages in thread
From: Stephen Hemminger @ 2007-06-13 0:16 UTC (permalink / raw)
To: David Miller; +Cc: billfink, sangtae.ha, rhee, rhee, netdev
On Tue, 12 Jun 2007 15:12:58 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:
> From: Bill Fink <billfink@mindspring.com>
> Date: Wed, 16 May 2007 02:44:09 -0400
>
> > [root@lang2 ~]# netstat -s | grep -i retrans
> > 25446 segments retransmited
> > 20936 fast retransmits
> > 4503 retransmits in slow start
> > 4 sack retransmits failed
> >
> > It then only took 2.14 seconds to transfer 1 GB of data.
> >
> > That's all for now.
>
> Thanks for all of your testing and numbers Bill.
>
> Inhong et al., we have to do something about this, the issue
> has been known and sitting around for weeks if not months.
>
> How safely can we set the default initial_ssthresh to zero in
> Cubic and BIC?
Yes, set it to zero. The module parameter could even go, and just
leave the route metric as a way to set/remember it.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-06-13 0:16 ` Stephen Hemminger
@ 2007-06-13 3:38 ` Bill Fink
2007-06-13 7:56 ` David Miller
0 siblings, 1 reply; 21+ messages in thread
From: Bill Fink @ 2007-06-13 3:38 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, sangtae.ha, rhee, rhee, netdev
On Tue, 12 Jun 2007, Stephen Hemminger wrote:
> On Tue, 12 Jun 2007 15:12:58 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
>
> > From: Bill Fink <billfink@mindspring.com>
> > Date: Wed, 16 May 2007 02:44:09 -0400
> >
> > > [root@lang2 ~]# netstat -s | grep -i retrans
> > > 25446 segments retransmited
> > > 20936 fast retransmits
> > > 4503 retransmits in slow start
> > > 4 sack retransmits failed
> > >
> > > It then only took 2.14 seconds to transfer 1 GB of data.
> > >
> > > That's all for now.
> >
> > Thanks for all of your testing and numbers Bill.
> >
> > Inhong et al., we have to do something about this, the issue
> > has been known and sitting around for weeks if not months.
> >
> > How safely can we set the default initial_ssthresh to zero in
> > Cubic and BIC?
>
> Yes. set it to zero. The module parameter could even go, and just
> leave the route metric as a way to set/remember it.
Actually, after thinking about this some more I had some second
thoughts about the matter. For my scenario of an uncongested 10-GigE
path an initial_ssthresh=0 is definitely what is desired.
But perhaps on a congested link with lots of connections, the
initial_ssthresh=100 setting might have some benefit. I don't
have an easy way of testing that so I was hoping Injong or someone
else might do that and report back. If there was a benefit, perhaps
it would be useful to have a per-route option for setting the
initial_ssthresh. That would leave the question of what to make
the default. There was also the mystery of why cubic's slow start
performance was so much worse than bic's. If a real benefit could
be demonstrated for the congested case, and if bic's slow start
behavior could be grafted onto cubic, then bic's current slow start
performance (with initial_ssthresh=100) might serve as an adequate
compromise between performance and not being overly aggressive for
the default behavior.
OTOH just setting it to zero as a default should also be fine as
that's the standard Reno behavior. I'm leaning in that direction
personally, but I'm possibly biased because of my environment,
where I'm trying to get maximum performance out of 10-GigE WAN
networks that aren't particularly congested normally.
-Bill
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
2007-06-13 3:38 ` Bill Fink
@ 2007-06-13 7:56 ` David Miller
2007-06-13 17:27 ` [PATCH] TCP: remove initial_ssthresh from Cubic Stephen Hemminger
0 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2007-06-13 7:56 UTC (permalink / raw)
To: billfink; +Cc: shemminger, sangtae.ha, rhee, rhee, netdev
From: Bill Fink <billfink@mindspring.com>
Date: Tue, 12 Jun 2007 23:38:14 -0400
> If there was a benefit, perhaps it would be useful to have a
> per-route option for setting the initial_ssthresh.
We have this per-route setting already; BIC and CUBIC just override it
with their local initial_ssthresh value when that is non-zero.
I wonder how many folks get tripped up by that totally unexpected
behavior. What's the point of the route metric we've had all of
these years if the congestion control algorithm is just going to
totally ignore it? That should never be the default behavior.
> OTOH just setting it to zero as a default should also be fine as
> that's the standard Reno behavior.
This is what I'm going to do.
Injong-ssi and Sangtae-ssi have had several weeks to give us final
guidance in this area, we cannot wait forever.
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] TCP: remove initial_ssthresh from Cubic
2007-06-13 7:56 ` David Miller
@ 2007-06-13 17:27 ` Stephen Hemminger
2007-06-13 18:26 ` David Miller
0 siblings, 1 reply; 21+ messages in thread
From: Stephen Hemminger @ 2007-06-13 17:27 UTC (permalink / raw)
To: David Miller; +Cc: billfink, sangtae.ha, rhee, rhee, netdev
Remove the initial slow start override from TCP cubic.
The old code caused Cubic to start out in slow start mode, which
is less aggressive but caused slow performance.
The administrator can override initial slow start threshold on any
TCP congestion control method via the TCP route metrics.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
--- a/net/ipv4/tcp_cubic.c 2007-06-13 10:18:38.000000000 -0700
+++ b/net/ipv4/tcp_cubic.c 2007-06-13 10:19:19.000000000 -0700
@@ -29,7 +29,6 @@
static int fast_convergence __read_mostly = 1;
static int max_increment __read_mostly = 16;
static int beta __read_mostly = 819; /* = 819/1024 (BICTCP_BETA_SCALE) */
-static int initial_ssthresh __read_mostly = 100;
static int bic_scale __read_mostly = 41;
static int tcp_friendliness __read_mostly = 1;
@@ -44,8 +43,6 @@ module_param(max_increment, int, 0644);
MODULE_PARM_DESC(max_increment, "Limit on increment allowed during binary search");
module_param(beta, int, 0444);
MODULE_PARM_DESC(beta, "beta for multiplicative increase");
-module_param(initial_ssthresh, int, 0644);
-MODULE_PARM_DESC(initial_ssthresh, "initial value of slow start threshold");
module_param(bic_scale, int, 0444);
MODULE_PARM_DESC(bic_scale, "scale (scaled by 1024) value for bic function (bic_scale/1024)");
module_param(tcp_friendliness, int, 0644);
@@ -87,8 +84,6 @@ static inline void bictcp_reset(struct b
static void bictcp_init(struct sock *sk)
{
bictcp_reset(inet_csk_ca(sk));
- if (initial_ssthresh)
- tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
}
/* calculate the cubic root of x using a table lookup followed by one
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] TCP: remove initial_ssthresh from Cubic
2007-06-13 17:27 ` [PATCH] TCP: remove initial_ssthresh from Cubic Stephen Hemminger
@ 2007-06-13 18:26 ` David Miller
2007-06-13 18:31 ` Stephen Hemminger
0 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2007-06-13 18:26 UTC (permalink / raw)
To: shemminger; +Cc: billfink, sangtae.ha, rhee, rhee, netdev
From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Wed, 13 Jun 2007 10:27:18 -0700
Please make patches against my net-2.6 tree; I already
made changes in this area.
> Remove the initial slow start threshold override from TCP Cubic.
> The old code forced Cubic out of slow start once the congestion window
> reached 100 segments, which is less aggressive but caused slow startup
> performance.
>
> The administrator can still override the initial slow start threshold
> for any TCP congestion control method via the TCP route metrics.
>
> Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
I already sent a merge request to Linus that just changed
the default to zero.
No need to remove it for now, so people can still play
with it if they want to; that's harmless.
You missed BIC too, which I did take care of...
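Since the parameter stays in place with a zero default, it can still be
played with; a rough sketch, assuming tcp_cubic is loaded and its
initial_ssthresh module parameter is writable (mode 0644, as in the code
above):

# Re-enable a 100-segment initial slow start threshold for cubic
echo 100 > /sys/module/tcp_cubic/parameters/initial_ssthresh
cat /sys/module/tcp_cubic/parameters/initial_ssthresh
# Or set it at module load time
modprobe tcp_cubic initial_ssthresh=100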
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] TCP: remove initial_ssthresh from Cubic
2007-06-13 18:26 ` David Miller
@ 2007-06-13 18:31 ` Stephen Hemminger
2007-06-13 18:46 ` David Miller
0 siblings, 1 reply; 21+ messages in thread
From: Stephen Hemminger @ 2007-06-13 18:31 UTC (permalink / raw)
To: David Miller; +Cc: billfink, sangtae.ha, rhee, rhee, netdev
On Wed, 13 Jun 2007 11:26:52 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:
> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Wed, 13 Jun 2007 10:27:18 -0700
>
> Please make patches against my net-2.6 tree; I already
> made changes in this area.
>
> > Remove the initial slow start threshold override from TCP Cubic.
> > The old code forced Cubic out of slow start once the congestion window
> > reached 100 segments, which is less aggressive but caused slow startup
> > performance.
> >
> > The administrator can still override the initial slow start threshold
> > for any TCP congestion control method via the TCP route metrics.
> >
> > Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
>
> I already sent a merge request to Linus that just changed
> the default to zero.
>
> No need to remove it for now, so people can still play
> with it if they want to; that's harmless.
>
> You missed BIC too, which I did take care of...
Maybe it is time to remove BIC?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] TCP: remove initial_ssthresh from Cubic
2007-06-13 18:31 ` Stephen Hemminger
@ 2007-06-13 18:46 ` David Miller
2007-06-14 4:03 ` Bill Fink
0 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2007-06-13 18:46 UTC (permalink / raw)
To: shemminger; +Cc: billfink, sangtae.ha, rhee, rhee, netdev
From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Wed, 13 Jun 2007 11:31:49 -0700
> Maybe it is time to remove BIC?
I don't see any compelling reason; the same could be said
of the other experimental protocols we include in the tree.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] TCP: remove initial_ssthresh from Cubic
2007-06-13 18:46 ` David Miller
@ 2007-06-14 4:03 ` Bill Fink
0 siblings, 0 replies; 21+ messages in thread
From: Bill Fink @ 2007-06-14 4:03 UTC (permalink / raw)
To: David Miller; +Cc: shemminger, sangtae.ha, rhee, rhee, netdev
On Wed, 13 Jun 2007, David Miller wrote:
> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Wed, 13 Jun 2007 11:31:49 -0700
>
> > Maybe it is time to remove BIC?
>
> I don't see any compelling reason; the same could be said
> of the other experimental protocols we include in the tree.
I agree bic should be kept. As I pointed out, if someone did want
to set the bic/cubic initial_ssthresh to 100 globally, my tests
showed bic's performance during the initial slow start phase was
far superior to cubic's. I don't know if this is a bug or a
feature with cubic.
-Bill
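To reproduce the comparison Bill mentions (bic versus cubic, each with
initial_ssthresh forced to 100 globally), something along the lines of the
earlier nuttcp runs should work; the receiver address below is a
placeholder, and both tcp_bic and tcp_cubic are assumed to be loaded with
writable module parameters:

RECEIVER=192.0.2.1   # placeholder receiver address
# Force the old 100-segment threshold for both algorithms
echo 100 > /sys/module/tcp_bic/parameters/initial_ssthresh
echo 100 > /sys/module/tcp_cubic/parameters/initial_ssthresh
# Measure initial ramp-up with one-second interval reports
echo bic > /proc/sys/net/ipv4/tcp_congestion_control
nuttcp -T30 -i1 -w60m $RECEIVER
echo cubic > /proc/sys/net/ipv4/tcp_congestion_control
nuttcp -T30 -i1 -w60m $RECEIVER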
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread
Thread overview: 21+ messages
2007-05-07 0:37 2.6.20.7 TCP cubic (and bic) initial slow start way too slow? Bill Fink
2007-05-08 9:41 ` SANGTAE HA
2007-05-09 6:31 ` Bill Fink
2007-05-10 15:32 ` Bill Fink
2007-05-10 17:31 ` SANGTAE HA
2007-05-10 18:39 ` rhee
2007-05-10 20:35 ` David Miller
2007-05-10 20:45 ` Stephen Hemminger
2007-05-10 20:57 ` Injong Rhee
2007-05-12 16:07 ` Bill Fink
2007-05-12 16:45 ` SANGTAE HA
2007-05-16 6:44 ` Bill Fink
2007-06-12 22:12 ` David Miller
2007-06-13 0:16 ` Stephen Hemminger
2007-06-13 3:38 ` Bill Fink
2007-06-13 7:56 ` David Miller
2007-06-13 17:27 ` [PATCH] TCP: remove initial_ssthresh from Cubic Stephen Hemminger
2007-06-13 18:26 ` David Miller
2007-06-13 18:31 ` Stephen Hemminger
2007-06-13 18:46 ` David Miller
2007-06-14 4:03 ` Bill Fink