* [PATCH] Make CUBIC Hystart more robust to RTT variations
@ 2011-03-08  9:32 Lucas Nussbaum
  2011-03-08 10:21 ` WANG Cong
  2011-03-10 23:28 ` Stephen Hemminger
  0 siblings, 2 replies; 27+ messages in thread

From: Lucas Nussbaum @ 2011-03-08 9:32 UTC (permalink / raw)
To: netdev; +Cc: Sangtae Ha

CUBIC Hystart uses two heuristics to exit slow start earlier, before
losses start to occur. Unfortunately, it tends to exit slow start far too
early, causing poor performance, since convergence to the optimal cwnd is
then very slow. This was reported in
http://permalink.gmane.org/gmane.linux.network/188169 and
https://partner-bugzilla.redhat.com/show_bug.cgi?id=616985

I am using an experimental testbed (http://www.grid5000.fr/) with two
machines connected via Gigabit Ethernet to a dedicated 10-Gb backbone. The
RTT between the two machines is 11.3 ms. Using TCP CUBIC without Hystart,
cwnd grows to ~2200. With Hystart enabled, CUBIC exits slow start with
cwnd lower than 100, and often lower than 20, which leads to the poor
performance I reported.

After instrumenting TCP CUBIC, I found that the segment-to-ACK RTT tends
to vary quite a lot even when the network is not congested, due to several
factors, including the fact that TCP sends packets in bursts (so packets
are queued locally before being sent, increasing their RTT) and delayed
ACKs on the destination host.

The patch below increases the thresholds used by the two Hystart
heuristics. First, the length of an ACK train needs to reach 2*minRTT.
Second, the max RTT of a group of packets also needs to reach 2*minRTT. In
my setup, this causes Hystart to exit slow start with cwnd in the
1900-2000 range using the ACK-train heuristic, and sometimes in the
700-900 range using the delay-increase heuristic, dramatically improving
performance.

I've left a commented-out printk that is useful for debugging Hystart's
exit point. I could also provide access to my testbed if someone wants to
run further experiments.
Signed-off-by: Lucas Nussbaum <lucas.nussbaum@loria.fr>
--
| Lucas Nussbaum                 MCF Université Nancy 2 |
| lucas.nussbaum@loria.fr        LORIA / AlGorille      |
| http://www.loria.fr/~lnussbau/ +33 3 54 95 86 19      |

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 71d5f2f..a973a49 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -344,7 +344,7 @@ static void hystart_update(struct sock *sk, u32 delay)
 	/* first detection parameter - ack-train detection */
 	if (curr_jiffies - ca->last_jiffies <= msecs_to_jiffies(2)) {
 		ca->last_jiffies = curr_jiffies;
-		if (curr_jiffies - ca->round_start >= ca->delay_min>>4)
+		if (curr_jiffies - ca->round_start >= ca->delay_min>>2)
 			ca->found |= HYSTART_ACK_TRAIN;
 	}

@@ -355,8 +355,7 @@ static void hystart_update(struct sock *sk, u32 delay)
 			ca->sample_cnt++;
 		} else {
-			if (ca->curr_rtt > ca->delay_min +
-			    HYSTART_DELAY_THRESH(ca->delay_min>>4))
+			if (ca->curr_rtt > ca->delay_min<<1)
 				ca->found |= HYSTART_DELAY;
 		}
 		/*
@@ -364,7 +363,10 @@ static void hystart_update(struct sock *sk, u32 delay)
 		 * we exit from slow start immediately.
 		 */
 		if (ca->found & hystart_detect)
+		{
+//			printk("hystart_update: cwnd=%u found=%d delay_min=%u cur_jif=%u round_start=%u curr_rtt=%u\n", tp->snd_cwnd, ca->found, ca
 			tp->snd_ssthresh = tp->snd_cwnd;
+		}
 	}
 }

^ permalink raw reply related	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-08  9:32 [PATCH] Make CUBIC Hystart more robust to RTT variations Lucas Nussbaum
@ 2011-03-08 10:21 ` WANG Cong
  2011-03-08 11:10   ` Lucas Nussbaum
  2011-03-10 23:28 ` Stephen Hemminger
  1 sibling, 1 reply; 27+ messages in thread

From: WANG Cong @ 2011-03-08 10:21 UTC (permalink / raw)
To: netdev

On Tue, 08 Mar 2011 10:32:15 +0100, Lucas Nussbaum wrote:

> +		{
> +//			printk("hystart_update: cwnd=%u found=%d delay_min=%u cur_jif=%u round_start=%u curr_rtt=%u\n", tp->snd_cwnd, ca->found, ca

Please remove this line from your patch.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-08 10:21 ` WANG Cong
@ 2011-03-08 11:10   ` Lucas Nussbaum
  2011-03-08 15:26     ` Injong Rhee
       [not found]     ` <AANLkTimdpEKHfVKw+bm6OnymcnUrauU+jGOPeLzy3Q0o@mail.gmail.com>
  0 siblings, 2 replies; 27+ messages in thread

From: Lucas Nussbaum @ 2011-03-08 11:10 UTC (permalink / raw)
To: WANG Cong; +Cc: netdev

CUBIC Hystart uses two heuristics to exit slow start earlier, before
losses start to occur. Unfortunately, it tends to exit slow start far too
early, causing poor performance, since convergence to the optimal cwnd is
then very slow. This was reported in
http://permalink.gmane.org/gmane.linux.network/188169 and
https://partner-bugzilla.redhat.com/show_bug.cgi?id=616985

I am using an experimental testbed (http://www.grid5000.fr/) with two
machines connected via Gigabit Ethernet to a dedicated 10-Gb backbone. The
RTT between the two machines is 11.3 ms. Using TCP CUBIC without Hystart,
cwnd grows to ~2200. With Hystart enabled, CUBIC exits slow start with
cwnd lower than 100, and often lower than 20, which leads to the poor
performance I reported.

After instrumenting TCP CUBIC, I found that the segment-to-ACK RTT tends
to vary quite a lot even when the network is not congested, due to several
factors, including the fact that TCP sends packets in bursts (so packets
are queued locally before being sent, increasing their RTT) and delayed
ACKs on the destination host.

The patch below increases the thresholds used by the two Hystart
heuristics. First, the length of an ACK train needs to reach 2*minRTT.
Second, the max RTT of a group of packets also needs to reach 2*minRTT. In
my setup, this causes Hystart to exit slow start with cwnd in the
1900-2000 range using the ACK-train heuristic, and sometimes in the
700-900 range using the delay-increase heuristic, dramatically improving
performance.

I could provide access to my testbed if someone wants to run further
experiments.
Signed-off-by: Lucas Nussbaum <lucas.nussbaum@loria.fr>
--
| Lucas Nussbaum                 MCF Université Nancy 2 |
| lucas.nussbaum@loria.fr        LORIA / AlGorille      |
| http://www.loria.fr/~lnussbau/ +33 3 54 95 86 19      |
---

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 71d5f2f..e404de4 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -344,7 +344,7 @@ static void hystart_update(struct sock *sk, u32 delay)
 	/* first detection parameter - ack-train detection */
 	if (curr_jiffies - ca->last_jiffies <= msecs_to_jiffies(2)) {
 		ca->last_jiffies = curr_jiffies;
-		if (curr_jiffies - ca->round_start >= ca->delay_min>>4)
+		if (curr_jiffies - ca->round_start >= ca->delay_min>>2)
 			ca->found |= HYSTART_ACK_TRAIN;
 	}

@@ -355,8 +355,7 @@ static void hystart_update(struct sock *sk, u32 delay)
 			ca->sample_cnt++;
 		} else {
-			if (ca->curr_rtt > ca->delay_min +
-			    HYSTART_DELAY_THRESH(ca->delay_min>>4))
+			if (ca->curr_rtt > ca->delay_min<<1)
 				ca->found |= HYSTART_DELAY;
 		}
 		/*

^ permalink raw reply related	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-08 11:10   ` Lucas Nussbaum
@ 2011-03-08 15:26     ` Injong Rhee
  2011-03-08 19:43       ` David Miller
       [not found]       ` <AANLkTimdpEKHfVKw+bm6OnymcnUrauU+jGOPeLzy3Q0o@mail.gmail.com>
  1 sibling, 1 reply; 27+ messages in thread

From: Injong Rhee @ 2011-03-08 15:26 UTC (permalink / raw)
To: Lucas Nussbaum; +Cc: WANG Cong, netdev

Thanks for updating CUBIC hystart. You might want to test the cases with
more background traffic and verify whether this threshold is too
conservative.

On 3/8/11 6:10 AM, Lucas Nussbaum wrote:
> CUBIC Hystart uses two heuristics to exit slow start earlier, before losses start to occur. Unfortunately, it tends to exit slow start far too early, causing poor performance, since convergence to the optimal cwnd is then very slow. This was reported in http://permalink.gmane.org/gmane.linux.network/188169 and https://partner-bugzilla.redhat.com/show_bug.cgi?id=616985
>
> I am using an experimental testbed (http://www.grid5000.fr/) with two machines connected via Gigabit Ethernet to a dedicated 10-Gb backbone. The RTT between the two machines is 11.3 ms. Using TCP CUBIC without Hystart, cwnd grows to ~2200. With Hystart enabled, CUBIC exits slow start with cwnd lower than 100, and often lower than 20, which leads to the poor performance I reported.
>
> After instrumenting TCP CUBIC, I found that the segment-to-ACK RTT tends to vary quite a lot even when the network is not congested, due to several factors, including the fact that TCP sends packets in bursts (so packets are queued locally before being sent, increasing their RTT) and delayed ACKs on the destination host.
>
> The patch below increases the thresholds used by the two Hystart heuristics. First, the length of an ACK train needs to reach 2*minRTT. Second, the max RTT of a group of packets also needs to reach 2*minRTT. In my setup, this causes Hystart to exit slow start with cwnd in the 1900-2000 range using the ACK-train heuristic, and sometimes in the 700-900 range using the delay-increase heuristic, dramatically improving performance.
>
> I could provide access to my testbed if someone wants to run further experiments.
>
> Signed-off-by: Lucas Nussbaum <lucas.nussbaum@loria.fr>

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-08 15:26     ` Injong Rhee
@ 2011-03-08 19:43       ` David Miller
  2011-03-08 23:21         ` Stephen Hemminger
  0 siblings, 1 reply; 27+ messages in thread

From: David Miller @ 2011-03-08 19:43 UTC (permalink / raw)
To: rhee; +Cc: lucas.nussbaum, xiyou.wangcong, netdev

From: Injong Rhee <rhee@ncsu.edu>
Date: Tue, 08 Mar 2011 10:26:36 -0500

> Thanks for updating CUBIC hystart. You might want to test the cases with more background traffic and verify whether this threshold is too conservative.

So let's get down to basics.

What does Hystart do specially that allows it to avoid all of the
problems that TCP Vegas runs into?

Specifically: if you use RTTs to make congestion control decisions, it is
impossible to notice new bandwidth becoming available fast enough.

Again, it's impossible to react fast enough. No matter what you tweak all
of your various settings to, this problem will still exist.

This is a core issue; you cannot get around it.

This is why I feel that Hystart is fundamentally flawed and we should
turn it off by default, if not flat-out remove it.

Distributions are turning it off by default already, therefore it's
stupid for the upstream kernel to behave differently if that's what 99%
of the world is going to end up experiencing.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-08 19:43       ` David Miller
@ 2011-03-08 23:21         ` Stephen Hemminger
  2011-03-09  1:30           ` Injong Rhee
  2011-03-09  1:33           ` Sangtae Ha
  0 siblings, 2 replies; 27+ messages in thread

From: Stephen Hemminger @ 2011-03-08 23:21 UTC (permalink / raw)
To: David Miller; +Cc: rhee, lucas.nussbaum, xiyou.wangcong, netdev

On Tue, 08 Mar 2011 11:43:46 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Injong Rhee <rhee@ncsu.edu>
> Date: Tue, 08 Mar 2011 10:26:36 -0500
>
> > Thanks for updating CUBIC hystart. You might want to test the cases with more background traffic and verify whether this threshold is too conservative.
>
> So let's get down to basics.
>
> What does Hystart do specially that allows it to avoid all of the problems that TCP Vegas runs into?
>
> Specifically: if you use RTTs to make congestion control decisions, it is impossible to notice new bandwidth becoming available fast enough.
>
> Again, it's impossible to react fast enough. No matter what you tweak all of your various settings to, this problem will still exist.
>
> This is a core issue; you cannot get around it.
>
> This is why I feel that Hystart is fundamentally flawed and we should turn it off by default, if not flat-out remove it.
>
> Distributions are turning it off by default already, therefore it's stupid for the upstream kernel to behave differently if that's what 99% of the world is going to end up experiencing.

The assumption in Hystart that spacing between ACKs is solely due to
congestion is bad. If you read the paper, this is why FreeBSD's
estimation logic is dismissed. The Hystart problem is different from the
Vegas issue.

Algorithms that look at the min RTT are OK, since the lower bound is
fixed; additional queuing and variation in the network only ever
increases the RTT, never reduces it. With a min RTT it is possible to
compute an upper bound on the available bandwidth.
That is, if all packets were as good as the minRTT estimate, then the
available bandwidth is X. But using an individual RTT sample to estimate
unused bandwidth is flawed. To quote the paper:

"Thus, by checking whether ∆(N) is larger than Dmin, we can detect
whether cwnd has reached the available capacity of the path"

So what goes wrong:

1. Dmin can be too large because this connection always sees delays due
to other traffic or hardware, i.e. buffer bloat. This would cause the
bandwidth estimate to be too low, and therefore TCP would leave slow
start too early (and not get up to full bandwidth).

2. Dmin can be smaller than the clock resolution. This would cause either
the sample to be ignored or Dmin to be zero. If Dmin is zero, the
bandwidth estimate would in theory be infinite, which would lead to TCP
not leaving slow start because of Hystart; instead, TCP would leave slow
start at the first loss.

Other possible problems:

3. ACKs could be nudged together by variations in delay. This would cause
Hystart to falsely think it sees an ACK train and exit slow start
prematurely.

Noise in the network is not catastrophic; it just causes TCP to exit slow
start early and go into the normal window-growth phase. The problem is
that the original non-Hystart behavior of CUBIC is unfair: the first flow
dominates the link and other flows are unable to get in. If you run tests
with two flows, one will get a larger share of the bandwidth.

I think Hystart is OK in concept, but there may be issues on low-RTT
links as well as other corner cases that need bug fixing:

1. It needs to use better resolution than HZ, since HZ can be 100.
2. Hardcoding 2 ms as the ACK-train spacing is wrong for local networks.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-08 23:21         ` Stephen Hemminger
@ 2011-03-09  1:30           ` Injong Rhee
  2011-03-09  6:53             ` Lucas Nussbaum
  2011-03-09  1:33           ` Sangtae Ha
  1 sibling, 1 reply; 27+ messages in thread

From: Injong Rhee @ 2011-03-09 1:30 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David Miller, lucas.nussbaum, xiyou.wangcong, netdev, sangtae.ha

HyStart is a slow start algorithm, not a congestion control algorithm, so
the difference between Vegas and HyStart is obvious. Yes, both HyStart
and Vegas use delays as an indication of congestion. But HyStart exits
slow start on detecting congestion and enters normal congestion
avoidance; in some sense it is much safer than Vegas, as it does not
change the regular behavior of congestion control. I think the main
problem arising right now is not that it uses noisy delays as a
congestion indication, but rather some implementation issues, like the
use of HZ, the hardcoded 2 ms, etc.

Then you might ask why HyStart can use delays while Vegas can't. The main
motivation for using delays during slow start is that slow start creates
an environment where delay samples can be trusted more: because of window
doubling, it sends many packets as a burst, and that burst can be used as
a packet train to estimate the available capacity more reliably. (Tool 1)
When many packets are sent in a burst, the spacing of the returning ACKs
can be a good indicator. (Tool 2) HyStart also uses delays as an
estimate: if the estimated average delay increases beyond a certain
threshold, it treats that as possible congestion.

Now, both tools can be wrong. But that is not catastrophic, since
congestion avoidance can kick in to save the day. In a pipe where no
other flows are competing, exiting slow start too early can slow things
down, as the window can still be too small. But that is in fact when
delays are most reliable.
So those tests that report bad performance with hystart are in fact cases
where hystart is supposed to perform well. Then why do we see bad
performance? I think the answer is again the implementation flaws (use of
different hardware, some hardwired constants, etc.), and possibly a few
corner cases like very low-RTT links.

Let us examine Stephen's analysis in more detail:

1. Use of minRTT is OK. I agree.

2. Dmin can be too large at the beginning. But it is just like minRTT: it
cannot be too large. If you trust minRTT, then delay estimation should
say that there is congestion. This is exactly the opposite of the cases
we are seeing: if Dmin were too large, hystart would not exit slow start,
as it would not detect the congestion. That is not what we are seeing
right now.

3. Dmin can be smaller than the clock resolution. That is why we are
using a bunch of ACKs to get better accuracy: with a bunch of ACKs we get
a larger spacing value, so we can take an average.

4. If ACKs are nudged together, hystart does not quit slow start;
instead, it sees that there is no congestion. It is when it sees big
spacing between ACKs that it detects congestion.

On 3/8/11 6:21 PM, Stephen Hemminger wrote:
> On Tue, 08 Mar 2011 11:43:46 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
>
>> From: Injong Rhee <rhee@ncsu.edu>
>> Date: Tue, 08 Mar 2011 10:26:36 -0500
>>
>>> Thanks for updating CUBIC hystart. You might want to test the cases with more background traffic and verify whether this threshold is too conservative.
>>
>> So let's get down to basics.
>>
>> What does Hystart do specially that allows it to avoid all of the problems that TCP Vegas runs into?
>>
>> Specifically: if you use RTTs to make congestion control decisions, it is impossible to notice new bandwidth becoming available fast enough.
>>
>> Again, it's impossible to react fast enough. No matter what you tweak all of your various settings to, this problem will still exist.
>>
>> This is a core issue; you cannot get around it.
>>
>> This is why I feel that Hystart is fundamentally flawed and we should turn it off by default, if not flat-out remove it.
>>
>> Distributions are turning it off by default already, therefore it's stupid for the upstream kernel to behave differently if that's what 99% of the world is going to end up experiencing.
>
> The assumption in Hystart that spacing between ACKs is solely due to congestion is bad. If you read the paper, this is why FreeBSD's estimation logic is dismissed. The Hystart problem is different from the Vegas issue.
>
> Algorithms that look at the min RTT are OK, since the lower bound is fixed; additional queuing and variation in the network only ever increases the RTT, never reduces it. With a min RTT it is possible to compute an upper bound on the available bandwidth. That is, if all packets were as good as the minRTT estimate, then the available bandwidth is X. But using an individual RTT sample to estimate unused bandwidth is flawed. To quote the paper:
>
> "Thus, by checking whether ∆(N) is larger than Dmin, we can detect whether cwnd has reached the available capacity of the path"
>
> So what goes wrong:
>
> 1. Dmin can be too large because this connection always sees delays due to other traffic or hardware, i.e. buffer bloat. This would cause the bandwidth estimate to be too low, and therefore TCP would leave slow start too early (and not get up to full bandwidth).
>
> 2. Dmin can be smaller than the clock resolution. This would cause either the sample to be ignored or Dmin to be zero. If Dmin is zero, the bandwidth estimate would in theory be infinite, which would lead to TCP not leaving slow start because of Hystart; instead, TCP would leave slow start at the first loss.
>
> Other possible problems:
>
> 3. ACKs could be nudged together by variations in delay. This would cause Hystart to falsely think it sees an ACK train and exit slow start prematurely.
>
> Noise in the network is not catastrophic; it just causes TCP to exit slow start early and go into the normal window-growth phase. The problem is that the original non-Hystart behavior of CUBIC is unfair: the first flow dominates the link and other flows are unable to get in. If you run tests with two flows, one will get a larger share of the bandwidth.
>
> I think Hystart is OK in concept, but there may be issues on low-RTT links as well as other corner cases that need bug fixing:
>
> 1. It needs to use better resolution than HZ, since HZ can be 100.
> 2. Hardcoding 2 ms as the ACK-train spacing is wrong for local networks.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-09  1:30           ` Injong Rhee
@ 2011-03-09  6:53             ` Lucas Nussbaum
  2011-03-09 17:56               ` Stephen Hemminger
  2011-03-10  5:24               ` Bill Fink
  0 siblings, 2 replies; 27+ messages in thread

From: Lucas Nussbaum @ 2011-03-09 6:53 UTC (permalink / raw)
To: Injong Rhee
Cc: Stephen Hemminger, David Miller, xiyou.wangcong, netdev, sangtae.ha

On 08/03/11 at 20:30 -0500, Injong Rhee wrote:
> Now, both tools can be wrong. But that is not catastrophic, since congestion avoidance can kick in to save the day. In a pipe where no other flows are competing, exiting slow start too early can slow things down, as the window can still be too small. But that is in fact when delays are most reliable. So those tests that report bad performance with hystart are in fact cases where hystart is supposed to perform well.

Hi,

In my setup, there is no congestion at all (except for buffer bloat).
Without Hystart, transferring 8 Gb of data takes 9 s, with CUBIC exiting
slow start at ~2000 packets. With Hystart, transferring 8 Gb of data
takes 19 s, with CUBIC exiting slow start at ~20 packets. I don't think
this is "hystart performing well". We could just as well remove slow
start completely and only do congestion avoidance, then.

While I see the value in Hystart, it's clear that there are some flaws in
the current implementation. It probably makes sense to disable hystart by
default until those problems are fixed.

--
| Lucas Nussbaum                 MCF Université Nancy 2 |
| lucas.nussbaum@loria.fr        LORIA / AlGorille      |
| http://www.loria.fr/~lnussbau/ +33 3 54 95 86 19      |

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-09  6:53             ` Lucas Nussbaum
@ 2011-03-09 17:56               ` Stephen Hemminger
  2011-03-09 18:25                 ` Lucas Nussbaum
  1 sibling, 1 reply; 27+ messages in thread

From: Stephen Hemminger @ 2011-03-09 17:56 UTC (permalink / raw)
To: Lucas Nussbaum
Cc: Injong Rhee, David Miller, xiyou.wangcong, netdev, sangtae.ha

On Wed, 9 Mar 2011 07:53:19 +0100
Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote:

> On 08/03/11 at 20:30 -0500, Injong Rhee wrote:
> > Now, both tools can be wrong. But that is not catastrophic, since congestion avoidance can kick in to save the day. In a pipe where no other flows are competing, exiting slow start too early can slow things down, as the window can still be too small. But that is in fact when delays are most reliable. So those tests that report bad performance with hystart are in fact cases where hystart is supposed to perform well.
>
> Hi,
>
> In my setup, there is no congestion at all (except for buffer bloat). Without Hystart, transferring 8 Gb of data takes 9 s, with CUBIC exiting slow start at ~2000 packets. With Hystart, transferring 8 Gb of data takes 19 s, with CUBIC exiting slow start at ~20 packets. I don't think this is "hystart performing well". We could just as well remove slow start completely and only do congestion avoidance, then.
>
> While I see the value in Hystart, it's clear that there are some flaws in the current implementation. It probably makes sense to disable hystart by default until those problems are fixed.

What is the speed and the RTT of your network?
I think you may be blaming hystart for other issues in the network.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-09 17:56               ` Stephen Hemminger
@ 2011-03-09 18:25                 ` Lucas Nussbaum
  2011-03-09 19:56                   ` Stephen Hemminger
  2011-03-09 20:01                   ` Stephen Hemminger
  0 siblings, 2 replies; 27+ messages in thread

From: Lucas Nussbaum @ 2011-03-09 18:25 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Injong Rhee, David Miller, xiyou.wangcong, netdev, sangtae.ha

On 09/03/11 at 09:56 -0800, Stephen Hemminger wrote:
> On Wed, 9 Mar 2011 07:53:19 +0100
> Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote:
>
> > In my setup, there is no congestion at all (except for buffer bloat). Without Hystart, transferring 8 Gb of data takes 9 s, with CUBIC exiting slow start at ~2000 packets. With Hystart, transferring 8 Gb of data takes 19 s, with CUBIC exiting slow start at ~20 packets. I don't think this is "hystart performing well". We could just as well remove slow start completely and only do congestion avoidance, then.
> >
> > While I see the value in Hystart, it's clear that there are some flaws in the current implementation. It probably makes sense to disable hystart by default until those problems are fixed.
>
> What is the speed and the RTT of your network?
> I think you may be blaming hystart for other issues in the network.

What kind of issues?

Host1 is connected through a Gigabit Ethernet LAN to Router1.
Host2 is connected through a Gigabit Ethernet LAN to Router2.
Router1 and Router2 are connected through an experimentation network at
10 Gb/s. The RTT between Host1 and Host2 is 11.3 ms. The network is not
congested.

(I can provide access to the testbed if someone wants to do further
testing)

--
| Lucas Nussbaum                 MCF Université Nancy 2 |
| lucas.nussbaum@loria.fr        LORIA / AlGorille      |
| http://www.loria.fr/~lnussbau/ +33 3 54 95 86 19      |

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-09 18:25                 ` Lucas Nussbaum
@ 2011-03-09 19:56                   ` Stephen Hemminger
  2011-03-09 21:28                     ` Lucas Nussbaum
  1 sibling, 1 reply; 27+ messages in thread

From: Stephen Hemminger @ 2011-03-09 19:56 UTC (permalink / raw)
To: Lucas Nussbaum
Cc: Injong Rhee, David Miller, xiyou.wangcong, netdev, sangtae.ha

On Wed, 9 Mar 2011 19:25:05 +0100
Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote:

> On 09/03/11 at 09:56 -0800, Stephen Hemminger wrote:
> > What is the speed and the RTT of your network?
> > I think you may be blaming hystart for other issues in the network.
>
> What kind of issues?
>
> Host1 is connected through a Gigabit Ethernet LAN to Router1.
> Host2 is connected through a Gigabit Ethernet LAN to Router2.
> Router1 and Router2 are connected through an experimentation network at 10 Gb/s. The RTT between Host1 and Host2 is 11.3 ms. The network is not congested.
>
> (I can provide access to the testbed if someone wants to do further testing)

Your backbone is faster than the LAN; interesting. Could you check
packet stats to see where packet drops are occurring? It could be that
the routers don't have enough buffering to absorb packet trains from the
10G network and pace them out to the 1G network.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-09 19:56                   ` Stephen Hemminger
@ 2011-03-09 21:28                     ` Lucas Nussbaum
  0 siblings, 0 replies; 27+ messages in thread

From: Lucas Nussbaum @ 2011-03-09 21:28 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Injong Rhee, David Miller, xiyou.wangcong, netdev, sangtae.ha

On 09/03/11 at 11:56 -0800, Stephen Hemminger wrote:
> On Wed, 9 Mar 2011 19:25:05 +0100
> Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote:
>
> > Host1 is connected through a Gigabit Ethernet LAN to Router1. Host2 is connected through a Gigabit Ethernet LAN to Router2. Router1 and Router2 are connected through an experimentation network at 10 Gb/s. The RTT between Host1 and Host2 is 11.3 ms. The network is not congested.
> >
> > (I can provide access to the testbed if someone wants to do further testing)
>
> Your backbone is faster than the LAN; interesting. Could you check packet stats to see where packet drops are occurring? It could be that the routers don't have enough buffering to absorb packet trains from the 10G network and pace them out to the 1G network.

I don't have access to the routers to check the packet counts here.
However, according to "netstat -s" on the sender(s), no retransmissions
are occurring, whether hystart is enabled or not: the host can just send
data at the network rate without experiencing congestion anywhere. Also,
it is unlikely that transient congestion in the backbone is an issue,
according to the monitoring tools I have access to.

(Replying to your other mail as well)

> By my calculations (1G * 11.3ms) gives a BDP of 941 packets, which means CUBIC would ideally exit slow start at 900 or so packets. The old CUBIC slow start of 2000 packets means there is a huge overshoot, which means a large packet loss burst, which would cause a large CPU load on the receiver processing SACKs.

Since the network capacity is higher than or equal to the network
capacity on the host, there's no reason why losses would occur if
there's no congestion caused by other traffic, right?

> I assume you haven't done anything that would disable RFC1323 support like turning off window scaling or tcp timestamps.

No, nothing strange that could cause different results.

I've tried to exclude hardware problems by using different parts of the
testbed (see the map at
https://www.grid5000.fr/mediawiki/images/Renater5-g5k.jpg). I used
machines in Rennes, Lille, Lyon and Grenoble today (using different
hardware). My original testing was done between Rennes and Nancy. The
same symptoms appear everywhere, in both directions, and disappear when
hystart is disabled.

--
| Lucas Nussbaum                 MCF Université Nancy 2 |
| lucas.nussbaum@loria.fr        LORIA / AlGorille      |
| http://www.loria.fr/~lnussbau/ +33 3 54 95 86 19      |

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations 2011-03-09 18:25 ` Lucas Nussbaum 2011-03-09 19:56 ` Stephen Hemminger @ 2011-03-09 20:01 ` Stephen Hemminger 2011-03-09 21:12 ` Yuchung Cheng 1 sibling, 1 reply; 27+ messages in thread From: Stephen Hemminger @ 2011-03-09 20:01 UTC (permalink / raw) To: Lucas Nussbaum Cc: Injong Rhee, David Miller, xiyou.wangcong, netdev, sangtae.ha On Wed, 9 Mar 2011 19:25:05 +0100 Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > On 09/03/11 at 09:56 -0800, Stephen Hemminger wrote: > > On Wed, 9 Mar 2011 07:53:19 +0100 > > Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > > > > > On 08/03/11 at 20:30 -0500, Injong Rhee wrote: > > > > Now, both tools can be wrong. But that is not catastrophic since > > > > congestion avoidance can kick in to save the day. In a pipe where no > > > > other flows are competing, then exiting slow start too early can > > > > slow things down as the window can be still too small. But that is > > > > in fact when delays are most reliable. So those tests that say bad > > > > performance with hystart are in fact, where hystart is supposed to > > > > perform well. > > > > > > Hi, > > > > > > In my setup, there is no congestion at all (except the buffer bloat). > > > Without Hystart, transferring 8 Gb of data takes 9s, with CUBIC exiting > > > slow start at ~2000 packets. > > > With Hystart, transferring 8 Gb of data takes 19s, with CUBIC exiting > > > slow start at ~20 packets. > > > I don't think that this is "hystart performing well". We could just as > > > well remove slow start completely, and only do congestion avoidance, > > > then. > > > > > > While I see the value in Hystart, it's clear that there are some flaws > > > in the current implementation. It probably makes sense to disable > > > hystart by default until those problems are fixed. > > > > What is the speed and RTT time of your network? > > I think you maybe blaming hystart for other issues in the network. 
>
> What kind of issues?
>
> Host1 is connected through a gigabit ethernet LAN to Router1
> Host2 is connected through a gigabit ethernet LAN to Router2
> Router1 and Router2 are connected through an experimentation network at
> 10 Gb/s
> RTT between Host1 and Host2 is 11.3ms.
> The network is not congested.

By my calculations (1G * 11.3ms) gives a BDP of 941 packets, which means CUBIC would ideally exit slow start at 900 or so packets. The old CUBIC slow start exiting at 2000 packets means there is a huge overshoot, which means a large packet-loss burst, which would cause a large CPU load on the receiver processing SACKs.

I assume you haven't done anything that would disable RFC1323 support, like turning off window scaling or tcp timestamps.

--

^ permalink raw reply [flat|nested] 27+ messages in thread
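Stephen's back-of-the-envelope figure can be checked with standard bandwidth-delay-product arithmetic. The sketch below is not from the thread; it just reproduces the calculation, assuming full-sized 1500-byte Ethernet frames:

```python
def bdp_packets(rate_bps, rtt_s, mtu_bytes=1500):
    """Bandwidth-delay product of a path, expressed in full-sized packets."""
    bits_in_flight = rate_bps * rtt_s  # bits the pipe holds at steady state
    return int(bits_in_flight / (mtu_bytes * 8))

# 1 Gb/s bottleneck link, 11.3 ms RTT -> about 941 packets in flight,
# matching the "ideally exit slow start at 900 or so packets" estimate.
print(bdp_packets(1e9, 11.3e-3))
```

With Bill's later 9000-byte jumbo frames the same path would hold far fewer packets, which is worth keeping in mind when comparing cwnd figures across the two testbeds.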
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations 2011-03-09 20:01 ` Stephen Hemminger @ 2011-03-09 21:12 ` Yuchung Cheng 2011-03-09 21:33 ` Lucas Nussbaum 0 siblings, 1 reply; 27+ messages in thread From: Yuchung Cheng @ 2011-03-09 21:12 UTC (permalink / raw) To: Stephen Hemminger Cc: Lucas Nussbaum, Injong Rhee, David Miller, xiyou.wangcong, netdev, sangtae.ha On Wed, Mar 9, 2011 at 12:01 PM, Stephen Hemminger <shemminger@vyatta.com> wrote: > On Wed, 9 Mar 2011 19:25:05 +0100 > Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > >> On 09/03/11 at 09:56 -0800, Stephen Hemminger wrote: >> > On Wed, 9 Mar 2011 07:53:19 +0100 >> > Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: >> > >> > > On 08/03/11 at 20:30 -0500, Injong Rhee wrote: >> > > > Now, both tools can be wrong. But that is not catastrophic since >> > > > congestion avoidance can kick in to save the day. In a pipe where no >> > > > other flows are competing, then exiting slow start too early can >> > > > slow things down as the window can be still too small. But that is >> > > > in fact when delays are most reliable. So those tests that say bad >> > > > performance with hystart are in fact, where hystart is supposed to >> > > > perform well. >> > > >> > > Hi, >> > > >> > > In my setup, there is no congestion at all (except the buffer bloat). >> > > Without Hystart, transferring 8 Gb of data takes 9s, with CUBIC exiting >> > > slow start at ~2000 packets. >> > > With Hystart, transferring 8 Gb of data takes 19s, with CUBIC exiting >> > > slow start at ~20 packets. >> > > I don't think that this is "hystart performing well". We could just as >> > > well remove slow start completely, and only do congestion avoidance, >> > > then. >> > > >> > > While I see the value in Hystart, it's clear that there are some flaws >> > > in the current implementation. It probably makes sense to disable >> > > hystart by default until those problems are fixed. 
>> >
>> > What is the speed and RTT time of your network?
>> > I think you maybe blaming hystart for other issues in the network.
>>
>> What kind of issues?
>>
>> Host1 is connected through a gigabit ethernet LAN to Router1
>> Host2 is connected through a gigabit ethernet LAN to Router2
>> Router1 and Router2 are connected through an experimentation network at
>> 10 Gb/s
>> RTT between Host1 and Host2 is 11.3ms.
>> The network is not congested.
>
> By my calculations (1G * 11.3ms) gives BDP of 941 packets which means
> CUBIC would ideally exit slow start at 900 or so packets. Old CUBIC
> slowstrart of 2000 packets means there is huge overshoot which means
> large packet loss burst which would cause a large CPU load on receiver
> processing SACK.

It's not clear from Lucas's report whether hystart is exiting when cwnd=2000 or when the sender has sent 2000 packets. Lucas, could you clarify?

> I assume you haven't done anything that would disable RFC1323
> support like turn off window scaling or tcp timestamps.

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations 2011-03-09 21:12 ` Yuchung Cheng @ 2011-03-09 21:33 ` Lucas Nussbaum 2011-03-09 21:51 ` Stephen Hemminger 0 siblings, 1 reply; 27+ messages in thread From: Lucas Nussbaum @ 2011-03-09 21:33 UTC (permalink / raw) To: Yuchung Cheng Cc: Stephen Hemminger, Injong Rhee, David Miller, xiyou.wangcong, netdev, sangtae.ha On 09/03/11 at 13:12 -0800, Yuchung Cheng wrote: > On Wed, Mar 9, 2011 at 12:01 PM, Stephen Hemminger > <shemminger@vyatta.com> wrote: > > On Wed, 9 Mar 2011 19:25:05 +0100 > > Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > > > >> On 09/03/11 at 09:56 -0800, Stephen Hemminger wrote: > >> > On Wed, 9 Mar 2011 07:53:19 +0100 > >> > Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > >> > > >> > > On 08/03/11 at 20:30 -0500, Injong Rhee wrote: > >> > > > Now, both tools can be wrong. But that is not catastrophic since > >> > > > congestion avoidance can kick in to save the day. In a pipe where no > >> > > > other flows are competing, then exiting slow start too early can > >> > > > slow things down as the window can be still too small. But that is > >> > > > in fact when delays are most reliable. So those tests that say bad > >> > > > performance with hystart are in fact, where hystart is supposed to > >> > > > perform well. > >> > > > >> > > Hi, > >> > > > >> > > In my setup, there is no congestion at all (except the buffer bloat). > >> > > Without Hystart, transferring 8 Gb of data takes 9s, with CUBIC exiting > >> > > slow start at ~2000 packets. > >> > > With Hystart, transferring 8 Gb of data takes 19s, with CUBIC exiting > >> > > slow start at ~20 packets. > >> > > I don't think that this is "hystart performing well". We could just as > >> > > well remove slow start completely, and only do congestion avoidance, > >> > > then. > >> > > > >> > > While I see the value in Hystart, it's clear that there are some flaws > >> > > in the current implementation. 
It probably makes sense to disable > >> > > hystart by default until those problems are fixed. > >> > > >> > What is the speed and RTT time of your network? > >> > I think you maybe blaming hystart for other issues in the network. > >> > >> What kind of issues? > >> > >> Host1 is connected through a gigabit ethernet LAN to Router1 > >> Host2 is connected through a gigabit ethernet LAN to Router2 > >> Router1 and Router2 are connected through an experimentation network at > >> 10 Gb/s > >> RTT between Host1 and Host2 is 11.3ms. > >> The network is not congested. > > > > By my calculations (1G * 11.3ms) gives BDP of 941 packets which means > > CUBIC would ideally exit slow start at 900 or so packets. Old CUBIC > > slowstrart of 2000 packets means there is huge overshoot which means > > large packet loss burst which would cause a large CPU load on receiver > > processing SACK. > It's not clear from Lucas's report that the hystart is exiting when > cwnd=2000 or when sender has sent 2000 packets. > Lucas could you clarify? When cwnd is around 2000. -- | Lucas Nussbaum MCF Université Nancy 2 | | lucas.nussbaum@loria.fr LORIA / AlGorille | | http://www.loria.fr/~lnussbau/ +33 3 54 95 86 19 | ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations 2011-03-09 21:33 ` Lucas Nussbaum @ 2011-03-09 21:51 ` Stephen Hemminger 2011-03-09 22:03 ` Lucas Nussbaum 0 siblings, 1 reply; 27+ messages in thread From: Stephen Hemminger @ 2011-03-09 21:51 UTC (permalink / raw) To: Lucas Nussbaum Cc: Yuchung Cheng, Injong Rhee, David Miller, xiyou.wangcong, netdev, sangtae.ha On Wed, 9 Mar 2011 22:33:56 +0100 Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > On 09/03/11 at 13:12 -0800, Yuchung Cheng wrote: > > On Wed, Mar 9, 2011 at 12:01 PM, Stephen Hemminger > > <shemminger@vyatta.com> wrote: > > > On Wed, 9 Mar 2011 19:25:05 +0100 > > > Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > > > > > >> On 09/03/11 at 09:56 -0800, Stephen Hemminger wrote: > > >> > On Wed, 9 Mar 2011 07:53:19 +0100 > > >> > Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > > >> > > > >> > > On 08/03/11 at 20:30 -0500, Injong Rhee wrote: > > >> > > > Now, both tools can be wrong. But that is not catastrophic since > > >> > > > congestion avoidance can kick in to save the day. In a pipe where no > > >> > > > other flows are competing, then exiting slow start too early can > > >> > > > slow things down as the window can be still too small. But that is > > >> > > > in fact when delays are most reliable. So those tests that say bad > > >> > > > performance with hystart are in fact, where hystart is supposed to > > >> > > > perform well. > > >> > > > > >> > > Hi, > > >> > > > > >> > > In my setup, there is no congestion at all (except the buffer bloat). > > >> > > Without Hystart, transferring 8 Gb of data takes 9s, with CUBIC exiting > > >> > > slow start at ~2000 packets. > > >> > > With Hystart, transferring 8 Gb of data takes 19s, with CUBIC exiting > > >> > > slow start at ~20 packets. > > >> > > I don't think that this is "hystart performing well". We could just as > > >> > > well remove slow start completely, and only do congestion avoidance, > > >> > > then. 
> > >> > > > > >> > > While I see the value in Hystart, it's clear that there are some flaws > > >> > > in the current implementation. It probably makes sense to disable > > >> > > hystart by default until those problems are fixed. > > >> > > > >> > What is the speed and RTT time of your network? > > >> > I think you maybe blaming hystart for other issues in the network. > > >> > > >> What kind of issues? > > >> > > >> Host1 is connected through a gigabit ethernet LAN to Router1 > > >> Host2 is connected through a gigabit ethernet LAN to Router2 > > >> Router1 and Router2 are connected through an experimentation network at > > >> 10 Gb/s > > >> RTT between Host1 and Host2 is 11.3ms. > > >> The network is not congested. > > > > > > By my calculations (1G * 11.3ms) gives BDP of 941 packets which means > > > CUBIC would ideally exit slow start at 900 or so packets. Old CUBIC > > > slowstrart of 2000 packets means there is huge overshoot which means > > > large packet loss burst which would cause a large CPU load on receiver > > > processing SACK. > > It's not clear from Lucas's report that the hystart is exiting when > > cwnd=2000 or when sender has sent 2000 packets. > > Lucas could you clarify? > > When cwnd is around 2000. What is HZ on the kernel configuration. Part of the problem is the hystart code was only tested with HZ=1000 and there are some bad assumptions there. -- ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations 2011-03-09 21:51 ` Stephen Hemminger @ 2011-03-09 22:03 ` Lucas Nussbaum 0 siblings, 0 replies; 27+ messages in thread From: Lucas Nussbaum @ 2011-03-09 22:03 UTC (permalink / raw) To: Stephen Hemminger Cc: Yuchung Cheng, Injong Rhee, David Miller, xiyou.wangcong, netdev, sangtae.ha On 09/03/11 at 13:51 -0800, Stephen Hemminger wrote: > On Wed, 9 Mar 2011 22:33:56 +0100 > Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > > > On 09/03/11 at 13:12 -0800, Yuchung Cheng wrote: > > > On Wed, Mar 9, 2011 at 12:01 PM, Stephen Hemminger > > > <shemminger@vyatta.com> wrote: > > > > On Wed, 9 Mar 2011 19:25:05 +0100 > > > > Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > > > > > > > >> On 09/03/11 at 09:56 -0800, Stephen Hemminger wrote: > > > >> > On Wed, 9 Mar 2011 07:53:19 +0100 > > > >> > Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote: > > > >> > > > > >> > > On 08/03/11 at 20:30 -0500, Injong Rhee wrote: > > > >> > > > Now, both tools can be wrong. But that is not catastrophic since > > > >> > > > congestion avoidance can kick in to save the day. In a pipe where no > > > >> > > > other flows are competing, then exiting slow start too early can > > > >> > > > slow things down as the window can be still too small. But that is > > > >> > > > in fact when delays are most reliable. So those tests that say bad > > > >> > > > performance with hystart are in fact, where hystart is supposed to > > > >> > > > perform well. > > > >> > > > > > >> > > Hi, > > > >> > > > > > >> > > In my setup, there is no congestion at all (except the buffer bloat). > > > >> > > Without Hystart, transferring 8 Gb of data takes 9s, with CUBIC exiting > > > >> > > slow start at ~2000 packets. > > > >> > > With Hystart, transferring 8 Gb of data takes 19s, with CUBIC exiting > > > >> > > slow start at ~20 packets. > > > >> > > I don't think that this is "hystart performing well". 
We could just as > > > >> > > well remove slow start completely, and only do congestion avoidance, > > > >> > > then. > > > >> > > > > > >> > > While I see the value in Hystart, it's clear that there are some flaws > > > >> > > in the current implementation. It probably makes sense to disable > > > >> > > hystart by default until those problems are fixed. > > > >> > > > > >> > What is the speed and RTT time of your network? > > > >> > I think you maybe blaming hystart for other issues in the network. > > > >> > > > >> What kind of issues? > > > >> > > > >> Host1 is connected through a gigabit ethernet LAN to Router1 > > > >> Host2 is connected through a gigabit ethernet LAN to Router2 > > > >> Router1 and Router2 are connected through an experimentation network at > > > >> 10 Gb/s > > > >> RTT between Host1 and Host2 is 11.3ms. > > > >> The network is not congested. > > > > > > > > By my calculations (1G * 11.3ms) gives BDP of 941 packets which means > > > > CUBIC would ideally exit slow start at 900 or so packets. Old CUBIC > > > > slowstrart of 2000 packets means there is huge overshoot which means > > > > large packet loss burst which would cause a large CPU load on receiver > > > > processing SACK. > > > It's not clear from Lucas's report that the hystart is exiting when > > > cwnd=2000 or when sender has sent 2000 packets. > > > Lucas could you clarify? > > > > When cwnd is around 2000. > > What is HZ on the kernel configuration. Part of the problem is the hystart > code was only tested with HZ=1000 and there are some bad assumptions there. $ grep HZ /boot/config-2.6.32-5-amd64 CONFIG_NO_HZ=y # CONFIG_HZ_100 is not set CONFIG_HZ_250=y # CONFIG_HZ_300 is not set # CONFIG_HZ_1000 is not set CONFIG_HZ=250 -- | Lucas Nussbaum MCF Université Nancy 2 | | lucas.nussbaum@loria.fr LORIA / AlGorille | | http://www.loria.fr/~lnussbau/ +33 3 54 95 86 19 | ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations 2011-03-09 6:53 ` Lucas Nussbaum 2011-03-09 17:56 ` Stephen Hemminger @ 2011-03-10 5:24 ` Bill Fink 2011-03-10 6:17 ` Stephen Hemminger 2011-03-10 14:37 ` Injong Rhee 1 sibling, 2 replies; 27+ messages in thread From: Bill Fink @ 2011-03-10 5:24 UTC (permalink / raw) To: Lucas Nussbaum Cc: Injong Rhee, Stephen Hemminger, David Miller, xiyou.wangcong, netdev, sangtae.ha On Wed, 9 Mar 2011, Lucas Nussbaum wrote: > On 08/03/11 at 20:30 -0500, Injong Rhee wrote: > > Now, both tools can be wrong. But that is not catastrophic since > > congestion avoidance can kick in to save the day. In a pipe where no > > other flows are competing, then exiting slow start too early can > > slow things down as the window can be still too small. But that is > > in fact when delays are most reliable. So those tests that say bad > > performance with hystart are in fact, where hystart is supposed to > > perform well. > > Hi, > > In my setup, there is no congestion at all (except the buffer bloat). > Without Hystart, transferring 8 Gb of data takes 9s, with CUBIC exiting > slow start at ~2000 packets. > With Hystart, transferring 8 Gb of data takes 19s, with CUBIC exiting > slow start at ~20 packets. > I don't think that this is "hystart performing well". We could just as > well remove slow start completely, and only do congestion avoidance, > then. > > While I see the value in Hystart, it's clear that there are some flaws > in the current implementation. It probably makes sense to disable > hystart by default until those problems are fixed. Here are some tests I performed across real networks, where congestion is generally not an issue, with a 2.6.35 kernel on the transmit side. 
8 GB transfer across an 18 ms RTT path with autotuning and hystart:

i7test7% nuttcp -n8g -i1 192.168.1.23
 517.9375 MB / 1.00 sec = 4344.6096 Mbps 0 retrans
 688.4375 MB / 1.00 sec = 5775.1998 Mbps 0 retrans
 692.9375 MB / 1.00 sec = 5812.7462 Mbps 0 retrans
 698.0625 MB / 1.00 sec = 5855.8078 Mbps 0 retrans
 699.8750 MB / 1.00 sec = 5871.0123 Mbps 0 retrans
 710.5625 MB / 1.00 sec = 5960.5707 Mbps 0 retrans
 728.8125 MB / 1.00 sec = 6113.7652 Mbps 0 retrans
 751.3750 MB / 1.00 sec = 6302.9210 Mbps 0 retrans
 783.8750 MB / 1.00 sec = 6575.6201 Mbps 0 retrans
 825.1875 MB / 1.00 sec = 6921.8145 Mbps 0 retrans
 875.4375 MB / 1.00 sec = 7343.9811 Mbps 0 retrans
8192.0000 MB / 11.26 sec = 6102.4718 Mbps 11 %TX 28 %RX 0 retrans 18.92 msRTT

Ramps up quickly to a little under 6 Gbps, then increases more slowly to 7+ Gbps, with no TCP retransmissions.

8 GB transfer across an 18 ms RTT path with 40 MB socket buffer and hystart:

i7test7% nuttcp -n8g -w40m -i1 192.168.1.23
 970.0625 MB / 1.00 sec = 8136.8475 Mbps 0 retrans
1181.1875 MB / 1.00 sec = 9909.0045 Mbps 0 retrans
1181.2500 MB / 1.00 sec = 9908.6369 Mbps 0 retrans
1181.3125 MB / 1.00 sec = 9909.8747 Mbps 0 retrans
1181.2500 MB / 1.00 sec = 9909.0531 Mbps 0 retrans
1181.2500 MB / 1.00 sec = 9908.8153 Mbps 0 retrans
1181.2500 MB / 1.00 sec = 9909.0729 Mbps 0 retrans
8192.0000 MB / 7.13 sec = 9633.5814 Mbps 17 %TX 42 %RX 0 retrans 18.91 msRTT

Quickly ramps up to full 10-GigE line rate, with no TCP retrans.
8 GB transfer across an 18 ms RTT path with autotuning and no hystart:

i7test7% nuttcp -n8g -i1 192.168.1.23
 845.4375 MB / 1.00 sec = 7091.5828 Mbps 0 retrans
1181.3125 MB / 1.00 sec = 9910.0134 Mbps 0 retrans
1181.0625 MB / 1.00 sec = 9907.1830 Mbps 0 retrans
1181.4375 MB / 1.00 sec = 9910.8936 Mbps 0 retrans
1181.1875 MB / 1.00 sec = 9908.1721 Mbps 0 retrans
1181.3125 MB / 1.00 sec = 9909.5774 Mbps 0 retrans
1181.1875 MB / 1.00 sec = 9908.6874 Mbps 0 retrans
8192.0000 MB / 7.25 sec = 9484.4524 Mbps 18 %TX 41 %RX 0 retrans 18.92 msRTT

Also quickly ramps up to full 10-GigE line rate, with no TCP retrans.

8 GB transfer across an 18 ms RTT path with 40 MB socket buffer and no hystart:

i7test7% nuttcp -n8g -w40m -i1 192.168.1.23
 969.8750 MB / 1.00 sec = 8135.6571 Mbps 0 retrans
1181.3125 MB / 1.00 sec = 9909.3990 Mbps 0 retrans
1181.2500 MB / 1.00 sec = 9908.9342 Mbps 0 retrans
1181.2500 MB / 1.00 sec = 9909.4098 Mbps 0 retrans
1181.2500 MB / 1.00 sec = 9908.8252 Mbps 0 retrans
1181.2500 MB / 1.00 sec = 9909.0630 Mbps 0 retrans
1181.2500 MB / 1.00 sec = 9909.3504 Mbps 0 retrans
8192.0000 MB / 7.15 sec = 9611.8053 Mbps 18 %TX 42 %RX 0 retrans 18.95 msRTT

Basically the same as the case with 40 MB socket buffer and hystart enabled.

Now trying the same type of tests across an 80 ms RTT path.
8 GB transfer across an 80 ms RTT path with autotuning and hystart:

i7test7% nuttcp -n8g -i1 192.168.1.18
  11.3125 MB / 1.00 sec = 94.8954 Mbps 0 retrans
 441.5625 MB / 1.00 sec = 3704.1021 Mbps 0 retrans
 687.3750 MB / 1.00 sec = 5765.8657 Mbps 0 retrans
 715.5625 MB / 1.00 sec = 6002.6273 Mbps 0 retrans
 709.9375 MB / 1.00 sec = 5955.5958 Mbps 0 retrans
 691.3125 MB / 1.00 sec = 5799.0626 Mbps 0 retrans
 718.6250 MB / 1.00 sec = 6028.3538 Mbps 0 retrans
 718.0000 MB / 1.00 sec = 6023.0205 Mbps 0 retrans
 704.0000 MB / 1.00 sec = 5905.5387 Mbps 0 retrans
 733.3125 MB / 1.00 sec = 6151.4096 Mbps 0 retrans
 738.8750 MB / 1.00 sec = 6198.2381 Mbps 0 retrans
 731.8750 MB / 1.00 sec = 6139.3695 Mbps 0 retrans
8192.0000 MB / 12.85 sec = 5348.9677 Mbps 10 %TX 23 %RX 0 retrans 80.81 msRTT

Similar to the 18 ms RTT path, but achieving somewhat lower performance levels, presumably due to the larger RTT. Ramps up fairly quickly to a little under 6 Gbps, then increases more slowly to 6+ Gbps, with no TCP retransmissions.

8 GB transfer across an 80 ms RTT path with 100 MB socket buffer and hystart:

i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
 103.9375 MB / 1.00 sec = 871.8378 Mbps 0 retrans
1086.5625 MB / 1.00 sec = 9114.6102 Mbps 0 retrans
1106.6875 MB / 1.00 sec = 9283.5583 Mbps 0 retrans
1109.3125 MB / 1.00 sec = 9305.5226 Mbps 0 retrans
1111.1875 MB / 1.00 sec = 9321.9596 Mbps 0 retrans
1112.8125 MB / 1.00 sec = 9334.8452 Mbps 0 retrans
1113.6875 MB / 1.00 sec = 9341.6620 Mbps 0 retrans
1120.2500 MB / 1.00 sec = 9398.0054 Mbps 0 retrans
8192.0000 MB / 8.37 sec = 8207.2049 Mbps 16 %TX 38 %RX 0 retrans 80.81 msRTT

Quickly ramps up to 9+ Gbps and then slowly increases further, with no TCP retrans.
8 GB transfer across an 80 ms RTT path with autotuning and no hystart:

i7test7% nuttcp -n8g -i1 192.168.1.18
  11.2500 MB / 1.00 sec = 94.3703 Mbps 0 retrans
 519.0625 MB / 1.00 sec = 4354.1596 Mbps 0 retrans
 861.2500 MB / 1.00 sec = 7224.7970 Mbps 0 retrans
 871.0000 MB / 1.00 sec = 7306.4191 Mbps 0 retrans
 860.7500 MB / 1.00 sec = 7220.4438 Mbps 0 retrans
 869.0625 MB / 1.00 sec = 7290.3340 Mbps 0 retrans
 863.4375 MB / 1.00 sec = 7242.7707 Mbps 0 retrans
 860.4375 MB / 1.00 sec = 7218.0606 Mbps 0 retrans
 875.5000 MB / 1.00 sec = 7344.3071 Mbps 0 retrans
 863.1875 MB / 1.00 sec = 7240.8257 Mbps 0 retrans
8192.0000 MB / 10.98 sec = 6259.4379 Mbps 12 %TX 27 %RX 0 retrans 80.81 msRTT

Ramps up quickly to 7+ Gbps, then appears to stabilize at that level, with no TCP retransmissions. Performance is somewhat better than the autotuning case with hystart enabled, but less than using a manually set 100 MB socket buffer.

8 GB transfer across an 80 ms RTT path with 100 MB socket buffer and no hystart:

i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
 102.8750 MB / 1.00 sec = 862.9487 Mbps 0 retrans
 522.8750 MB / 1.00 sec = 4386.2811 Mbps 414 retrans
 881.5625 MB / 1.00 sec = 7394.6534 Mbps 0 retrans
1164.3125 MB / 1.00 sec = 9766.6682 Mbps 0 retrans
1170.5625 MB / 1.00 sec = 9819.7042 Mbps 0 retrans
1166.8125 MB / 1.00 sec = 9788.2067 Mbps 0 retrans
1159.8750 MB / 1.00 sec = 9729.1530 Mbps 0 retrans
 811.1250 MB / 1.00 sec = 6804.8017 Mbps 21 retrans
  73.2500 MB / 1.00 sec = 614.4674 Mbps 0 retrans
 884.6250 MB / 1.00 sec = 7420.2900 Mbps 0 retrans
8192.0000 MB / 10.34 sec = 6647.9394 Mbps 13 %TX 31 %RX 435 retrans 80.81 msRTT

Disabling hystart on a large-RTT path does not seem to play nice with a manually specified socket buffer, resulting in TCP retransmissions that limit the effective network performance. This is a repeatable but extremely variable phenomenon.
i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
 103.7500 MB / 1.00 sec = 870.3015 Mbps 0 retrans
1146.3750 MB / 1.00 sec = 9616.4520 Mbps 0 retrans
1175.9375 MB / 1.00 sec = 9864.6070 Mbps 0 retrans
 615.6875 MB / 1.00 sec = 5164.7353 Mbps 21 retrans
 139.2500 MB / 1.00 sec = 1168.1253 Mbps 0 retrans
1090.0625 MB / 1.00 sec = 9143.8053 Mbps 0 retrans
1170.4375 MB / 1.00 sec = 9818.6654 Mbps 0 retrans
1174.5625 MB / 1.00 sec = 9852.8754 Mbps 0 retrans
1174.8750 MB / 1.00 sec = 9855.6052 Mbps 0 retrans
8192.0000 MB / 9.42 sec = 7292.9879 Mbps 14 %TX 34 %RX 21 retrans 80.81 msRTT

And:

i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
 102.8125 MB / 1.00 sec = 862.4227 Mbps 0 retrans
1148.4375 MB / 1.00 sec = 9633.6860 Mbps 0 retrans
1177.4375 MB / 1.00 sec = 9877.3086 Mbps 0 retrans
1168.1250 MB / 1.00 sec = 9798.9133 Mbps 11 retrans
 133.1250 MB / 1.00 sec = 1116.7457 Mbps 0 retrans
 479.8750 MB / 1.00 sec = 4025.4631 Mbps 0 retrans
1150.6875 MB / 1.00 sec = 9652.4830 Mbps 0 retrans
1177.3125 MB / 1.00 sec = 9876.0624 Mbps 0 retrans
1177.3750 MB / 1.00 sec = 9876.0139 Mbps 0 retrans
 320.2500 MB / 1.00 sec = 2686.6452 Mbps 19 retrans
  64.9375 MB / 1.00 sec = 544.7363 Mbps 0 retrans
  73.6250 MB / 1.00 sec = 617.6113 Mbps 0 retrans
8192.0000 MB / 12.39 sec = 5545.7570 Mbps 12 %TX 26 %RX 30 retrans 80.80 msRTT

Re-enabling hystart immediately gives a clean test with no TCP retrans.

i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
 103.8750 MB / 1.00 sec = 871.3353 Mbps 0 retrans
1086.7500 MB / 1.00 sec = 9116.4474 Mbps 0 retrans
1105.8125 MB / 1.00 sec = 9276.2276 Mbps 0 retrans
1109.4375 MB / 1.00 sec = 9306.5339 Mbps 0 retrans
1111.3125 MB / 1.00 sec = 9322.5327 Mbps 0 retrans
1111.3750 MB / 1.00 sec = 9322.8053 Mbps 0 retrans
1113.7500 MB / 1.00 sec = 9342.8962 Mbps 0 retrans
1120.3125 MB / 1.00 sec = 9397.5711 Mbps 0 retrans
8192.0000 MB / 8.38 sec = 8204.8394 Mbps 16 %TX 39 %RX 0 retrans 80.80 msRTT

-Bill

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations 2011-03-10 5:24 ` Bill Fink @ 2011-03-10 6:17 ` Stephen Hemminger 2011-03-10 7:17 ` Bill Fink 2011-03-10 14:37 ` Injong Rhee 1 sibling, 1 reply; 27+ messages in thread From: Stephen Hemminger @ 2011-03-10 6:17 UTC (permalink / raw) To: Bill Fink Cc: Injong Rhee, David Miller, xiyou.wangcong, netdev, sangtae.ha, Lucas Nussbaum

Bill, what is the HZ in your kernel config? I am concerned hystart doesn't work well with HZ=100.

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations 2011-03-10 6:17 ` Stephen Hemminger @ 2011-03-10 7:17 ` Bill Fink 2011-03-10 8:54 ` Lucas Nussbaum 0 siblings, 1 reply; 27+ messages in thread From: Bill Fink @ 2011-03-10 7:17 UTC (permalink / raw) To: Stephen Hemminger Cc: Injong Rhee, David Miller, xiyou wangcong, netdev, sangtae ha, Lucas Nussbaum On Wed, 9 Mar 2011, Stephen Hemminger wrote: > Bill what is the HZ in your kernel config. > I am concerned hystart doesn't work well with HZ=100 HZ=1000 But I did have tcp_timestamps disabled. Should I re-run the tests with tcp_timestamps enabled? -Bill ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations 2011-03-10 7:17 ` Bill Fink @ 2011-03-10 8:54 ` Lucas Nussbaum 2011-03-11 2:25 ` Bill Fink 0 siblings, 1 reply; 27+ messages in thread From: Lucas Nussbaum @ 2011-03-10 8:54 UTC (permalink / raw) To: Bill Fink Cc: Stephen Hemminger, Injong Rhee, David Miller, xiyou wangcong, netdev, sangtae ha On 10/03/11 at 02:17 -0500, Bill Fink wrote: > On Wed, 9 Mar 2011, Stephen Hemminger wrote: > > > Bill what is the HZ in your kernel config. > > I am concerned hystart doesn't work well with HZ=100 > > HZ=1000 > > But I did have tcp_timestamps disabled. Should I re-run > the tests with tcp_timestamps enabled? I ran my tests with timestamps enabled and HZ=250. If you have the opportunity to run tests in the same config, it would be great. The HZ=250 vs HZ=1000 difference could explain why it's working. However, enabling or disabling timestamps shouldn't make a difference, since the hystart code doesn't use TCP_CONG_RTT_STAMP. -- | Lucas Nussbaum MCF Université Nancy 2 | | lucas.nussbaum@loria.fr LORIA / AlGorille | | http://www.loria.fr/~lnussbau/ +33 3 54 95 86 19 | ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations 2011-03-10 8:54 ` Lucas Nussbaum @ 2011-03-11 2:25 ` Bill Fink 0 siblings, 0 replies; 27+ messages in thread From: Bill Fink @ 2011-03-11 2:25 UTC (permalink / raw) To: Lucas Nussbaum Cc: Stephen Hemminger, Injong Rhee, David Miller, xiyou wangcong, netdev, sangtae ha On Thu, 10 Mar 2011, Lucas Nussbaum wrote: > On 10/03/11 at 02:17 -0500, Bill Fink wrote: > > On Wed, 9 Mar 2011, Stephen Hemminger wrote: > > > > > Bill what is the HZ in your kernel config. > > > I am concerned hystart doesn't work well with HZ=100 > > > > HZ=1000 > > > > But I did have tcp_timestamps disabled. Should I re-run > > the tests with tcp_timestamps enabled? > > I ran my tests with timestamps enabled and HZ=250. If you have the > opportunity to run tests in the same config, it would be great. The > HZ=250 vs HZ=1000 difference could explain why it's working. > > However, enabling or disabling timestamps shouldn't make a difference, > since the hystart code doesn't use TCP_CONG_RTT_STAMP. I reran the same tests with HZ=250 and tcp_timestamps enabled. BTW all my tests are with 9000-byte jumbo frames. If you want, I can also try them using standard 1500-byte Ethernet frames. 
First, on the 18 ms RTT path:

8 GB transfer across an 18 ms RTT path with autotuning and hystart:

i7test7% nuttcp -n8g -i1 192.168.1.23
 614.5625 MB / 1.00 sec = 5155.1383 Mbps 0 retrans
 824.2500 MB / 1.00 sec = 6914.5038 Mbps 0 retrans
 826.6875 MB / 1.00 sec = 6934.5632 Mbps 0 retrans
 831.5625 MB / 1.00 sec = 6975.7146 Mbps 0 retrans
 835.1875 MB / 1.00 sec = 7006.1867 Mbps 0 retrans
 844.8125 MB / 1.00 sec = 7086.7867 Mbps 0 retrans
 862.1250 MB / 1.00 sec = 7231.9274 Mbps 0 retrans
 886.5625 MB / 1.00 sec = 7437.0402 Mbps 0 retrans
 918.6875 MB / 1.00 sec = 7706.5633 Mbps 0 retrans
8192.0000 MB / 9.80 sec = 7009.7460 Mbps 12 %TX 31 %RX 0 retrans 18.91 msRTT

Ramps up quickly to a little under 7 Gbps, then increases more slowly to 7.7 Gbps, with no TCP retransmissions. Actually performed somewhat better than the HZ=1000 case.

8 GB transfer across an 18 ms RTT path with 40 MB socket buffer and hystart:

i7test7% nuttcp -n8g -i1 -w40m 192.168.1.23
 716.0000 MB / 1.00 sec = 6006.0812 Mbps 0 retrans
 864.5000 MB / 1.00 sec = 7251.9589 Mbps 0 retrans
 866.1250 MB / 1.00 sec = 7265.4596 Mbps 0 retrans
 871.1250 MB / 1.00 sec = 7307.7746 Mbps 0 retrans
 875.6250 MB / 1.00 sec = 7345.2308 Mbps 0 retrans
 886.1875 MB / 1.00 sec = 7433.8796 Mbps 0 retrans
 904.1250 MB / 1.00 sec = 7584.3654 Mbps 0 retrans
 929.1875 MB / 1.00 sec = 7794.4728 Mbps 0 retrans
 961.6250 MB / 1.00 sec = 8066.7839 Mbps 0 retrans
8192.0000 MB / 9.34 sec = 7356.7856 Mbps 13 %TX 32 %RX 0 retrans 18.92 msRTT

Ramps up quickly to 7+ Gbps, then increases more slowly to 8+ Gbps, with no TCP retransmissions. Performed significantly worse than the HZ=1000 case.
8 GB transfer across an 18 ms RTT path with autotuning and no hystart:

i7test7% nuttcp -n8g -i1 192.168.1.23
  850.8750 MB /   1.00 sec = 7137.3642 Mbps     0 retrans
 1181.3125 MB /   1.00 sec = 9909.3396 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9909.5486 Mbps     0 retrans
 1181.1875 MB /   1.00 sec = 9908.5883 Mbps     0 retrans
 1181.3125 MB /   1.00 sec = 9909.0621 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9909.4396 Mbps     0 retrans
 1181.1875 MB /   1.00 sec = 9908.5189 Mbps     0 retrans

 8192.0000 MB /   7.23 sec = 9499.4276 Mbps 17 %TX 40 %RX 0 retrans 18.95 msRTT

Quickly ramps up to full 10-GigE line rate, with no TCP retrans.
Same performance as the HZ=1000 case.

8 GB transfer across an 18 ms RTT path with 40 MB socket buffer and no hystart:

i7test7% nuttcp -n8g -i1 -w40m 192.168.1.23
  969.8125 MB /   1.00 sec = 8135.2793 Mbps     0 retrans
 1181.1250 MB /   1.00 sec = 9908.0541 Mbps     0 retrans
 1181.3125 MB /   1.00 sec = 9909.1810 Mbps     0 retrans
 1181.3125 MB /   1.00 sec = 9909.9044 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9909.0729 Mbps     0 retrans
 1181.1875 MB /   1.00 sec = 9908.0532 Mbps     0 retrans
 1181.1875 MB /   1.00 sec = 9908.9549 Mbps     0 retrans

 8192.0000 MB /   7.15 sec = 9609.9893 Mbps 17 %TX 41 %RX 0 retrans 18.92 msRTT

Also quickly ramps up to full 10-GigE line rate, with no TCP retrans.
Same performance as the HZ=1000 case.

Now trying the same type of tests across an 80 ms RTT path.
8 GB transfer across an 80 ms RTT path with autotuning and hystart:

i7test7% nuttcp -n8g -i1 192.168.1.18
   10.6250 MB /   1.00 sec =   89.1274 Mbps     0 retrans
  501.7500 MB /   1.00 sec = 4208.6979 Mbps     0 retrans
  872.9375 MB /   1.00 sec = 7323.2651 Mbps     0 retrans
  865.5000 MB /   1.00 sec = 7259.8901 Mbps     0 retrans
  854.9375 MB /   1.00 sec = 7172.0224 Mbps     0 retrans
  872.0000 MB /   1.00 sec = 7314.8735 Mbps     0 retrans
  866.6875 MB /   1.00 sec = 7270.3017 Mbps     0 retrans
  855.1250 MB /   1.00 sec = 7172.9354 Mbps     0 retrans
  868.7500 MB /   1.00 sec = 7288.1352 Mbps     0 retrans
  868.3750 MB /   1.00 sec = 7283.8238 Mbps     0 retrans

 8192.0000 MB /  10.99 sec = 6250.8745 Mbps 11 %TX 25 %RX 0 retrans 80.78 msRTT

Similar to the 18 ms RTT path, but achieving somewhat lower performance
levels, presumably due to the larger RTT.  Ramps up fairly quickly to
7+ Gbps, then appears to stabilize at that level, with no TCP
retransmissions.  Somewhat better performance than the HZ=1000 case.

8 GB transfer across an 80 ms RTT path with 100 MB socket buffer and hystart:

i7test7% nuttcp -n8g -i1 -w100m 192.168.1.18
  103.8125 MB /   1.00 sec =  870.8197 Mbps     0 retrans
 1071.6875 MB /   1.00 sec = 8989.8315 Mbps     0 retrans
 1089.6250 MB /   1.00 sec = 9140.6929 Mbps     0 retrans
 1093.4375 MB /   1.00 sec = 9172.4186 Mbps     0 retrans
 1095.1875 MB /   1.00 sec = 9187.1262 Mbps     0 retrans
 1094.7500 MB /   1.00 sec = 9183.3460 Mbps     0 retrans
 1097.8750 MB /   1.00 sec = 9208.9431 Mbps     0 retrans
 1103.9375 MB /   1.00 sec = 9261.2584 Mbps     0 retrans

 8192.0000 MB /   8.48 sec = 8102.4984 Mbps 15 %TX 38 %RX 0 retrans 80.81 msRTT

Quickly ramps up to 9 Gbps and then slowly increases further, with no
TCP retrans.  Basically the same performance as the HZ=1000 case.
8 GB transfer across an 80 ms RTT path with autotuning and no hystart:

i7test7% nuttcp -n8g -i1 192.168.1.18
   10.0000 MB /   1.00 sec =   83.8847 Mbps     0 retrans
  482.3125 MB /   1.00 sec = 4045.8172 Mbps     0 retrans
  863.2500 MB /   1.00 sec = 7241.4224 Mbps     0 retrans
  874.3750 MB /   1.00 sec = 7334.7304 Mbps     0 retrans
  855.0000 MB /   1.00 sec = 7172.3889 Mbps     0 retrans
  863.6250 MB /   1.00 sec = 7244.6840 Mbps     0 retrans
  875.0625 MB /   1.00 sec = 7340.5489 Mbps     0 retrans
  855.1875 MB /   1.00 sec = 7173.6390 Mbps     0 retrans
  863.8750 MB /   1.00 sec = 7246.9044 Mbps     0 retrans
  873.3125 MB /   1.00 sec = 7325.9788 Mbps     0 retrans

 8192.0000 MB /  10.99 sec = 6253.7478 Mbps 11 %TX 26 %RX 0 retrans 80.80 msRTT

Ramps up quickly to 7+ Gbps, then appears to stabilize at that level,
with no TCP retransmissions.  Performance is the same as with
autotuning and hystart enabled, but less than using a manually set
100 MB socket buffer.  Same performance as the HZ=1000 case.

8 GB transfer across an 80 ms RTT path with 100 MB socket buffer and no hystart:

i7test7% nuttcp -n8g -i1 -w100m 192.168.1.18
  103.8125 MB /   1.00 sec =  870.7945 Mbps     0 retrans
 1148.4375 MB /   1.00 sec = 9633.6860 Mbps     0 retrans
 1176.9375 MB /   1.00 sec = 9872.7291 Mbps     0 retrans
 1088.1250 MB /   1.00 sec = 9127.4342 Mbps    39 retrans
  171.0625 MB /   1.00 sec = 1435.1370 Mbps     0 retrans
  901.0625 MB /   1.00 sec = 7558.3275 Mbps     0 retrans
 1160.0625 MB /   1.00 sec = 9731.1831 Mbps     0 retrans
 1172.5625 MB /   1.00 sec = 9836.5508 Mbps     0 retrans
 1085.0625 MB /   1.00 sec = 9101.2174 Mbps    31 retrans
  150.3750 MB /   1.00 sec = 1261.5908 Mbps     2 retrans
   28.1875 MB /   1.00 sec =  236.4544 Mbps     0 retrans

 8192.0000 MB /  11.31 sec = 6077.0651 Mbps 14 %TX 29 %RX 72 retrans 80.82 msRTT

As in the HZ=1000 case, disabling hystart on a large-RTT path does not
seem to play nice with a manually specified socket buffer, resulting in
TCP retransmissions that limit the effective network performance.
Performance seems similar to the HZ=1000 case.
This is a repeatable phenomenon, but it didn't seem quite as variable
as in the HZ=1000 case (though a larger number of repetitions would
probably be needed to draw any firm conclusions about that).

i7test7% nuttcp -n8g -i1 -w100m 192.168.1.18
  103.4375 MB /   1.00 sec =  867.6472 Mbps     0 retrans
 1143.0625 MB /   1.00 sec = 9589.1347 Mbps     0 retrans
  629.4375 MB /   1.00 sec = 5280.0886 Mbps    24 retrans
  164.8750 MB /   1.00 sec = 1383.0759 Mbps     0 retrans
 1121.6250 MB /   1.00 sec = 9408.7878 Mbps     0 retrans
 1168.1250 MB /   1.00 sec = 9799.0309 Mbps     0 retrans
 1167.5000 MB /   1.00 sec = 9793.5725 Mbps     0 retrans
 1165.9375 MB /   1.00 sec = 9780.0841 Mbps     0 retrans
  959.8750 MB /   1.00 sec = 8052.4902 Mbps     9 retrans
  568.1250 MB /   1.00 sec = 4765.8065 Mbps     0 retrans

 8192.0000 MB /  10.03 sec = 6852.2803 Mbps 13 %TX 32 %RX 33 retrans 80.81 msRTT

And:

i7test7% nuttcp -n8g -i1 -w100m 192.168.1.18
  103.8125 MB /   1.00 sec =  870.8241 Mbps     0 retrans
 1148.8125 MB /   1.00 sec = 9636.9570 Mbps     0 retrans
 1177.3750 MB /   1.00 sec = 9876.4287 Mbps     0 retrans
 1177.4375 MB /   1.00 sec = 9877.0024 Mbps     0 retrans
  693.5000 MB /   1.00 sec = 5817.6335 Mbps    36 retrans
  263.4375 MB /   1.00 sec = 2209.7701 Mbps     0 retrans
 1137.3125 MB /   1.00 sec = 9540.7263 Mbps     0 retrans
 1169.9375 MB /   1.00 sec = 9814.2354 Mbps     0 retrans
 1168.6875 MB /   1.00 sec = 9803.7005 Mbps     0 retrans

 8192.0000 MB /   9.21 sec = 7460.8789 Mbps 14 %TX 34 %RX 36 retrans 80.81 msRTT

Re-enabling hystart immediately gives a clean test with no TCP retrans.
i7test7% nuttcp -n8g -i1 -w100m 192.168.1.18
  103.8125 MB /   1.00 sec =  870.8075 Mbps     0 retrans
 1072.3125 MB /   1.00 sec = 8995.0653 Mbps     0 retrans
 1089.4375 MB /   1.00 sec = 9139.0926 Mbps     0 retrans
 1093.1875 MB /   1.00 sec = 9170.0646 Mbps     0 retrans
 1095.5625 MB /   1.00 sec = 9190.3914 Mbps     0 retrans
 1095.5000 MB /   1.00 sec = 9189.8303 Mbps     0 retrans
 1097.6875 MB /   1.00 sec = 9207.8952 Mbps     0 retrans
 1104.1875 MB /   1.00 sec = 9262.5405 Mbps     0 retrans

 8192.0000 MB /   8.48 sec = 8104.4831 Mbps 15 %TX 38 %RX 0 retrans 80.77 msRTT

						-Bill

Previous HZ=1000 tests (with tcp_timestamps disabled):

Here are some tests I performed across real networks, where congestion
is generally not an issue, with a 2.6.35 kernel on the transmit side.

8 GB transfer across an 18 ms RTT path with autotuning and hystart:

i7test7% nuttcp -n8g -i1 192.168.1.23
  517.9375 MB /   1.00 sec = 4344.6096 Mbps     0 retrans
  688.4375 MB /   1.00 sec = 5775.1998 Mbps     0 retrans
  692.9375 MB /   1.00 sec = 5812.7462 Mbps     0 retrans
  698.0625 MB /   1.00 sec = 5855.8078 Mbps     0 retrans
  699.8750 MB /   1.00 sec = 5871.0123 Mbps     0 retrans
  710.5625 MB /   1.00 sec = 5960.5707 Mbps     0 retrans
  728.8125 MB /   1.00 sec = 6113.7652 Mbps     0 retrans
  751.3750 MB /   1.00 sec = 6302.9210 Mbps     0 retrans
  783.8750 MB /   1.00 sec = 6575.6201 Mbps     0 retrans
  825.1875 MB /   1.00 sec = 6921.8145 Mbps     0 retrans
  875.4375 MB /   1.00 sec = 7343.9811 Mbps     0 retrans

 8192.0000 MB /  11.26 sec = 6102.4718 Mbps 11 %TX 28 %RX 0 retrans 18.92 msRTT

Ramps up quickly to a little under 6 Gbps, then increases more slowly
to 7+ Gbps, with no TCP retransmissions.
8 GB transfer across an 18 ms RTT path with 40 MB socket buffer and hystart:

i7test7% nuttcp -n8g -w40m -i1 192.168.1.23
  970.0625 MB /   1.00 sec = 8136.8475 Mbps     0 retrans
 1181.1875 MB /   1.00 sec = 9909.0045 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9908.6369 Mbps     0 retrans
 1181.3125 MB /   1.00 sec = 9909.8747 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9909.0531 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9908.8153 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9909.0729 Mbps     0 retrans

 8192.0000 MB /   7.13 sec = 9633.5814 Mbps 17 %TX 42 %RX 0 retrans 18.91 msRTT

Quickly ramps up to full 10-GigE line rate, with no TCP retrans.

8 GB transfer across an 18 ms RTT path with autotuning and no hystart:

i7test7% nuttcp -n8g -i1 192.168.1.23
  845.4375 MB /   1.00 sec = 7091.5828 Mbps     0 retrans
 1181.3125 MB /   1.00 sec = 9910.0134 Mbps     0 retrans
 1181.0625 MB /   1.00 sec = 9907.1830 Mbps     0 retrans
 1181.4375 MB /   1.00 sec = 9910.8936 Mbps     0 retrans
 1181.1875 MB /   1.00 sec = 9908.1721 Mbps     0 retrans
 1181.3125 MB /   1.00 sec = 9909.5774 Mbps     0 retrans
 1181.1875 MB /   1.00 sec = 9908.6874 Mbps     0 retrans

 8192.0000 MB /   7.25 sec = 9484.4524 Mbps 18 %TX 41 %RX 0 retrans 18.92 msRTT

Also quickly ramps up to full 10-GigE line rate, with no TCP retrans.

8 GB transfer across an 18 ms RTT path with 40 MB socket buffer and no hystart:

i7test7% nuttcp -n8g -w40m -i1 192.168.1.23
  969.8750 MB /   1.00 sec = 8135.6571 Mbps     0 retrans
 1181.3125 MB /   1.00 sec = 9909.3990 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9908.9342 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9909.4098 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9908.8252 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9909.0630 Mbps     0 retrans
 1181.2500 MB /   1.00 sec = 9909.3504 Mbps     0 retrans

 8192.0000 MB /   7.15 sec = 9611.8053 Mbps 18 %TX 42 %RX 0 retrans 18.95 msRTT

Basically the same as the case with 40 MB socket buffer and hystart enabled.

Now trying the same type of tests across an 80 ms RTT path.
8 GB transfer across an 80 ms RTT path with autotuning and hystart:

i7test7% nuttcp -n8g -i1 192.168.1.18
   11.3125 MB /   1.00 sec =   94.8954 Mbps     0 retrans
  441.5625 MB /   1.00 sec = 3704.1021 Mbps     0 retrans
  687.3750 MB /   1.00 sec = 5765.8657 Mbps     0 retrans
  715.5625 MB /   1.00 sec = 6002.6273 Mbps     0 retrans
  709.9375 MB /   1.00 sec = 5955.5958 Mbps     0 retrans
  691.3125 MB /   1.00 sec = 5799.0626 Mbps     0 retrans
  718.6250 MB /   1.00 sec = 6028.3538 Mbps     0 retrans
  718.0000 MB /   1.00 sec = 6023.0205 Mbps     0 retrans
  704.0000 MB /   1.00 sec = 5905.5387 Mbps     0 retrans
  733.3125 MB /   1.00 sec = 6151.4096 Mbps     0 retrans
  738.8750 MB /   1.00 sec = 6198.2381 Mbps     0 retrans
  731.8750 MB /   1.00 sec = 6139.3695 Mbps     0 retrans

 8192.0000 MB /  12.85 sec = 5348.9677 Mbps 10 %TX 23 %RX 0 retrans 80.81 msRTT

Similar to the 18 ms RTT path, but achieving somewhat lower performance
levels, presumably due to the larger RTT.  Ramps up fairly quickly to a
little under 6 Gbps, then increases more slowly to 6+ Gbps, with no TCP
retransmissions.

8 GB transfer across an 80 ms RTT path with 100 MB socket buffer and hystart:

i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
  103.9375 MB /   1.00 sec =  871.8378 Mbps     0 retrans
 1086.5625 MB /   1.00 sec = 9114.6102 Mbps     0 retrans
 1106.6875 MB /   1.00 sec = 9283.5583 Mbps     0 retrans
 1109.3125 MB /   1.00 sec = 9305.5226 Mbps     0 retrans
 1111.1875 MB /   1.00 sec = 9321.9596 Mbps     0 retrans
 1112.8125 MB /   1.00 sec = 9334.8452 Mbps     0 retrans
 1113.6875 MB /   1.00 sec = 9341.6620 Mbps     0 retrans
 1120.2500 MB /   1.00 sec = 9398.0054 Mbps     0 retrans

 8192.0000 MB /   8.37 sec = 8207.2049 Mbps 16 %TX 38 %RX 0 retrans 80.81 msRTT

Quickly ramps up to 9+ Gbps and then slowly increases further, with no
TCP retrans.
8 GB transfer across an 80 ms RTT path with autotuning and no hystart:

i7test7% nuttcp -n8g -i1 192.168.1.18
   11.2500 MB /   1.00 sec =   94.3703 Mbps     0 retrans
  519.0625 MB /   1.00 sec = 4354.1596 Mbps     0 retrans
  861.2500 MB /   1.00 sec = 7224.7970 Mbps     0 retrans
  871.0000 MB /   1.00 sec = 7306.4191 Mbps     0 retrans
  860.7500 MB /   1.00 sec = 7220.4438 Mbps     0 retrans
  869.0625 MB /   1.00 sec = 7290.3340 Mbps     0 retrans
  863.4375 MB /   1.00 sec = 7242.7707 Mbps     0 retrans
  860.4375 MB /   1.00 sec = 7218.0606 Mbps     0 retrans
  875.5000 MB /   1.00 sec = 7344.3071 Mbps     0 retrans
  863.1875 MB /   1.00 sec = 7240.8257 Mbps     0 retrans

 8192.0000 MB /  10.98 sec = 6259.4379 Mbps 12 %TX 27 %RX 0 retrans 80.81 msRTT

Ramps up quickly to 7+ Gbps, then appears to stabilize at that level,
with no TCP retransmissions.  Performance is somewhat better than with
autotuning enabled, but less than using a manually set 100 MB socket
buffer.

8 GB transfer across an 80 ms RTT path with 100 MB socket buffer and no hystart:

i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
  102.8750 MB /   1.00 sec =  862.9487 Mbps     0 retrans
  522.8750 MB /   1.00 sec = 4386.2811 Mbps   414 retrans
  881.5625 MB /   1.00 sec = 7394.6534 Mbps     0 retrans
 1164.3125 MB /   1.00 sec = 9766.6682 Mbps     0 retrans
 1170.5625 MB /   1.00 sec = 9819.7042 Mbps     0 retrans
 1166.8125 MB /   1.00 sec = 9788.2067 Mbps     0 retrans
 1159.8750 MB /   1.00 sec = 9729.1530 Mbps     0 retrans
  811.1250 MB /   1.00 sec = 6804.8017 Mbps    21 retrans
   73.2500 MB /   1.00 sec =  614.4674 Mbps     0 retrans
  884.6250 MB /   1.00 sec = 7420.2900 Mbps     0 retrans

 8192.0000 MB /  10.34 sec = 6647.9394 Mbps 13 %TX 31 %RX 435 retrans 80.81 msRTT

Disabling hystart on a large-RTT path does not seem to play nice with a
manually specified socket buffer, resulting in TCP retransmissions that
limit the effective network performance.

This is a repeatable but extremely variable phenomenon.
i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
  103.7500 MB /   1.00 sec =  870.3015 Mbps     0 retrans
 1146.3750 MB /   1.00 sec = 9616.4520 Mbps     0 retrans
 1175.9375 MB /   1.00 sec = 9864.6070 Mbps     0 retrans
  615.6875 MB /   1.00 sec = 5164.7353 Mbps    21 retrans
  139.2500 MB /   1.00 sec = 1168.1253 Mbps     0 retrans
 1090.0625 MB /   1.00 sec = 9143.8053 Mbps     0 retrans
 1170.4375 MB /   1.00 sec = 9818.6654 Mbps     0 retrans
 1174.5625 MB /   1.00 sec = 9852.8754 Mbps     0 retrans
 1174.8750 MB /   1.00 sec = 9855.6052 Mbps     0 retrans

 8192.0000 MB /   9.42 sec = 7292.9879 Mbps 14 %TX 34 %RX 21 retrans 80.81 msRTT

And:

i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
  102.8125 MB /   1.00 sec =  862.4227 Mbps     0 retrans
 1148.4375 MB /   1.00 sec = 9633.6860 Mbps     0 retrans
 1177.4375 MB /   1.00 sec = 9877.3086 Mbps     0 retrans
 1168.1250 MB /   1.00 sec = 9798.9133 Mbps    11 retrans
  133.1250 MB /   1.00 sec = 1116.7457 Mbps     0 retrans
  479.8750 MB /   1.00 sec = 4025.4631 Mbps     0 retrans
 1150.6875 MB /   1.00 sec = 9652.4830 Mbps     0 retrans
 1177.3125 MB /   1.00 sec = 9876.0624 Mbps     0 retrans
 1177.3750 MB /   1.00 sec = 9876.0139 Mbps     0 retrans
  320.2500 MB /   1.00 sec = 2686.6452 Mbps    19 retrans
   64.9375 MB /   1.00 sec =  544.7363 Mbps     0 retrans
   73.6250 MB /   1.00 sec =  617.6113 Mbps     0 retrans

 8192.0000 MB /  12.39 sec = 5545.7570 Mbps 12 %TX 26 %RX 30 retrans 80.80 msRTT

Re-enabling hystart immediately gives a clean test with no TCP retrans.

i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
  103.8750 MB /   1.00 sec =  871.3353 Mbps     0 retrans
 1086.7500 MB /   1.00 sec = 9116.4474 Mbps     0 retrans
 1105.8125 MB /   1.00 sec = 9276.2276 Mbps     0 retrans
 1109.4375 MB /   1.00 sec = 9306.5339 Mbps     0 retrans
 1111.3125 MB /   1.00 sec = 9322.5327 Mbps     0 retrans
 1111.3750 MB /   1.00 sec = 9322.8053 Mbps     0 retrans
 1113.7500 MB /   1.00 sec = 9342.8962 Mbps     0 retrans
 1120.3125 MB /   1.00 sec = 9397.5711 Mbps     0 retrans

 8192.0000 MB /   8.38 sec = 8204.8394 Mbps 16 %TX 39 %RX 0 retrans 80.80 msRTT

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-10  5:24       ` Bill Fink
  2011-03-10  6:17         ` Stephen Hemminger
@ 2011-03-10 14:37         ` Injong Rhee
  1 sibling, 0 replies; 27+ messages in thread
From: Injong Rhee @ 2011-03-10 14:37 UTC (permalink / raw)
  To: Bill Fink
  Cc: Lucas Nussbaum, Stephen Hemminger, David Miller, xiyou.wangcong,
	netdev, sangtae.ha

This is a good example of why I think the problem is in the
implementation.  The original idea is sound.  The tests where Lucas
reports problems (fat pipes with only a small number of flows) are
exactly the ones where hystart should perform very well.  If you have
many flows, then leaving slow start early (even if by mistake) is
easily covered by the cubic growth function in congestion avoidance.
We need to look into the issue of the HZ setting and other
implementation issues, and run more extensive tests.

On 3/10/11 12:24 AM, Bill Fink wrote:
> On Wed, 9 Mar 2011, Lucas Nussbaum wrote:
>
>> On 08/03/11 at 20:30 -0500, Injong Rhee wrote:
>>> Now, both tools can be wrong. But that is not catastrophic since
>>> congestion avoidance can kick in to save the day. In a pipe where no
>>> other flows are competing, then exiting slow start too early can
>>> slow things down as the window can be still too small. But that is
>>> in fact when delays are most reliable. So those tests that say bad
>>> performance with hystart are in fact, where hystart is supposed to
>>> perform well.
>> Hi,
>>
>> In my setup, there is no congestion at all (except the buffer bloat).
>> Without Hystart, transferring 8 Gb of data takes 9s, with CUBIC exiting
>> slow start at ~2000 packets.
>> With Hystart, transferring 8 Gb of data takes 19s, with CUBIC exiting
>> slow start at ~20 packets.
>> I don't think that this is "hystart performing well". We could just as
>> well remove slow start completely, and only do congestion avoidance,
>> then.
>>
>> While I see the value in Hystart, it's clear that there are some flaws
>> in the current implementation.
>> It probably makes sense to disable
>> hystart by default until those problems are fixed.

> Here are some tests I performed across real networks, where
> congestion is generally not an issue, with a 2.6.35 kernel on
> the transmit side.
>
> 8 GB transfer across an 18 ms RTT path with autotuning and hystart:
>
> i7test7% nuttcp -n8g -i1 192.168.1.23
>   517.9375 MB /   1.00 sec = 4344.6096 Mbps     0 retrans
>   688.4375 MB /   1.00 sec = 5775.1998 Mbps     0 retrans
>   692.9375 MB /   1.00 sec = 5812.7462 Mbps     0 retrans
>   698.0625 MB /   1.00 sec = 5855.8078 Mbps     0 retrans
>   699.8750 MB /   1.00 sec = 5871.0123 Mbps     0 retrans
>   710.5625 MB /   1.00 sec = 5960.5707 Mbps     0 retrans
>   728.8125 MB /   1.00 sec = 6113.7652 Mbps     0 retrans
>   751.3750 MB /   1.00 sec = 6302.9210 Mbps     0 retrans
>   783.8750 MB /   1.00 sec = 6575.6201 Mbps     0 retrans
>   825.1875 MB /   1.00 sec = 6921.8145 Mbps     0 retrans
>   875.4375 MB /   1.00 sec = 7343.9811 Mbps     0 retrans
>
>  8192.0000 MB /  11.26 sec = 6102.4718 Mbps 11 %TX 28 %RX 0 retrans 18.92 msRTT
>
> Ramps up quickly to a little under 6 Gbps, then increases more
> slowly to 7+ Gbps, with no TCP retransmissions.
>
> 8 GB transfer across an 18 ms RTT path with 40 MB socket buffer and hystart:
>
> i7test7% nuttcp -n8g -w40m -i1 192.168.1.23
>   970.0625 MB /   1.00 sec = 8136.8475 Mbps     0 retrans
>  1181.1875 MB /   1.00 sec = 9909.0045 Mbps     0 retrans
>  1181.2500 MB /   1.00 sec = 9908.6369 Mbps     0 retrans
>  1181.3125 MB /   1.00 sec = 9909.8747 Mbps     0 retrans
>  1181.2500 MB /   1.00 sec = 9909.0531 Mbps     0 retrans
>  1181.2500 MB /   1.00 sec = 9908.8153 Mbps     0 retrans
>  1181.2500 MB /   1.00 sec = 9909.0729 Mbps     0 retrans
>
>  8192.0000 MB /   7.13 sec = 9633.5814 Mbps 17 %TX 42 %RX 0 retrans 18.91 msRTT
>
> Quickly ramps up to full 10-GigE line rate, with no TCP retrans.
>
> 8 GB transfer across an 18 ms RTT path with autotuning and no hystart:
>
> i7test7% nuttcp -n8g -i1 192.168.1.23
>   845.4375 MB /   1.00 sec = 7091.5828 Mbps     0 retrans
>  1181.3125 MB /   1.00 sec = 9910.0134 Mbps     0 retrans
>  1181.0625 MB /   1.00 sec = 9907.1830 Mbps     0 retrans
>  1181.4375 MB /   1.00 sec = 9910.8936 Mbps     0 retrans
>  1181.1875 MB /   1.00 sec = 9908.1721 Mbps     0 retrans
>  1181.3125 MB /   1.00 sec = 9909.5774 Mbps     0 retrans
>  1181.1875 MB /   1.00 sec = 9908.6874 Mbps     0 retrans
>
>  8192.0000 MB /   7.25 sec = 9484.4524 Mbps 18 %TX 41 %RX 0 retrans 18.92 msRTT
>
> Also quickly ramps up to full 10-GigE line rate, with no TCP retrans.
>
> 8 GB transfer across an 18 ms RTT path with 40 MB socket buffer and no hystart:
>
> i7test7% nuttcp -n8g -w40m -i1 192.168.1.23
>   969.8750 MB /   1.00 sec = 8135.6571 Mbps     0 retrans
>  1181.3125 MB /   1.00 sec = 9909.3990 Mbps     0 retrans
>  1181.2500 MB /   1.00 sec = 9908.9342 Mbps     0 retrans
>  1181.2500 MB /   1.00 sec = 9909.4098 Mbps     0 retrans
>  1181.2500 MB /   1.00 sec = 9908.8252 Mbps     0 retrans
>  1181.2500 MB /   1.00 sec = 9909.0630 Mbps     0 retrans
>  1181.2500 MB /   1.00 sec = 9909.3504 Mbps     0 retrans
>
>  8192.0000 MB /   7.15 sec = 9611.8053 Mbps 18 %TX 42 %RX 0 retrans 18.95 msRTT
>
> Basically the same as the case with 40 MB socket buffer and hystart enabled.
>
> Now trying the same type of tests across an 80 ms RTT path.
>
> 8 GB transfer across an 80 ms RTT path with autotuning and hystart:
>
> i7test7% nuttcp -n8g -i1 192.168.1.18
>    11.3125 MB /   1.00 sec =   94.8954 Mbps     0 retrans
>   441.5625 MB /   1.00 sec = 3704.1021 Mbps     0 retrans
>   687.3750 MB /   1.00 sec = 5765.8657 Mbps     0 retrans
>   715.5625 MB /   1.00 sec = 6002.6273 Mbps     0 retrans
>   709.9375 MB /   1.00 sec = 5955.5958 Mbps     0 retrans
>   691.3125 MB /   1.00 sec = 5799.0626 Mbps     0 retrans
>   718.6250 MB /   1.00 sec = 6028.3538 Mbps     0 retrans
>   718.0000 MB /   1.00 sec = 6023.0205 Mbps     0 retrans
>   704.0000 MB /   1.00 sec = 5905.5387 Mbps     0 retrans
>   733.3125 MB /   1.00 sec = 6151.4096 Mbps     0 retrans
>   738.8750 MB /   1.00 sec = 6198.2381 Mbps     0 retrans
>   731.8750 MB /   1.00 sec = 6139.3695 Mbps     0 retrans
>
>  8192.0000 MB /  12.85 sec = 5348.9677 Mbps 10 %TX 23 %RX 0 retrans 80.81 msRTT
>
> Similar to the 20 ms RTT path, but achieving somewhat lower
> performance levels, presumably due to the larger RTT.  Ramps
> up fairly quickly to a little under 6 Gbps, then increases
> more slowly to 6+ Gbps, with no TCP retransmissions.
>
> 8 GB transfer across an 80 ms RTT path with 100 MB socket buffer and hystart:
>
> i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
>   103.9375 MB /   1.00 sec =  871.8378 Mbps     0 retrans
>  1086.5625 MB /   1.00 sec = 9114.6102 Mbps     0 retrans
>  1106.6875 MB /   1.00 sec = 9283.5583 Mbps     0 retrans
>  1109.3125 MB /   1.00 sec = 9305.5226 Mbps     0 retrans
>  1111.1875 MB /   1.00 sec = 9321.9596 Mbps     0 retrans
>  1112.8125 MB /   1.00 sec = 9334.8452 Mbps     0 retrans
>  1113.6875 MB /   1.00 sec = 9341.6620 Mbps     0 retrans
>  1120.2500 MB /   1.00 sec = 9398.0054 Mbps     0 retrans
>
>  8192.0000 MB /   8.37 sec = 8207.2049 Mbps 16 %TX 38 %RX 0 retrans 80.81 msRTT
>
> Quickly ramps up to 9+ Gbps and then slowly increases further,
> with no TCP retrans.
>
> 8 GB transfer across an 80 ms RTT path with autotuning and no hystart:
>
> i7test7% nuttcp -n8g -i1 192.168.1.18
>    11.2500 MB /   1.00 sec =   94.3703 Mbps     0 retrans
>   519.0625 MB /   1.00 sec = 4354.1596 Mbps     0 retrans
>   861.2500 MB /   1.00 sec = 7224.7970 Mbps     0 retrans
>   871.0000 MB /   1.00 sec = 7306.4191 Mbps     0 retrans
>   860.7500 MB /   1.00 sec = 7220.4438 Mbps     0 retrans
>   869.0625 MB /   1.00 sec = 7290.3340 Mbps     0 retrans
>   863.4375 MB /   1.00 sec = 7242.7707 Mbps     0 retrans
>   860.4375 MB /   1.00 sec = 7218.0606 Mbps     0 retrans
>   875.5000 MB /   1.00 sec = 7344.3071 Mbps     0 retrans
>   863.1875 MB /   1.00 sec = 7240.8257 Mbps     0 retrans
>
>  8192.0000 MB /  10.98 sec = 6259.4379 Mbps 12 %TX 27 %RX 0 retrans 80.81 msRTT
>
> Ramps up quickly to 7+ Gbps, then appears to stabilize at that
> level, with no TCP retransmissions.  Performance is somewhat
> better than with autotuning enabled, but less than using a
> manually set 100 MB socket buffer.
>
> 8 GB transfer across an 80 ms RTT path with 100 MB socket buffer and no hystart:
>
> i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
>   102.8750 MB /   1.00 sec =  862.9487 Mbps     0 retrans
>   522.8750 MB /   1.00 sec = 4386.2811 Mbps   414 retrans
>   881.5625 MB /   1.00 sec = 7394.6534 Mbps     0 retrans
>  1164.3125 MB /   1.00 sec = 9766.6682 Mbps     0 retrans
>  1170.5625 MB /   1.00 sec = 9819.7042 Mbps     0 retrans
>  1166.8125 MB /   1.00 sec = 9788.2067 Mbps     0 retrans
>  1159.8750 MB /   1.00 sec = 9729.1530 Mbps     0 retrans
>   811.1250 MB /   1.00 sec = 6804.8017 Mbps    21 retrans
>    73.2500 MB /   1.00 sec =  614.4674 Mbps     0 retrans
>   884.6250 MB /   1.00 sec = 7420.2900 Mbps     0 retrans
>
>  8192.0000 MB /  10.34 sec = 6647.9394 Mbps 13 %TX 31 %RX 435 retrans 80.81 msRTT
>
> Disabling hystart on a large RTT path does not seem to play nice with
> a manually specified socket buffer, resulting in TCP retransmissions
> that limit the effective network performance.
>
> This is a repeatable but extremely variable phenomenon.
>
> i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
>   103.7500 MB /   1.00 sec =  870.3015 Mbps     0 retrans
>  1146.3750 MB /   1.00 sec = 9616.4520 Mbps     0 retrans
>  1175.9375 MB /   1.00 sec = 9864.6070 Mbps     0 retrans
>   615.6875 MB /   1.00 sec = 5164.7353 Mbps    21 retrans
>   139.2500 MB /   1.00 sec = 1168.1253 Mbps     0 retrans
>  1090.0625 MB /   1.00 sec = 9143.8053 Mbps     0 retrans
>  1170.4375 MB /   1.00 sec = 9818.6654 Mbps     0 retrans
>  1174.5625 MB /   1.00 sec = 9852.8754 Mbps     0 retrans
>  1174.8750 MB /   1.00 sec = 9855.6052 Mbps     0 retrans
>
>  8192.0000 MB /   9.42 sec = 7292.9879 Mbps 14 %TX 34 %RX 21 retrans 80.81 msRTT
>
> And:
>
> i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
>   102.8125 MB /   1.00 sec =  862.4227 Mbps     0 retrans
>  1148.4375 MB /   1.00 sec = 9633.6860 Mbps     0 retrans
>  1177.4375 MB /   1.00 sec = 9877.3086 Mbps     0 retrans
>  1168.1250 MB /   1.00 sec = 9798.9133 Mbps    11 retrans
>   133.1250 MB /   1.00 sec = 1116.7457 Mbps     0 retrans
>   479.8750 MB /   1.00 sec = 4025.4631 Mbps     0 retrans
>  1150.6875 MB /   1.00 sec = 9652.4830 Mbps     0 retrans
>  1177.3125 MB /   1.00 sec = 9876.0624 Mbps     0 retrans
>  1177.3750 MB /   1.00 sec = 9876.0139 Mbps     0 retrans
>   320.2500 MB /   1.00 sec = 2686.6452 Mbps    19 retrans
>    64.9375 MB /   1.00 sec =  544.7363 Mbps     0 retrans
>    73.6250 MB /   1.00 sec =  617.6113 Mbps     0 retrans
>
>  8192.0000 MB /  12.39 sec = 5545.7570 Mbps 12 %TX 26 %RX 30 retrans 80.80 msRTT
>
> Re-enabling hystart immediately gives a clean test with no TCP retrans.
>
> i7test7% nuttcp -n8g -w100m -i1 192.168.1.18
>   103.8750 MB /   1.00 sec =  871.3353 Mbps     0 retrans
>  1086.7500 MB /   1.00 sec = 9116.4474 Mbps     0 retrans
>  1105.8125 MB /   1.00 sec = 9276.2276 Mbps     0 retrans
>  1109.4375 MB /   1.00 sec = 9306.5339 Mbps     0 retrans
>  1111.3125 MB /   1.00 sec = 9322.5327 Mbps     0 retrans
>  1111.3750 MB /   1.00 sec = 9322.8053 Mbps     0 retrans
>  1113.7500 MB /   1.00 sec = 9342.8962 Mbps     0 retrans
>  1120.3125 MB /   1.00 sec = 9397.5711 Mbps     0 retrans
>
>  8192.0000 MB /   8.38 sec = 8204.8394 Mbps 16 %TX 39 %RX 0 retrans 80.80 msRTT
>
> 						-Bill

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-08 23:21   ` Stephen Hemminger
  2011-03-09  1:30     ` Injong Rhee
@ 2011-03-09  1:33     ` Sangtae Ha
  1 sibling, 0 replies; 27+ messages in thread
From: Sangtae Ha @ 2011-03-09  1:33 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, rhee, lucas.nussbaum, xiyou.wangcong, netdev

Hi Stephen,

Thank you for your feedback.  Please see my answers below.

On Tue, Mar 8, 2011 at 6:21 PM, Stephen Hemminger <shemminger@vyatta.com> wrote:
> On Tue, 08 Mar 2011 11:43:46 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
>
>> From: Injong Rhee <rhee@ncsu.edu>
>> Date: Tue, 08 Mar 2011 10:26:36 -0500
>>
>> > Thanks for updating CUBIC hystart. You might want to test the
>> > cases with more background traffic and verify whether this
>> > threshold is too conservative.
>>
>> So let's get down to basics.
>>
>> What does Hystart do specially that allows it to avoid all of the
>> problems that TCP VEGAS runs into.
>>
>> Specifically, that if you use RTTs to make congestion control
>> decisions it is impossible to notice new bandwidth becoming available
>> fast enough.
>>
>> Again, it's impossible to react fast enough. No matter what you tweak
>> all of your various settings to, this problem will still exist.
>>
>> This is a core issue, you cannot get around it.
>>
>> This is why I feel that Hystart is fundamentally flawed and we should
>> turn it off by default if not flat-out remove it.
>>
>> Distributions are turning it off by default already, therefore it's
>> stupid for the upstream kernel to behave differently if that's what
>> 99% of the world is going to end up experiencing.
>
> The assumption in Hystart that spacing between ACKs is solely due to
> congestion is a bad one. If you read the paper, this is why FreeBSD's
> estimation logic is dismissed. The Hystart problem is different
> than the Vegas issue.
>
> Algorithms that look at min RTT are ok, since the lower bound is
> fixed; additional queuing and variation in the network only increases
> RTT, it never reduces it. With a min RTT it is possible to compute an
> upper bound on available bandwidth, i.e. if all packets were as good
> as this minRTT estimate, then the available bandwidth is X. But then
> using an individual RTT sample to estimate unused bandwidth is flawed.
> To quote the paper:
>
> "Thus, by checking whether ∆(N) is larger than Dmin, we
> can detect whether cwnd has reached the available capacity
> of the path"
>
> So what goes wrong:
>
> 1. Dmin can be too large because this connection always sees delays
> due to other traffic or hardware, i.e. buffer bloat. This would cause
> the bandwidth estimate to be too low and therefore TCP would leave
> slow start too early (and not get up to full bandwidth).

This is true.  But the idea is that running the congestion avoidance
algorithm of CUBIC in this case is better than hurting other flows
with abrupt perturbation, since the growth of CUBIC is quite responsive
and grabs the bandwidth quickly in normal network conditions.

> 2. Dmin can be smaller than the clock resolution. This would cause
> either the sample to be ignored, or Dmin to be zero. If Dmin is zero,
> the bandwidth estimate would in theory be infinite, which would
> lead to TCP not leaving slow start because of Hystart. Instead
> TCP would leave slow start at first loss.

True.  But since HyStart doesn't clamp the threshold, ca->delay_min>>4,
it can prematurely leave slow start for a very small Dmin.  I think
this needs to be fixed, along with the hard-coded 2 ms you mentioned
below.

> Other possible problems:
>
> 3. ACKs could be nudged together by variations in delay, causing
> HyStart to falsely think it is seeing an ACK train, and so to
> exit slow start prematurely.
This doesn't happen when the delay is not too small (on typical WAN
paths, including DSL), but it is possible with very small delays, since
the code checking for a valid ACK train uses a fixed 2 ms value, which
is too large for a LAN.

> Noise in the network is not catastrophic, it just
> causes TCP to exit slow start early and have to go into the normal
> window growth phase. The problem is that the original non-Hystart
> behavior of Cubic is unfair; the first flow dominates the link
> and other flows are unable to get in. If you run tests with two
> flows, one will get a larger share of the bandwidth.
>
> I think Hystart is okay in concept but there may be issues
> on low RTT links as well as other corner cases that need bug
> fixing.

We do not use the delay as an indication of congestion, but to improve
stability and overall performance.  Preventing burst losses helps
quite a bit on mid to large BDP paths, and the performance results
with non-TCP-SACK receivers are also encouraging.

I will work on fixes for the issues below.

> 1. Needs to use better resolution than HZ. Since HZ can be 100.
> 2. Hardcoding 2ms as spacing between ACK's as train is wrong
>    for local networks.
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread
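For reference, the two exit heuristics debated above can be sketched in user space. This is a simplified model, not the kernel code: jiffies stand in for real time, the 2-jiffy train gap approximates the hard-coded 2 ms, and delay samples use the kernel's 8x fixed-point RTT representation, so `delay_min >> 2` and `delay_min << 1` both correspond to the 2*minRTT thresholds of the proposed patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of CUBIC Hystart's two exit heuristics
 * with the patched thresholds.  In the kernel, delay samples are kept
 * in a fixed-point form that is 8x the RTT, so delay_min >> 2 equals
 * 2*minRTT, and curr_rtt > delay_min << 1 also tests against 2*minRTT
 * in the same 8x units. */

#define HYSTART_ACK_TRAIN   1
#define HYSTART_DELAY       2
#define HYSTART_MIN_SAMPLES 8

struct hystart {
    uint32_t delay_min;    /* min RTT, 8x fixed point */
    uint32_t curr_rtt;     /* max RTT seen this round, 8x fixed point */
    uint32_t round_start;  /* jiffies when the current round began */
    uint32_t last_jiffies; /* jiffies of the previous ACK in a train */
    uint32_t sample_cnt;   /* delay samples taken this round */
    int found;             /* which heuristic(s) fired */
};

/* now: current jiffies; gap_jiffies: time since the previous ACK;
 * delay: this ACK's RTT sample, 8x fixed point */
static void hystart_update(struct hystart *ca, uint32_t now,
                           uint32_t gap_jiffies, uint32_t delay)
{
    /* 1) ACK-train detection: consecutive ACKs closer together than
     * ~2 ms extend a train; exit once the train has spanned 2*minRTT
     * (delay_min >> 2) since the start of the round. */
    if (gap_jiffies <= 2) {            /* 2 jiffies stands in for 2 ms */
        ca->last_jiffies = now;
        if (now - ca->round_start >= (ca->delay_min >> 2))
            ca->found |= HYSTART_ACK_TRAIN;
    }

    /* 2) Delay-increase detection: track the max RTT over the first
     * samples of a round; exit once it doubles versus delay_min. */
    if (ca->sample_cnt < HYSTART_MIN_SAMPLES) {
        if (ca->curr_rtt == 0 || ca->curr_rtt < delay)
            ca->curr_rtt = delay;
        ca->sample_cnt++;
    } else if (ca->curr_rtt > (ca->delay_min << 1)) {
        ca->found |= HYSTART_DELAY;
    }
}
```

With, say, an 11-jiffy minRTT (`delay_min = 88` in 8x units), a back-to-back ACK stream trips the train check once 22 jiffies have elapsed since round start, while RTT samples above 176 (i.e. above 2*minRTT) trip the delay check after the first 8 samples.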
[parent not found: <AANLkTimdpEKHfVKw+bm6OnymcnUrauU+jGOPeLzy3Q0o@mail.gmail.com>]
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
       [not found] ` <AANLkTimdpEKHfVKw+bm6OnymcnUrauU+jGOPeLzy3Q0o@mail.gmail.com>
@ 2011-03-08 18:14   ` Lucas Nussbaum
  0 siblings, 0 replies; 27+ messages in thread
From: Lucas Nussbaum @ 2011-03-08 18:14 UTC (permalink / raw)
  To: Sangtae Ha; +Cc: WANG Cong, Injong Rhee, Netdev

On 08/03/11 at 11:43 -0500, Sangtae Ha wrote:
> Hi Lucas,
>
> The current packet-train threshold and the delay threshold have been
> tested with the bandwidth ranging from 10M to 400M, the RTT from 10ms
> to 320ms, and the buffer size from 10% BDP to 200% BDP and they were
> set conservatively to make it work over the network with very small
> buffer sizes. I will recreate your setup and check whether the current
> thresholds are too conservative and will come up with the patch.

I'm surprised.  It's possible that a seemingly unrelated change broke
it, but it was already broken for me on 2.6.32.

I can provide access to the testbed if you want to run tests on it.

-- 
| Lucas Nussbaum                 MCF Université Nancy 2 |
| lucas.nussbaum@loria.fr        LORIA / AlGorille      |
| http://www.loria.fr/~lnussbau/ +33 3 54 95 86 19      |

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-08  9:32 [PATCH] Make CUBIC Hystart more robust to RTT variations Lucas Nussbaum
  2011-03-08 10:21 ` WANG Cong
@ 2011-03-10 23:28 ` Stephen Hemminger
  2011-03-11  5:59   ` Lucas Nussbaum
  1 sibling, 1 reply; 27+ messages in thread
From: Stephen Hemminger @ 2011-03-10 23:28 UTC (permalink / raw)
  To: Lucas Nussbaum; +Cc: netdev, Sangtae Ha

On Tue, 8 Mar 2011 10:32:15 +0100
Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote:

> CUBIC Hystart uses two heuristics to exit slow start earlier, before
> losses start to occur. Unfortunately, it tends to exit slow start far
> too early, causing poor performance since convergence to the optimal
> cwnd is then very slow. This was reported in
> http://permalink.gmane.org/gmane.linux.network/188169 and
> https://partner-bugzilla.redhat.com/show_bug.cgi?id=616985

Ignore the RHEL bug. RHEL 5 ships with TCP BIC (not CUBIC) by default. There are many research papers which show that BIC is too aggressive and not fair.
* Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
  2011-03-10 23:28 ` Stephen Hemminger
@ 2011-03-11  5:59   ` Lucas Nussbaum
  0 siblings, 0 replies; 27+ messages in thread
From: Lucas Nussbaum @ 2011-03-11 5:59 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, Sangtae Ha

On 10/03/11 at 15:28 -0800, Stephen Hemminger wrote:
> On Tue, 8 Mar 2011 10:32:15 +0100
> Lucas Nussbaum <lucas.nussbaum@loria.fr> wrote:
>
> > CUBIC Hystart uses two heuristics to exit slow start earlier, before
> > losses start to occur. Unfortunately, it tends to exit slow start
> > far too early, causing poor performance since convergence to the
> > optimal cwnd is then very slow. This was reported in
> > http://permalink.gmane.org/gmane.linux.network/188169 and
> > https://partner-bugzilla.redhat.com/show_bug.cgi?id=616985
>
> Ignore the RHEL bug. RHEL 5 ships with TCP BIC (not CUBIC) by default.
> There are many research papers which show that BIC is too aggressive,
> and not fair.

According to the bug report, the server is running RHEL6 (with CUBIC and Hystart); it's the client that is running RHEL5.

-- 
| Lucas Nussbaum             MCF Université Nancy 2 |
| lucas.nussbaum@loria.fr         LORIA / AlGorille |
| http://www.loria.fr/~lnussbau/  +33 3 54 95 86 19 |
end of thread, other threads: [~2011-03-11 6:02 UTC | newest]

Thread overview: 27+ messages
2011-03-08  9:32 [PATCH] Make CUBIC Hystart more robust to RTT variations Lucas Nussbaum
2011-03-08 10:21 ` WANG Cong
2011-03-08 11:10   ` Lucas Nussbaum
2011-03-08 15:26     ` Injong Rhee
2011-03-08 19:43       ` David Miller
2011-03-08 23:21         ` Stephen Hemminger
2011-03-09  1:30           ` Injong Rhee
2011-03-09  6:53             ` Lucas Nussbaum
2011-03-09 17:56               ` Stephen Hemminger
2011-03-09 18:25                 ` Lucas Nussbaum
2011-03-09 19:56                   ` Stephen Hemminger
2011-03-09 21:28                     ` Lucas Nussbaum
2011-03-09 20:01                   ` Stephen Hemminger
2011-03-09 21:12                     ` Yuchung Cheng
2011-03-09 21:33                       ` Lucas Nussbaum
2011-03-09 21:51                         ` Stephen Hemminger
2011-03-09 22:03                           ` Lucas Nussbaum
2011-03-10  5:24                             ` Bill Fink
2011-03-10  6:17                               ` Stephen Hemminger
2011-03-10  7:17                                 ` Bill Fink
2011-03-10  8:54                                 ` Lucas Nussbaum
2011-03-11  2:25                                   ` Bill Fink
2011-03-10 14:37                             ` Injong Rhee
2011-03-09  1:33 ` Sangtae Ha
[not found] ` <AANLkTimdpEKHfVKw+bm6OnymcnUrauU+jGOPeLzy3Q0o@mail.gmail.com>
2011-03-08 18:14   ` Lucas Nussbaum
2011-03-10 23:28 ` Stephen Hemminger
2011-03-11  5:59   ` Lucas Nussbaum