From mboxrd@z Thu Jan 1 00:00:00 1970
From: Stephen Hemminger
Subject: Re: [PATCH] Make CUBIC Hystart more robust to RTT variations
Date: Tue, 8 Mar 2011 15:21:03 -0800
Message-ID: <20110308152103.714f5f05@nehalam>
References: <20110308111011.GA27967@xanadu.blop.info> <4D764AAC.30302@ncsu.edu> <20110308.114346.48506864.davem@davemloft.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: rhee@ncsu.edu, lucas.nussbaum@loria.fr, xiyou.wangcong@gmail.com, netdev@vger.kernel.org
To: David Miller
Return-path:
Received: from mail.vyatta.com ([76.74.103.46]:59768 "EHLO mail.vyatta.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756170Ab1CHXVH convert rfc822-to-8bit (ORCPT ); Tue, 8 Mar 2011 18:21:07 -0500
In-Reply-To: <20110308.114346.48506864.davem@davemloft.net>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Tue, 08 Mar 2011 11:43:46 -0800 (PST)
David Miller wrote:

> From: Injong Rhee
> Date: Tue, 08 Mar 2011 10:26:36 -0500
>
> > Thanks for updating CUBIC hystart. You might want to test the
> > cases with more background traffic and verify whether this
> > threshold is too conservative.
>
> So let's get down to basics.
>
> What does Hystart do specially that allows it to avoid all of the
> problems that TCP VEGAS runs into.
>
> Specifically, that if you use RTTs to make congestion control
> decisions it is impossible to notice new bandwidth becomming available
> fast enough.
>
> Again, it's impossible to react fast enough. No matter what you tweak
> all of your various settings to, this problem will still exist.
>
> This is a core issue, you cannot get around it.
>
> This is why I feel that Hystart is fundamentally flawed and we should
> turn it off by default if not flat-out remove it.
>
> Distributions are turning it off by default already, therefore it's
> stupid for the upstream kernel to behave differently if that's what
> %99 of the world is going to end up experiencing.

The assumption in Hystart that spacing between ACKs is solely due to
congestion is a bad one. If you read the paper, this is why FreeBSD's
estimation logic is dismissed.

The Hystart problem is different from the Vegas issue. Algorithms that
look at min RTT are ok, since the lower bound is fixed; additional
queuing and variation in the network only increase RTT, they never
reduce it. With a min RTT it is possible to compute the upper bound on
available bandwidth, i.e. if all packets were as good as this minRTT
estimate then the available bandwidth is X. But then using an
individual RTT sample to estimate unused bandwidth is flawed.

To quote the paper:
  "Thus, by checking whether ∆(N) is larger than Dmin, we can detect
   whether cwnd has reached the available capacity of the path"

So what goes wrong:
 1. Dmin can be too large because this connection always sees delays
    due to other traffic or hardware, i.e. buffer bloat. This would
    cause the bandwidth estimate to be too low, and therefore TCP would
    leave slow start too early (and not get up to full bandwidth).

 2. Dmin can be smaller than the clock resolution. This would cause
    either the sample to be ignored or Dmin to be zero. If Dmin is
    zero, the bandwidth estimate would in theory be infinite, so
    Hystart would never make TCP leave slow start. Instead TCP would
    leave slow start at the first loss.

Other possible problems:
 3. ACKs could be nudged together by variations in delay. Hystart
    would then falsely think it is seeing an ACK train and exit slow
    start prematurely.

Noise in the network is not catastrophic; it just causes TCP to exit
slow start early and fall back to the normal window growth phase.
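A rough illustration of the min RTT bound and of failure cases 1 and 2
above, written in Python rather than kernel C. Everything here is an
assumption made for the example: the function names are hypothetical,
times are in microseconds, and the bound is taken to be cwnd divided by
Dmin. It is not how the kernel computes anything.

```python
import math


def quantize_rtt_us(rtt_us, tick_us):
    """Round an RTT down to a whole number of clock ticks (assumed model)."""
    return (rtt_us // tick_us) * tick_us


def bandwidth_upper_bound(cwnd_bytes, dmin_us):
    """Bytes/second achievable if every packet saw the minimum delay.

    A Dmin of zero yields an unbounded estimate, which is case 2 above.
    """
    if dmin_us <= 0:
        return math.inf
    return cwnd_bytes * 1_000_000 / dmin_us


if __name__ == "__main__":
    cwnd = 100_000  # bytes, arbitrary example value

    # Case 1: the true floor is 10 ms but the flow never sees less than 50 ms.
    print(bandwidth_upper_bound(cwnd, 10_000))  # bound with the true Dmin
    print(bandwidth_upper_bound(cwnd, 50_000))  # bound with a bloated Dmin

    # Case 2: a 300 us RTT measured with a 10 ms tick collapses to zero.
    dmin = quantize_rtt_us(300, 10_000)
    print(dmin, bandwidth_upper_bound(cwnd, dmin))
```

With the bloated Dmin the bound comes out five times lower than with
the true one, and with the quantized Dmin it is unbounded, so a delay
based exit from slow start would never trigger.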

The problem is that the original non-Hystart behavior of Cubic is
unfair; the first flow dominates the link and other flows are unable
to get in. If you run tests with two flows, one will get a larger
share of the bandwidth.

I think Hystart is okay in concept, but there may be issues on low RTT
links as well as other corner cases that need bug fixing.
 1. It needs to use better resolution than HZ, since HZ can be 100
    (a 10ms tick).
 2. Hardcoding 2ms as the spacing between ACKs in a train is wrong
    for local networks.
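
The two items above can be illustrated with a small Python model. This
is only a sketch under stated assumptions: the helper names are
hypothetical, timestamps are in microseconds, and a "train" is simply
taken to be the leading run of ACKs whose gaps do not exceed the
spacing threshold. The actual exit condition is not modeled.

```python
ACK_SPACING_US = 2_000  # the hardcoded 2ms spacing discussed above


def quantize_us(t_us, hz):
    """Timestamp as seen through a clock that ticks hz times per second."""
    tick_us = 1_000_000 // hz
    return (t_us // tick_us) * tick_us


def ack_train_length_us(ack_times_us, spacing_us=ACK_SPACING_US):
    """Duration of the leading run of ACKs with gaps <= spacing_us."""
    if not ack_times_us:
        return 0
    start = last = ack_times_us[0]
    for t in ack_times_us[1:]:
        if t - last > spacing_us:
            break  # gap too large, the train ends here
        last = t
    return last - start


if __name__ == "__main__":
    # LAN with a 500 us RTT: two separate bursts, one per round trip.
    lan_acks = [0, 50, 100, 500, 550, 600]
    print(ack_train_length_us(lan_acks))                  # 2ms threshold
    print(ack_train_length_us(lan_acks, spacing_us=100))  # tighter threshold

    # ACKs really 3 ms apart, timestamped with HZ=100 (10 ms ticks).
    real = [0, 3_000, 6_000, 9_000, 12_000]
    print([quantize_us(t, 100) for t in real])
```

With the 2ms threshold both LAN bursts merge into one 600us train,
while a 100us threshold keeps them apart. With HZ=100 the first four
ACKs all receive the same timestamp, so their real 3ms spacing is
invisible to any check against 2ms.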