From: Werner Almesberger
Subject: Re: snd_cwnd drawn and quartered
Date: Tue, 14 Jan 2003 01:01:57 -0300
Sender: netdev-bounce@oss.sgi.com
Message-ID: <20030114010157.M1516@almesberger.net>
References: <20030102030858.E1363@almesberger.net> <200301140012.DAA09790@sex.inr.ac.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: netdev@oss.sgi.com, chengjin@cs.caltech.edu
To: kuznet@ms2.inr.ac.ru
Content-Disposition: inline
In-Reply-To: <200301140012.DAA09790@sex.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Tue, Jan 14, 2003 at 03:12:37AM +0300
Errors-to: netdev-bounce@oss.sgi.com
List-Id: netdev.vger.kernel.org

kuznet@ms2.inr.ac.ru wrote:
> Of course. But draining happens when you received more ACKs than
> you sent packets. When such pathology happens we just have to do
> something, at least to understand when it happens under normal
> conditions.

Yes, that's a separate problem. There are (at least :-) two problems
we're looking at:

 - cwnd getting too small, probably due to "natural causes", and not
   recovering properly
 - cwnd getting reduced too much by the ssthresh/2 test

Cheng has been talking about the first one; I'm looking at the second.

Note that in my case, no retransmissions are lost. There are only some
additional losses in the initial cwnd (actually, just one loss would
be sufficient), which then extend the recovery period.

I went through draft-ratehalving and compared it with what our TCP is
doing. It seems that the test is actually about 50% right :-) Below is
my analysis of the situation. How do you like the idea of adding
another variable? :-)

Here we go:

To analyze whether changing tcp_input.c:tcp_cwnd_down from

	if (decr && tp->snd_cwnd > tp->snd_ssthresh/2)

to

	if (decr && tp->snd_cwnd > tp->snd_ssthresh)

yields valid TCP behaviour, I'm comparing the algorithm specification
in draft-ratehalving [1] with the implementation of Linux TCP. The
goal is to show whether Linux TCP still performs according to
draft-ratehalving after the change.

[1] http://www.psc.edu/networking/ftp/papers/draft-ratehalving.txt

draft-ratehalving generally aims to set cwnd (they call it rhcwnd) to
(prior_cwnd-loss)/2 (section 5.14), where "loss" is the number of
packets lost from the RTT sent before we entered recovery, and
"prior_cwnd" is the cwnd at the time when we began recovery. This is
also explained in section 3.1 of RFC 2581.

For simplicity, let's assume there is no reordering, no ACK loss, and
no jumps in delay.

Without NewReno, there are two cases: if the loss is indicated by an
"exact" means (ECN, SACK), cwnd is reduced by half the distance by
which fack is advanced, plus half the size of any "holes" found via
SACK (5.6). At the end of recovery, cwnd should therefore reach
(prior_cwnd-loss)/2, as specified above.

Still without NewReno, if loss is indicated by a duplicate ACK without
SACK, cwnd is reduced by half a segment for each duplicate ACK
received (4.7). This way, cwnd will shrink to (prior_cwnd+loss)/2.
(No typo - it's "+": the lost segments generate no duplicate ACKs, so
cwnd only decreases by (prior_cwnd-loss)/2.)

With NewReno, the algorithms are the same, but cwnd stops decrementing
at cwnd <= prior_cwnd/2 (5.12). Once we get out of recovery, cwnd gets
set to (prior_cwnd-num_retrans)/2, where num_retrans is the number of
retransmissions in the "repair interval" (4.13). This is effectively
the number of retransmissions we needed to fix the initial loss. I'm
not entirely sure how this changes if we lose a retransmission.
RFC 2581 requires cwnd to be halved twice in this case (4.3).
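
To make the accounting above concrete, here is a toy sketch of the
per-ACK decrement with an explicit lower bound (my own naming, plain
user-space C, not kernel code). With NewReno the bound would be
prior_cwnd/2 (5.12); what tcp_cwnd_down effectively uses today is
snd_ssthresh/2:

	/* Toy model of the draft-ratehalving reduction; not kernel code. */
	struct rh_state {
		unsigned int cwnd;	/* congestion window, in segments */
		unsigned int bound;	/* where the reduction stops */
		int odd;		/* toggles; decrement on every 2nd ACK */
	};

	/* One call per (duplicate) ACK received during recovery.
	 * Decrementing by a whole segment on every second ACK is the
	 * same as the draft's half segment per ACK (4.7), without
	 * having to track fractional segments. */
	static void rh_ack(struct rh_state *s)
	{
		s->odd = !s->odd;
		if (s->odd && s->cwnd > s->bound)
			s->cwnd--;
	}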
At the end (4.14), draft-ratehalving forces the new cwnd below
prior_cwnd/2 (in case we didn't decrement enough, e.g. in the second
"old" Reno case). It also sets ssthresh to the new cwnd, but makes
sure that ssthresh does not drop below prior_cwnd/4, to ensure "that
the TCP connection is not unduly harmed by extreme network conditions"
(5.14, probably meaning reordering).

When entering congestion (tcp_input.c:tcp_fastretrans_alert), Linux
TCP sets ssthresh to roughly half cwnd (tcp.h:tcp_recalc_ssthresh).
Note that this differs from the requirement of setting ssthresh to
half the amount of data in flight.

During recovery, cwnd reduction is done by tcp_input.c:tcp_cwnd_down
as follows: snd_cwnd is decremented for every second (duplicate) ACK,
which corresponds to sections 4.7 and 4.12 of draft-ratehalving,
except that the snd_cwnd reduction stops at snd_ssthresh/2
(corresponding roughly to prior_cwnd/4) instead of prior_cwnd/2.
Additionally, cwnd may be further reduced if there are fewer than cwnd
packets in flight. (This deserves further analysis.)

The equivalent of the first part of 4.14 happens in tcp_complete_cwr:
cwnd is set to the minimum of cwnd and ssthresh, where the latter is
(roughly) prior_cwnd/2.

Raising the cut-off point for the cwnd reduction to ssthresh would
still yield the cwnd decrease described in section 4.7, and the
cut-off would occur at the point described in section 4.12.
Furthermore, at the end of recovery, snd_cwnd is set to prior_cwnd/2,
which is consistent with section 5.14. So far, so good.

Unfortunately, there are two exceptions: a loss outside the cwnd in
which the initial loss occurred (i.e. loss of data above high_seq), or
the loss of a retransmission, is required to cause another halving of
cwnd. A loss above high_seq is detected and handled as a separate loss
after the current loss episode has ended, and therefore does not need
to concern us here. Loss of a retransmission, however, is handled only
implicitly, as follows: it extends the recovery interval to at least
2*RTT. This causes the current implementation of tcp_cwnd_down to
decrement snd_cwnd all the way to ssthresh/2, yielding the correct
result.

The most correct solution therefore seems to be to introduce yet
another TCP state variable, cwnd_bound, that limits how far
tcp_cwnd_down can decrement snd_cwnd. Initially,
tcp_input.c:tcp_clear_retrans and tcp_input.c:tcp_fastretrans_alert
would set cwnd_bound to the reduced snd_ssthresh, limiting the
snd_cwnd reduction to prior_cwnd/2. If
tcp_input.c:tcp_sacktag_write_queue detects the loss of a
retransmission, it sets cwnd_bound to ssthresh/2, allowing reduction
down to prior_cwnd/4.
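
In patch terms, the test in tcp_cwnd_down would then become something
like the following (untested sketch; tp->cwnd_bound is my invention
and doesn't exist anywhere yet):

	/* cwnd_bound is snd_ssthresh while only the original loss is
	 * being repaired, and snd_ssthresh/2 once we know that a
	 * retransmission was lost as well. */
	if (decr && tp->snd_cwnd > tp->cwnd_bound)
		tp->snd_cwnd -= decr;

- Werner

-- 
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/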