From: Rick Jones
Subject: Re: TCP and reordering
Date: Tue, 27 Nov 2012 10:00:58 -0800
Message-ID: <50B4FFDA.1050808@hp.com>
References: <50B4F2DA.8020206@hp.com>
To: Saku Ytti
Cc: netdev@vger.kernel.org

On 11/27/2012 09:15 AM, Saku Ytti wrote:
> On 27 November 2012 19:05, Rick Jones wrote:
>
>> Packet reordering is supposed to be the exception, not the rule.
>> Links which habitually/constantly introduce reordering are, in my
>> opinion, broken. Optimizing for them would be optimizing an error
>> case.
>
> TCP used to be friendly to reordering before fast retransmit
> optimization was implemented.

It remained "friendly" to reordering even after fast retransmit was
implemented - just not to particularly bad reordering. And "friendly"
is somewhat relative: even before fast retransmit came to be, TCP
would immediately ACK each out-of-order segment.

> It seems like minimal complexity in TCP algorithm and would
> dynamically work correctly depending on situation. It is rather slim
> comfort that network should work, when it does not, and you cannot
> affect it.

It is probably considered an "ancient" text these days, but one of the
chapter intros for The Mythical Man-Month includes a quote from Ovid:

    adde parvum parvo magnus acervus erit

which, if recollection serves, the book translated as "add little to
little and soon there will be a big pile."

> But if the complexity is higher than I expect, then I fully agree,
> makes no sense to add it.
> Reason why reordering can happen in modern MPLS network is that you
> have to essentially duck type your traffic, and sometimes you duck
> type them wrong and you are then calculating ECMP on incorrect
> values, causing packets inside flow to take different ports.

I appreciate that one may not always have "access," and that there can
be layer 8 and layer 9 issues involved, but if incorrect typing is the
root cause of the reordering, treating the root cause rather than the
symptom is what should happen. How many kludges, no matter how
angelic, can fit in a TCP implementation?

For other reasons (CPU utilization), various stacks (HP-UX, Solaris,
some versions of MacOS) have had explicit ACK-avoidance heuristics.
They would back off from ack-every-other to ack-every-N, N >> 2. The
heuristics worked quite well in LAN environments and on bulk flows
(e.g. FTP, ttcp, netperf TCP_STREAM), not necessarily as well in other
environments.

One (very necessary) part of the heuristic in those stacks was to go
back to ack-every-other when necessary. That keyed off the standalone
ACK timer - if it ever fired, the current avoidance count would go
back to 2, and the maximum allowed would be half what it was before.
However, that took a non-trivial performance hit when there was
"tail drop" on something that wasn't a continuous stream of traffic -
the tail got dropped, and there was nothing arriving out of order to
force immediate ACKs. (*) The standalone ACK timer was then the only
thing getting us back out, which means idleness.

I worked a number of "WAN performance problems" involving one of those
stacks where part of the solution was turning down the limits on the
ACK-avoidance heuristic by a considerable quantity. (And I say this as
someone with a fondness for the schemes.)

I cannot say with certainty that your idea would have the same
problems, but as you look to work out a solution to propose as a
patch, you will have to keep that in mind.
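To make the back-off behaviour concrete, here is a rough Python sketch
of the *kind* of receiver-side heuristic described above - the class
and method names are invented for illustration, and the real HP-UX /
Solaris code was of course nothing like this:

```python
class AckAvoider:
    """Hypothetical receiver-side ACK scheduling state.

    Stretches ACKs out from ack-every-other toward ack-every-N, and
    falls back whenever the standalone (delayed) ACK timer fires.
    """

    def __init__(self, max_avoid=8):
        self.max_avoid = max_avoid  # current ceiling on N ("ack every N")
        self.avoid = 2              # current N; 2 == standard ack-every-other
        self.unacked = 0            # in-order segments since the last ACK

    def on_segment(self, in_order=True):
        """Return True if an ACK should be sent for this segment."""
        if not in_order:
            # Out-of-order data: ACK immediately, so the sender's
            # fast-retransmit dup-ACK counting still works.
            self.unacked = 0
            return True
        self.unacked += 1
        if self.unacked >= self.avoid:
            # Things are going well: ACK now and stretch a bit
            # further next time, up to the current ceiling.
            self.unacked = 0
            self.avoid = min(self.avoid + 1, self.max_avoid)
            return True
        return False

    def on_ack_timer(self):
        """Standalone ACK timer fired: we stretched too far.

        Go back to ack-every-other and halve the allowed maximum,
        as described in the text above.
        """
        self.avoid = 2
        self.max_avoid = max(2, self.max_avoid // 2)
        self.unacked = 0
```

The failure mode described above falls out of this sketch directly: on
a tail drop with no further traffic, `on_segment` is never called
again, so nothing but `on_ack_timer` - i.e. idleness - ever restores
the ACK rate.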
rick

* Yes, the same holds true for a non-ACK-avoiding setup; the heuristic
simply made it worse - especially if the sender wanted to send but had
been limited by cwnd. The ACK(s) of the head of that chunk of data
were "avoided" and so wouldn't open the cwnd, which might otherwise
have allowed further segments, enabling detection of the dropped
segments. Even without losses, it also tended to interact poorly with
sending TCPs which wanted to increase the congestion window by one MSS
for each ACK, rather than based on the quantity of bytes covered by
the ACK.
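The last interaction can be sketched numerically. Below is a toy
comparison (function names and the 2*MSS cap are my own illustrative
assumptions; the byte-counting rule is in the spirit of RFC 3465) of
slow-start cwnd growth for a sender that counts ACKs versus one that
counts bytes, when facing an ACK-avoiding receiver:

```python
MSS = 1460  # illustrative maximum segment size in bytes

def slow_start_packet_counting(cwnd, acks):
    """Grow cwnd by one MSS per ACK received, ignoring bytes covered."""
    for _bytes_acked in acks:
        cwnd += MSS
    return cwnd

def slow_start_byte_counting(cwnd, acks, limit=2 * MSS):
    """Grow cwnd by the bytes each ACK newly covers, capped per ACK
    (Appropriate Byte Counting, roughly per RFC 3465 with L=2)."""
    for bytes_acked in acks:
        cwnd += min(bytes_acked, limit)
    return cwnd

# An ACK-avoiding receiver emits one stretch ACK covering 8 segments:
stretch_acks = [8 * MSS]
# A standard delayed-ACK receiver ACKs every other segment:
normal_acks = [2 * MSS] * 4
```

With an initial cwnd of 10*MSS, the packet-counting sender grows by
only one MSS on the stretch ACK but by four MSS against the standard
receiver, which is exactly the "interact poorly" effect noted in the
footnote; the byte-counting sender is far less sensitive to how many
ACKs cover the same data.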