From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Jerry Chu"
Subject: Re: Socket buffer sizes with autotuning
Date: Thu, 24 Apr 2008 17:49:33 -0700
Message-ID:
References: <1e41a3230804240932u510609beh8fb577baaadeb9bd@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: netdev@vger.kernel.org, "rick.jones2" , davem@davemloft.net
To: "John Heffner"
Return-path:
Received: from smtp-out.google.com ([216.239.33.17]:32726 "EHLO smtp-out.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753402AbYDYAtk (ORCPT ); Thu, 24 Apr 2008 20:49:40 -0400
Received: from zps77.corp.google.com (zps77.corp.google.com [172.25.146.77]) by smtp-out.google.com with ESMTP id m3P0nZEq027346 for ; Fri, 25 Apr 2008 01:49:35 +0100
Received: from wx-out-0506.google.com (wxct8.prod.google.com [10.70.121.8]) by zps77.corp.google.com with ESMTP id m3P0nXO2013685 for ; Thu, 24 Apr 2008 17:49:34 -0700
Received: by wx-out-0506.google.com with SMTP id t8so3378864wxc.30 for ; Thu, 24 Apr 2008 17:49:33 -0700 (PDT)
In-Reply-To: <1e41a3230804240932u510609beh8fb577baaadeb9bd@mail.gmail.com>
Content-Disposition: inline
Sender: netdev-owner@vger.kernel.org
List-ID:

On Thu, Apr 24, 2008 at 9:32 AM, John Heffner wrote:
>
> On Wed, Apr 23, 2008 at 4:29 PM, Jerry Chu wrote:
> >
> > I've been seeing the same problem here and am trying to fix it.
> > My fix is to not count those pkts still in the host queue as "prior_in_flight"
> > when feeding the latter to tcp_cong_avoid(). This should cause
> > tcp_is_cwnd_limited() test to fail when the previous in_flight build-up
> > is all due to the large host queue, and stop the cwnd to grow beyond
> > what's really necessary.
>
> Sounds like a useful optimization. Do you have a patch?

Am working on one, but still need to completely rootcause the problem first, and do a lot more testing.
Like Rick Jones, I had thought for a while that either the autotuning or the Congestion Window Validation (RFC 2861) code should dampen the cwnd growth, so the bug must be in one of them. Last week I decided to get to the bottom of this problem.

One question: I currently use skb_shinfo(skb)->dataref == 1 for skb's on the sk_write_queue list as the heuristic to determine if a packet has hit the wire. This seems a good solution for the normal cases, since it requires no driver changes to notify TCP in the xmit completion path. But I can imagine cases where another below-IP consumer of the skb, e.g., tcpdump, can nullify the heuristic. If the below-IP consumer causes the skb ref count to drop to 1 prematurely, the inflated cwnd problem comes back, but it's no worse than before. What if the below-IP skb reader causes the skb ref count to remain > 1 long after the pkts have hit the wire? That may cause the fix to prevent cwnd from growing when needed, hence hurting performance. Is there a better way than checking dataref to determine if a pkt has hit the wire?

Also, the code that determines when/how much to defer in the TSO path seems too aggressive. It's currently based on a percentage (sysctl_tcp_tso_win_divisor) of min(snd_wnd, snd_cwnd). Wouldn't the deferred amount be too much when that value is large? E.g., when I disable sysctl_tcp_tso_win_divisor, the cwnd of my simple netperf run drops by almost exactly 1/3, from 1037 (segments) to 695. It seems to me the TSO defer factor should be based on an absolute count, e.g., 64KB.

Jerry

>
> -John
>
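
As a rough illustration of the TSO deferral concern raised above, the following user-space sketch compares a threshold taken as a fraction of min(snd_wnd, snd_cwnd) with the same threshold capped at an absolute 64KB. It is not the kernel code: the function names, the byte/segment units, and the convention of returning 0 for a disabled divisor are all assumptions made for the example.

```c
#include <stdint.h>

/* The absolute count floated in the message above (64KB). */
#define TSO_DEFER_CAP_BYTES 65536u

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/*
 * Hypothetical helper: fraction-based defer threshold in bytes,
 * i.e. min(snd_wnd, cwnd) / divisor, with cwnd converted from
 * segments to bytes via the MSS. Returns 0 when divisor is 0,
 * which is only this sketch's way of saying "disabled".
 */
uint64_t defer_chunk_fraction(uint64_t snd_wnd_bytes, uint64_t cwnd_segs,
                              uint64_t mss, uint64_t divisor)
{
    if (divisor == 0)
        return 0;
    return min_u64(snd_wnd_bytes, cwnd_segs * mss) / divisor;
}

/*
 * Hypothetical helper: the same threshold, but never larger than
 * the absolute cap, so it stops scaling with a very large window.
 */
uint64_t defer_chunk_capped(uint64_t snd_wnd_bytes, uint64_t cwnd_segs,
                            uint64_t mss, uint64_t divisor)
{
    return min_u64(defer_chunk_fraction(snd_wnd_bytes, cwnd_segs, mss, divisor),
                   TSO_DEFER_CAP_BYTES);
}
```

With a cwnd of 1037 segments (the figure from the netperf run) and assumed values of a 4MB send window, a 1448-byte MSS, and a divisor of 3, the fraction-based threshold comes to 500525 bytes, while the capped variant stays at 65536. For small windows the two agree, so the cap only matters once the window grows large.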