From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nivedita Singhvi
Subject: Re: bad TSO performance in 2.6.9-rc2-BK
Date: Tue, 28 Sep 2004 00:23:38 -0700
Sender: netdev-bounce@oss.sgi.com
Message-ID: <4159117A.4010904@us.ibm.com>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "David S. Miller", Andi Kleen, andy.grover@gmail.com, anton@samba.org, netdev@oss.sgi.com
Return-path:
To: John Heffner
In-Reply-To:
Errors-to: netdev-bounce@oss.sgi.com
List-Id: netdev.vger.kernel.org

John Heffner wrote:
> On Thu, 23 Sep 2004, David S. Miller wrote:
>
>>I think I know what may be going on here.
>>
>>Let's say that we even get the congestion window openned up
>>so that we can build 64K TSO frames, that's around 43 or 44
>>1500 mtu frames.
>>
>>That means as the window fills up, we have to see 44 ACKs
>>before we are able to send the next TSO frame. Needless to
>>say that breaks ACK clocking completely.
>
> More specifically, I think it is an interaction with delayed ack (acking
> less than 1 virtual segment), and the small cwnd. This works for me, but
> I'm not sure that aren't some lurking problems still.

In terms of what goes out over the wire from the sender, there is
(or should be) no difference between the TSO and non-TSO cases.
The sequence of regular-sized packets should be the same, and at
most the only difference might be the delays between the frames.

So the sequence of acks coming back from the receiver should be
the same in the TSO and non-TSO cases. If we've sent out, say, 44
frames at a 1500 MTU, we should see roughly 22 acks back (acking
every second packet if delayed acks are on) in both cases.
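
As a rough check on the numbers above, here is a small Python sketch
of the arithmetic. It is only illustrative: the 40-byte header
overhead (IPv4 plus TCP, no options) and the helper names are
assumptions made for the example and are not taken from the kernel
code.

```python
import math

TSO_FRAME_BYTES = 64 * 1024  # the "64K TSO frame" from the thread
MTU = 1500                   # Ethernet MTU from the thread
IP_TCP_HEADERS = 40          # assumption: 20-byte IPv4 + 20-byte TCP, no options


def full_segments(tso_bytes, per_segment_bytes):
    """Number of whole segments that fit in one TSO frame."""
    return tso_bytes // per_segment_bytes


def expected_acks(segments, segments_per_ack=2):
    """Acks the receiver sends back; delayed ack covers every second segment."""
    return math.ceil(segments / segments_per_ack)


# Dividing by the full MTU versus by the payload per packet gives the
# two figures quoted in the thread ("around 43 or 44").
by_mtu = full_segments(TSO_FRAME_BYTES, MTU)
by_mss = full_segments(TSO_FRAME_BYTES, MTU - IP_TCP_HEADERS)

print(by_mtu, by_mss)                             # segments per TSO frame
print(expected_acks(by_mss))                      # acks with delayed ack on
print(expected_acks(by_mss, segments_per_ack=1))  # acks with delayed ack off
```

With delayed acks on, the ack count is about half the segment count
whether or not the sender used TSO, since the receiver only sees the
individual packets on the wire.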

In terms of overall throughput, assuming this connection were the
only work we were doing, we would see a gain in the TSO case only
if, by the time the congestion window had opened far enough for us
to send another virtual-MTU frame, the application had written
another frame's worth of data (minus the extra delta that it would
take for driver handoff and send at that point).

In the non-TSO case, I think the finer granularity actually helps
us utilize the channel more efficiently (although not the path down
the stack, or the CPU), although that is just another way of saying
that ack clocking is bumpy.

But I guess my question is: don't we need some heuristics to figure
out when we should send a partial frame (i.e. stop waiting for a
full TSO frame)?

thanks,
Nivedita