From mboxrd@z Thu Jan 1 00:00:00 1970
From: Willy Tarreau
Subject: Re: TCP: orphans broken by RFC 2525 #2.17
Date: Sun, 26 Sep 2010 19:40:15 +0200
Message-ID: <20100926174014.GA12373@1wt.eu>
References: <20100926131717.GA13046@1wt.eu> <1285520567.2530.8.camel@edumazet-laptop>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: netdev@vger.kernel.org
To: Eric Dumazet
Return-path:
Received: from 1wt.eu ([62.212.114.60]:45691 "EHLO 1wt.eu" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1757197Ab0IZRkS (ORCPT );
	Sun, 26 Sep 2010 13:40:18 -0400
Content-Disposition: inline
In-Reply-To: <1285520567.2530.8.camel@edumazet-laptop>
Sender: netdev-owner@vger.kernel.org
List-ID:

Hi Eric,

On Sun, Sep 26, 2010 at 07:02:47PM +0200, Eric Dumazet wrote:
> How could we delay the close() ? We must either send a FIN or RST.

I don't mean to delay the close(), but I'm aware that my description
was not very clear. Here's what I would find normal :

  1) upon close(), we send a FIN, whether there are incoming pending
     data or not (after all, the difference is only a timing issue, as
     the data in the rx buffer might very well have come just after the
     FIN, as it almost always does, BTW). The connection then becomes
     FIN_WAIT1 just as now.

  2) mark the socket as orphaned

  3) when an ACK comes from the other side, either it's below our last
     seq, and we simply ignore it, just as if we were in TIME_WAIT, or
     it is equal to the last seq and indicates that it's now safe to
     reset ; we would then just send the RST to notify the other side
     that the data it sent were not read. The connection can then
     either be destroyed or put in TIME_WAIT. It's the point where the
     connection normally switches from FIN_WAIT1 to FIN_WAIT2, since
     the FIN has been acked. The only difference is that we don't need
     a FIN_WAIT2 state for an orphan.

> I would say, fix the program, so that RST is avoided ?

Not that easy, see below.
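
To make step 3 above more concrete, here is a rough userspace model of
the decision. It is only a sketch, not kernel code: the names
orphan_on_ack(), struct orphan and fin_end_seq are made up for
illustration, and it assumes the orphan was closed with unread data
pending, which is the only case where the RST is wanted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an orphaned socket that had unread rx data at close(). */
enum orphan_state  { ORPHAN_FIN_WAIT1, ORPHAN_DONE };
enum orphan_action { ORPHAN_IGNORE, ORPHAN_SEND_RST };

struct orphan {
    enum orphan_state state;
    uint32_t fin_end_seq;   /* sequence number just past our FIN */
};

/*
 * Decide what to do when an ACK arrives on the orphan.
 * - ACK not covering our FIN: ignore it, some of our data is still in flight.
 * - ACK equal to fin_end_seq: everything we sent was received, so it is
 *   now safe to send the RST; the orphan leaves FIN_WAIT1 for good.
 */
enum orphan_action orphan_on_ack(struct orphan *o, uint32_t ack)
{
    assert(o != NULL);

    if (o->state != ORPHAN_FIN_WAIT1 || ack != o->fin_end_seq)
        return ORPHAN_IGNORE;

    o->state = ORPHAN_DONE;     /* destroyed or TIME_WAIT in a real stack */
    return ORPHAN_SEND_RST;
}
```

With fin_end_seq = 1000, an ACK of 990 is ignored and an ACK of 1000
triggers the RST, exactly at the point where a normal socket would move
from FIN_WAIT1 to FIN_WAIT2.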

> The program does :
>
> recv() // read the request
> send() // queue the answer
> close() // could work if world was perfect...
>
> Change it to
>
> recv()
> send()
> shutdown()
> recv() // read & flush in excess data

New data arrives now, close() below will cause an RST again.

> close()
>
> This for sure will send FIN after all queued data is sent.
> I am not sure the final rcv() is even needed, its Sunday after all ;)

Currently the real code (ie: not the poc I posted) does :

   recv()
   send()
   shutdown()
   close()

The extra CRLF almost always arrives between the recv() and the send().
What I intend to do as a workaround is exactly what you described above,
but I'm well aware it's not enough. It will only reduce the rate at
which this case happens. Well, in fact, in 10 years of production at
many sites, it's the first time such an issue has been reported, and it
could be tracked down to these two extra bytes. But the workaround will
not prevent the two extra bytes from coming after the last recv().

Also, the issue remains when processing large POST requests. Let's
suppose the application is receiving a massive POST (eg: 10 MB) but the
request is not authenticated, so the application returns an HTTP 401
response to require the client to authenticate. There's no way for the
application to be notified that the small response was completely read
by the client and that it's safe to close().

For these reasons, I concluded that the application can't get
everything right and needs help from the kernel (said differently, I
think that the RFC2525 fix is causing harm in addition to good). In my
opinion, this section in the RFC was added based on a few observations
of trivial cases, but its impact was not completely explored.

I'm willing to experiment, but I'm not very familiar with the code
itself and sometimes I'm not sure about what I'm doing, so some help
would probably be welcome.
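
For reference, the shutdown()/recv()/close() workaround discussed above
could look roughly like the sketch below. The helper name
lingering_close() and the per-poll timeout are my own assumptions, not
something taken from the real code.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send our FIN, then read and discard whatever the
 * peer still sends until it closes or stays silent for timeout_ms.
 * Returns the number of bytes discarded, or -1 if shutdown() failed.
 * The descriptor is always closed on return.
 */
long lingering_close(int fd, int timeout_ms)
{
    char buf[4096];
    long drained = 0;

    assert(fd >= 0);

    /* FIN goes out after all queued data; the read side stays open. */
    if (shutdown(fd, SHUT_WR) < 0 && errno != ENOTCONN) {
        close(fd);
        return -1;
    }

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int ready = poll(&pfd, 1, timeout_ms);

        if (ready < 0 && errno == EINTR)
            continue;
        if (ready <= 0)
            break;              /* timeout or poll error: give up */

        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            break;              /* 0: peer sent its FIN, <0: error */
        drained += n;
    }

    close(fd);
    return drained;
}
```

If the loop ends because the peer sent its FIN, nothing more can arrive
and close() is clean. If it ends on the timeout, data arriving
afterwards can still trigger the RST, which is why this only narrows
the window.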

What I'd like to do is to implement step 3 above, which is to only send
the RST upon receipt of an ACK on an orphan that would switch a normal
socket from FIN_WAIT1 to FIN_WAIT2.

Also, I'm not sure about what other OSes are doing. For instance, I
tried on Solaris and did not observe the issue at all, though I think
that Solaris simply does not implement the RFC2525 recommendation.

Have a nice Sunday evening ;-)
Willy