From mboxrd@z Thu Jan 1 00:00:00 1970
From: Willy Tarreau
Subject: Re: TCP: orphans broken by RFC 2525 #2.17
Date: Sun, 26 Sep 2010 23:46:14 +0200
Message-ID: <20100926214614.GD12373@1wt.eu>
References: <20100926131717.GA13046@1wt.eu> <1285520567.2530.8.camel@edumazet-laptop> <20100926174014.GA12373@1wt.eu> <1285526115.2530.12.camel@edumazet-laptop> <20100926184914.GC12373@1wt.eu> <1285534871.2357.19.camel@edumazet-laptop>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: netdev@vger.kernel.org
To: Eric Dumazet
Return-path: 
Received: from 1wt.eu ([62.212.114.60]:45698 "EHLO 1wt.eu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757639Ab0IZVqR (ORCPT ); Sun, 26 Sep 2010 17:46:17 -0400
Content-Disposition: inline
In-Reply-To: <1285534871.2357.19.camel@edumazet-laptop>
Sender: netdev-owner@vger.kernel.org
List-ID: 

On Sun, Sep 26, 2010 at 11:01:11PM +0200, Eric Dumazet wrote:
> On Sunday, 26 September 2010 at 20:49 +0200, Willy Tarreau wrote:
> > On Sun, Sep 26, 2010 at 08:35:15PM +0200, Eric Dumazet wrote:
> > > I was referring to this code. It works well for me.
> > >
> > >     shutdown(fd, SHUT_RDWR);
> > >     while (recv(fd, buff, sizeof(buff), 0) > 0)
> > >             ;
> > >     close(fd);
> >
> > Ah, this one yes, but it's overkill. We're actively pumping data from the
> > other side just to drop it on the floor until it finally closes, while we
> > only need to know when it has ACKed the FIN. In practice, doing that on a
> > POST request which causes a 302 or 401 will result in the whole amount of
> > data being transferred twice. Not only is this bad for bandwidth, it is
> > also bad for the user, as we're causing him to experience a complete
> > upload twice, just to be sure the client has received the FIN, while it's
> > pretty obvious that this is not necessary in 99.9% of the cases.
> >
>
> I don't understand how recv() could transfer data twice.
That's not what I said; I said the client would have to retransfer. Here's
what typically happens, for instance with an authentication requirement:

    Client                            Server
      SYN ----------->
          <----------- SYN/ACK
      ACK ----------->

      POST /some_url HTTP/1.1
      Host: xxx
      Content-length: 10 meg
      ------ headers are sent ---->

      xxxxxxxxxxxxx
      xxxxxxxxxxxxx
      ----- data are being sent ->
                                      HTTP/1.1 401 Unauthorized
      xxxxxxxxxxxxx                   WWW-Authenticate: basic realm="xxx"
      ...                             Connection: close
                                      <----------------------------
      xxxxxxxxxxxxx
                                      <---------------------------- FIN
      xxxxxxxxxxxxx
      ...       10 megs of data being sent and drained by the server
      xxxxxxxxxxxxx
      FIN --------------->
          <-------------- ACK

    second attempt, with credentials this time:

      SYN ----------->
          <----------- SYN/ACK
      ACK ----------->

      POST /some_url HTTP/1.1
      Host: x
      Authorization: basic xyz
      Content-length: 10 meg
      ------ headers are sent ---->

      xxxxxxxxxxxxx
      xxxxxxxxxxxxx
      xxxxxxxxxxxxx
      etc...

So in this case the data is effectively transmitted twice. With an RST
once the client acks the FIN, the first transfer aborts very early
instead, saving half of the bandwidth.

> You only read from the socket receive queue into a null buffer, and in most
> cases a single recv() call will be enough to drain the queue.

Indeed, in *most* cases, and right now in most cases there is no
problem. As I said, this is the first reported issue in 10 years and
hundreds of billions of connections cumulated on various sites. But instead
of really fixing the issue, it just reduces its occurrences. Also, it only
works for low-bandwidth clients (the most common case too). That's what I'm
going to implement anyway, but it is an unreliable workaround. All I know is
that it will probably divide by 10 the number of times this problem is
encountered, but it will not fix it.

> > Since this method is the least efficient one and clearly not acceptable
> > for practical cases, I wanted to dig at the root, where the information
> > is known.
> > And the TCP recv code is precisely the place where we know
> > exactly when it's safe to reset.
> >
>
> And it's safe to reset exactly when the application just does close(), if
> unread data was not drained. Not only is it safe, it's required. A new RFC
> might be needed?

I'm not requesting a new RFC; I'm just trying to make correct use
of orphans as implemented in the Linux stack, and I'm realizing that
since RFC 2525 was implemented, orphans cannot be relied on at all anymore.
We could simply delete all the orphans code and emit an RST immediately upon
close(); there is no safe use of them now. And that's my concern. In my
opinion, the code is there and was written precisely for that usage. Since
I'm seeing that it can't be used for what it was designed for, I'm naturally
interested in trying to get it usable again.

And in fact, when I really want an RST, I can already get one by
disabling lingering before the close(). This too shows that close()
by default protects orphaned data.

> > Also there's another issue in doing this. It requires polling of the
> > receive side for all requests, which adds one epoll_ctl() syscall and
> > one recv() call, which have a quite noticeable negative performance
> > impact at high rates (at 100000 connections per second, every syscall
> > counts). For now I could very well consider doing this only for
> > POST requests, which currently are the ones exhibiting the issue the
> > most, but since HTTP browsers will try to enable pipelining again
> > soon, the problem will generalize to all types of requests. Hence my
> > attempts to do it the optimal way.
>
> This might be overkill but is a portable way of doing this, on all
> versions of Linux.

I'm not discussing portability at all. You see, right now, Linux is
by some margin the fastest platform to run haproxy, and the one I always
recommend for that.
Some people experience good performance on FreeBSD too, but the
fine-grained control we have on Linux helps maintain a high level of
performance by avoiding many unnecessary steps when we can trust the
OS to do some things correctly. Having workarounds for some versions
that we know don't work as expected is not an issue, and it's even a
good reason to sometimes make people upgrade. But having to cut
performance in half under Linux on some workloads because user land
can't know something obvious that the OS knows is a bit of a waste.

> Sure, you could patch the kernel to add a new system call
>
>     close_proper(fd);

That would just be the normal close() (the one with lingering enabled)
in theory.

> As shutdown() only uses two bits, you could eventually add another bit to
> flush the receive queue as well (to avoid the copy of it)

This is a good idea, but it will still leave some incorrectly handled
cases where the other side has the time to send a few packets between
the shutdown() and the close().

> Another question is: why does the server close the socket, if you believe
> more pipelining is coming soon?

There are quite a bunch of situations in HTTP where you have no other
solution than closing. All responses that don't advertise a length must
be terminated by a close. Haproxy is a reverse proxy, so it sits between
the client and the server. When a server sends such responses, haproxy
must forward the close to the client, regardless of what's in the request
buffer. Also, some response codes require a close. The 400 Bad Request,
for instance, implies a mandatory close (as do many 4xx or 5xx). All
redirects (301/302/303/307) should lead to a close if the target is
another site. Even though we optimize for the most common cases, that
doesn't save us from having to support the legacy cases.

Regards,
Willy