netdev.vger.kernel.org archive mirror
* TCP: orphans broken by RFC 2525 #2.17
@ 2010-09-26 13:17 Willy Tarreau
  2010-09-26 17:02 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Willy Tarreau @ 2010-09-26 13:17 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 6731 bytes --]

Hi,

one haproxy user was reporting occasionally truncated responses,
exclusively to HTTP POST requests. After he took many captures, we
could verify that the strace dumps showed all data being emitted,
but the network captures showed that an RST was emitted before
the end of the data.

Looking more closely, I noticed that in traces showing the issue,
the client was sending an additional CRLF after the data in a
separate packet (permitted even though not recommended).

I could thus finally understand what happens and I'm now able to
reproduce it very easily using the attached program. What happens
is that haproxy sends the last data to the client, followed by a
shutdown()+close(). This is mimicked by the attached program,
which is connected to by a simple netcat from another machine
sending two distinct chunks :

server:$ ./abort-data
client:$ (echo "req1";usleep 200000; echo "req2") | nc6 server 8000
block1
("block2" is missing here)
client:$

It gives the following capture, with client=10.8.3.4 and server=10.8.3.1 :

reading from file abort-linux.cap, link-type EN10MB (Ethernet)
10:47:07.057793 IP (tos 0x0, ttl 64, id 57159, offset 0, flags [DF], proto TCP (6), length 60)
    10.8.3.4.39925 > 10.8.3.1.8000: Flags [S], cksum 0xdad9 (correct), seq 2570439277, win 5840, options [mss 1460,sackOK,TS val 138417450 ecr 0,nop,wscale 6], length 0
10:47:07.058015 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.8.3.1.8000 > 10.8.3.4.39925: Flags [S.], cksum 0x3851 (correct), seq 1066199564, ack 2570439278, win 5792, options [mss 1460,sackOK,TS val 295921514 ecr 138417450,nop,wscale 7], length 0
10:47:07.058071 IP (tos 0x0, ttl 64, id 57160, offset 0, flags [DF], proto TCP (6), length 52)
    10.8.3.4.39925 > 10.8.3.1.8000: Flags [.], cksum 0x7d60 (correct), seq 2570439278, ack 1066199565, win 92, options [nop,nop,TS val 138417451 ecr 295921514], length 0
10:47:07.058213 IP (tos 0x0, ttl 64, id 57161, offset 0, flags [DF], proto TCP (6), length 57)
    10.8.3.4.39925 > 10.8.3.1.8000: Flags [P.], cksum 0x1a40 (incorrect -> 0x8fbc), seq 2570439278:2570439283, ack 1066199565, win 92, options [nop,nop,TS val 138417451 ecr 295921514], length 5
10:47:07.058410 IP (tos 0x0, ttl 64, id 36199, offset 0, flags [DF], proto TCP (6), length 52)
    10.8.3.1.8000 > 10.8.3.4.39925: Flags [.], cksum 0x7d89 (correct), seq 1066199565, ack 2570439283, win 46, options [nop,nop,TS val 295921514 ecr 138417451], length 0
10:47:07.253294 IP (tos 0x0, ttl 64, id 57162, offset 0, flags [DF], proto TCP (6), length 53)
    10.8.3.4.39925 > 10.8.3.1.8000: Flags [P.], cksum 0x1a3c (incorrect -> 0x7321), seq 2570439283:2570439284, ack 1066199565, win 92, options [nop,nop,TS val 138417500 ecr 295921514], length 1
10:47:07.253468 IP (tos 0x0, ttl 64, id 36200, offset 0, flags [DF], proto TCP (6), length 52)
    10.8.3.1.8000 > 10.8.3.4.39925: Flags [.], cksum 0x7d27 (correct), seq 1066199565, ack 2570439284, win 46, options [nop,nop,TS val 295921562 ecr 138417500], length 0
10:47:08.060213 IP (tos 0x0, ttl 64, id 36201, offset 0, flags [DF], proto TCP (6), length 59)
    10.8.3.1.8000 > 10.8.3.4.39925: Flags [P.], cksum 0x354c (correct), seq 1066199565:1066199572, ack 2570439284, win 46, options [nop,nop,TS val 295921765 ecr 138417500], length 7
10:47:08.060270 IP (tos 0x0, ttl 64, id 57163, offset 0, flags [DF], proto TCP (6), length 52)
    10.8.3.4.39925 > 10.8.3.1.8000: Flags [.], cksum 0x7b5e (correct), seq 2570439284, ack 1066199572, win 92, options [nop,nop,TS val 138417701 ecr 295921765], length 0
10:47:08.060298 IP (tos 0x0, ttl 64, id 36202, offset 0, flags [DF], proto TCP (6), length 52)
    10.8.3.1.8000 > 10.8.3.4.39925: Flags [R.], cksum 0x7c51 (correct), seq 1066199572, ack 2570439284, win 46, options [nop,nop,TS val 295921765 ecr 138417500], length 0
10:47:08.060613 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    10.8.3.1.8000 > 10.8.3.4.39925: Flags [R], cksum 0xb0f5 (correct), seq 1066199572, win 0, length 0
.

The connection should in theory become an orphan. I'm saying "in theory",
because, since the following test was added to tcp_close(), if the client
happens to send any data between the last recv() and the close(), we
immediately send an RST to it, regardless of any pending outgoing data :

        /* As outlined in RFC 2525, section 2.17, we send a RST here because
         * data was lost. To witness the awful effects of the old behavior of
         * always doing a FIN, run an older 2.1.x kernel or 2.0.x, start a bulk
         * GET in an FTP client, suspend the process, wait for the client to
         * advertise a zero window, then kill -9 the FTP client, wheee...
         * Note: timeout is always zero in such a case.
         */
        if (data_was_unread) {
                /* Unread data was tossed, zap the connection. */
                NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
                tcp_set_state(sk, TCP_CLOSE);
                tcp_send_active_reset(sk, sk->sk_allocation);
	}

The immediate effect then is that the client receives an abort before it
even gets the last data that were scheduled to be sent.

I've read RFC 2525 #2.17 and it shows quite interesting examples of what
it wanted to protect against. However, the recommendation did not consider
the fact that there could be some unacked pending data in the outgoing
buffers.

What is even more embarrassing is that the HTTP working group is
trying to encourage browsers to enable pipelining by default. That means
that the situation above can become much more common, where two requests
will be pipelined, the first one will cause a short response followed by
a close(), and the simple presence of the second one will kill the first
one's data.

I tried to think about a finer way to process those unwanted data. Ideally,
we should just ignore them until the ACK indicates that our last segment was
properly received. Then we could emit the RST.

I made a few attempts by first changing the test above like this :

-        if (data_was_unread) {
+        if (data_was_unread && !tcp_sk(sk)->packets_out) {

then fiddling a little bit in tcp_input.c:tcp_rcv_state_process() for
the TCP_FIN_WAIT1 state, but I'm not satisfied with my experiments:
they were a bit too experimental for the results to be considered
reliable.

What I was looking for was a way to only send an RST when the socket is
an orphan and all of its outgoing data has been ACKed. This would cover
the situations that RFC 2525 #2.17 tries to fix without rendering orphans
unusable.

Does anyone have an opinion on this, or could even suggest a patch to
relax the conditions in which we send an RST ?

Thanks,
Willy


[-- Attachment #2: abort-data.c --]
[-- Type: text/plain, Size: 1512 bytes --]

#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

int port = 8000;
int one = 1;

struct sockaddr_in lst_a;
int lst_fd, srv_fd;
int lbuf;

char buf[1024];

void die_msg(const char *msg)
{
	if (msg)
		fprintf(stderr, "%s\n", msg);
	exit(1);
}

void die_err(const char *msg)
{
	perror(msg);
	exit(1);
}

int main(int argc, char **argv)
{
	if (argc > 1)
		port = atol(argv[1]);

	if ((lst_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) == -1)
		die_err("socket");

	if ((setsockopt(lst_fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one))) == -1)
		die_err("setsockopt");

	bzero((char *)&lst_a, sizeof(lst_a));
	lst_a.sin_family      = AF_INET;
	lst_a.sin_addr.s_addr = htonl(INADDR_ANY);
	lst_a.sin_port        = htons(port);
	if (bind(lst_fd, (struct sockaddr *)&lst_a, sizeof(lst_a)) == -1)
		die_err("bind");

	if (listen(lst_fd, 1) == -1)
		die_err("listen");

	if ((srv_fd = accept(lst_fd, NULL, NULL)) == -1)
		die_err("accept");

	fprintf(stderr, "accept() returns %d\n", srv_fd);

	if ((lbuf = recv(srv_fd, buf, sizeof(buf), 0)) == -1)
		die_err("recv");

	fprintf(stderr, "recv() returns %d\n", lbuf);

	/* now simulate some processing time. If the sender sends any more
	 * data during the sleep(), it causes the response to be truncated.
	 */
	sleep(1);

	send(srv_fd, "block1\n", 7, 0);
	send(srv_fd, "block2\n", 7, 0);
	//shutdown(srv_fd, SHUT_WR);
	close(srv_fd);
	return 0;
}


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 13:17 TCP: orphans broken by RFC 2525 #2.17 Willy Tarreau
@ 2010-09-26 17:02 ` Eric Dumazet
  2010-09-26 17:40   ` Willy Tarreau
  2010-09-26 22:13 ` David Miller
  2010-09-27  8:02 ` Herbert Xu
  2 siblings, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2010-09-26 17:02 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

Le dimanche 26 septembre 2010 à 15:17 +0200, Willy Tarreau a écrit :
> Hi,
> 
> one haproxy user was reporting occasionally truncated responses,
> exclusively to HTTP POST requests. After he took many captures, we
> could verify that the strace dumps showed all data being emitted,
> but the network captures showed that an RST was emitted before
> the end of the data.
> 
> Looking more closely, I noticed that in traces showing the issue,
> the client was sending an additional CRLF after the data in a
> separate packet (permitted even though not recommended).
> 
> I could thus finally understand what happens and I'm now able to
> reproduce it very easily using the attached program. What happens
> is that haproxy sends the last data to the client, followed by a
> shutdown()+close(). This is mimicked by the attached program,
> which is connected to by a simple netcat from another machine
> sending two distinct chunks :
> 
> server:$ ./abort-data
> client:$ (echo "req1";usleep 200000; echo "req2") | nc6 server 8000
> block1
> ("block2" is missing here)
> client:$
> 
> It gives the following capture, with client=10.8.3.4 and server=10.8.3.1 :
> 
> reading from file abort-linux.cap, link-type EN10MB (Ethernet)
> 10:47:07.057793 IP (tos 0x0, ttl 64, id 57159, offset 0, flags [DF], proto TCP (6), length 60)
>     10.8.3.4.39925 > 10.8.3.1.8000: Flags [S], cksum 0xdad9 (correct), seq 2570439277, win 5840, options [mss 1460,sackOK,TS val 138417450 ecr 0,nop,wscale 6], length 0
> 10:47:07.058015 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
>     10.8.3.1.8000 > 10.8.3.4.39925: Flags [S.], cksum 0x3851 (correct), seq 1066199564, ack 2570439278, win 5792, options [mss 1460,sackOK,TS val 295921514 ecr 138417450,nop,wscale 7], length 0
> 10:47:07.058071 IP (tos 0x0, ttl 64, id 57160, offset 0, flags [DF], proto TCP (6), length 52)
>     10.8.3.4.39925 > 10.8.3.1.8000: Flags [.], cksum 0x7d60 (correct), seq 2570439278, ack 1066199565, win 92, options [nop,nop,TS val 138417451 ecr 295921514], length 0
> 10:47:07.058213 IP (tos 0x0, ttl 64, id 57161, offset 0, flags [DF], proto TCP (6), length 57)
>     10.8.3.4.39925 > 10.8.3.1.8000: Flags [P.], cksum 0x1a40 (incorrect -> 0x8fbc), seq 2570439278:2570439283, ack 1066199565, win 92, options [nop,nop,TS val 138417451 ecr 295921514], length 5
> 10:47:07.058410 IP (tos 0x0, ttl 64, id 36199, offset 0, flags [DF], proto TCP (6), length 52)
>     10.8.3.1.8000 > 10.8.3.4.39925: Flags [.], cksum 0x7d89 (correct), seq 1066199565, ack 2570439283, win 46, options [nop,nop,TS val 295921514 ecr 138417451], length 0
> 10:47:07.253294 IP (tos 0x0, ttl 64, id 57162, offset 0, flags [DF], proto TCP (6), length 53)
>     10.8.3.4.39925 > 10.8.3.1.8000: Flags [P.], cksum 0x1a3c (incorrect -> 0x7321), seq 2570439283:2570439284, ack 1066199565, win 92, options [nop,nop,TS val 138417500 ecr 295921514], length 1
> 10:47:07.253468 IP (tos 0x0, ttl 64, id 36200, offset 0, flags [DF], proto TCP (6), length 52)
>     10.8.3.1.8000 > 10.8.3.4.39925: Flags [.], cksum 0x7d27 (correct), seq 1066199565, ack 2570439284, win 46, options [nop,nop,TS val 295921562 ecr 138417500], length 0
> 10:47:08.060213 IP (tos 0x0, ttl 64, id 36201, offset 0, flags [DF], proto TCP (6), length 59)
>     10.8.3.1.8000 > 10.8.3.4.39925: Flags [P.], cksum 0x354c (correct), seq 1066199565:1066199572, ack 2570439284, win 46, options [nop,nop,TS val 295921765 ecr 138417500], length 7
> 10:47:08.060270 IP (tos 0x0, ttl 64, id 57163, offset 0, flags [DF], proto TCP (6), length 52)
>     10.8.3.4.39925 > 10.8.3.1.8000: Flags [.], cksum 0x7b5e (correct), seq 2570439284, ack 1066199572, win 92, options [nop,nop,TS val 138417701 ecr 295921765], length 0
> 10:47:08.060298 IP (tos 0x0, ttl 64, id 36202, offset 0, flags [DF], proto TCP (6), length 52)
>     10.8.3.1.8000 > 10.8.3.4.39925: Flags [R.], cksum 0x7c51 (correct), seq 1066199572, ack 2570439284, win 46, options [nop,nop,TS val 295921765 ecr 138417500], length 0
> 10:47:08.060613 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
>     10.8.3.1.8000 > 10.8.3.4.39925: Flags [R], cksum 0xb0f5 (correct), seq 1066199572, win 0, length 0
> .
> 
> The connection should in theory become an orphan. I'm saying "in theory",
> because, since the following test was added to tcp_close(), if the client
> happens to send any data between the last recv() and the close(), we
> immediately send an RST to it, regardless of any pending outgoing data :
> 
>         /* As outlined in RFC 2525, section 2.17, we send a RST here because
>          * data was lost. To witness the awful effects of the old behavior of
>          * always doing a FIN, run an older 2.1.x kernel or 2.0.x, start a bulk
>          * GET in an FTP client, suspend the process, wait for the client to
>          * advertise a zero window, then kill -9 the FTP client, wheee...
>          * Note: timeout is always zero in such a case.
>          */
>         if (data_was_unread) {
>                 /* Unread data was tossed, zap the connection. */
>                 NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
>                 tcp_set_state(sk, TCP_CLOSE);
>                 tcp_send_active_reset(sk, sk->sk_allocation);
> 	}
> 
> The immediate effect then is that the client receives an abort before it
> even gets the last data that were scheduled to be sent.
> 
> I've read RFC 2525 #2.17 and it shows quite interesting examples of what
> it wanted to protect against. However, the recommendation did not consider
> the fact that there could be some unacked pending data in the outgoing
> buffers.
> 
> What is even more embarrassing is that the HTTP working group is
> trying to encourage browsers to enable pipelining by default. That means
> that the situation above can become much more common, where two requests
> will be pipelined, the first one will cause a short response followed by
> a close(), and the simple presence of the second one will kill the first
> one's data.
> 
> I tried to think about a finer way to process those unwanted data. Ideally,
> we should just ignore them until the ACK indicates that our last segment was
> properly received. Then we could emit the RST.
> 
> I made a few attempts by first changing the test above like this :
> 
> -        if (data_was_unread) {
> +        if (data_was_unread && !tcp_sk(sk)->packets_out) {
> 
> then fiddling a little bit in tcp_input.c:tcp_rcv_state_process() for
> the TCP_FIN_WAIT1 state, but I'm not satisfied with my experiments:
> they were a bit too experimental for the results to be considered
> reliable.
> 
> What I was looking for was a way to only send an RST when the socket is
> an orphan and all of its outgoing data has been ACKed. This would cover
> the situations that RFC 2525 #2.17 tries to fix without rendering orphans
> unusable.
> 
> Does anyone have an opinion on this, or could even suggest a patch to
> relax the conditions in which we send an RST ?

How could we delay the close() ? We must either send a FIN or RST.

I would say, fix the program, so that RST is avoided ?

The program does :

recv() // read the request
send() // queue the answer
close() // could work if world was perfect...

Change it to

recv()
send()
shutdown()
recv() // read & flush in excess data
close()

This for sure will send FIN after all queued data is sent.
I am not sure the final recv() is even needed, it's Sunday after all ;)




* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 17:02 ` Eric Dumazet
@ 2010-09-26 17:40   ` Willy Tarreau
  2010-09-26 18:35     ` Eric Dumazet
  2010-09-26 19:16     ` Willy Tarreau
  0 siblings, 2 replies; 39+ messages in thread
From: Willy Tarreau @ 2010-09-26 17:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Hi Eric,

On Sun, Sep 26, 2010 at 07:02:47PM +0200, Eric Dumazet wrote:
> How could we delay the close() ? We must either send a FIN or RST.

I don't mean to delay the close(), but I'm aware that my description
was not very clear.

Here's what I would find normal :

1) upon close(), we send a FIN, whether there are incoming pending
   data or not (after all, the difference is only a timing
   issue, as the data in the rx buffer might very well come just
   after the FIN, as it almost always does, BTW). The connection
   then becomes FIN_WAIT1 just as now.

2) mark the socket as orphaned

3) when an ACK comes from the other side, either it's below our last
   seq, and we simply ignore it, just as if we were in TIME_WAIT, or
   it is equal to the last seq and indicates that it's now safe to
   reset ; we would then just send the RST to notify the other side
   that the data it sent were not read. The connection can then either
   be destroyed or put in TIME_WAIT. It's the point where the connection
   normally switches from FIN_WAIT1 to FIN_WAIT2, since the FIN has been
   acked. The only difference is that we don't need a FIN_WAIT2 state
   for an orphan.

> I would say, fix the program, so that RST is avoided ?

Not that easy, see below.

> The program does :
> 
> recv() // read the request
> send() // queue the answer
> close() // could work if world was perfect...
> 
> Change it to
> 
> recv()
> send()
> shutdown()
> recv() // read & flush in excess data

New data arrives now, close() below will cause an RST again.

> close()
> 
> This for sure will send FIN after all queued data is sent.
> I am not sure the final recv() is even needed, it's Sunday after all ;)

Currently the real code (ie: not the poc I posted) does :

   recv()
   send()
   shutdown()
   close()

The extra CRLF almost always happens between the recv() and send(). What
I intend to do as a workaround is exactly what you described above, but
I'm well aware it's not enough. It will only reduce the rate at which this
case happens. Well, in fact, in 10 years of production at many sites, it's
the first time such an issue has been reported, and it could be tracked down to
these two extra bytes. But the workaround will not prevent the two extra
bytes from coming after the last recv().

Also, the issue remains when processing large POST requests. Let's suppose
the application is receiving a massive POST (eg: 10 MB) but the request is
not authenticated, so the application returns an HTTP 401 response to
require the client to authenticate. There's no way for the application to
be notified that the small response was completely read by the client and
that it's safe to close().

For these reasons, I concluded that the application can't get everything
right and needs help from the kernel (said differently, I think that the
RFC2525 fix is causing harm in addition to good). In my opinion, this
section in the RFC was added based on a few observations of trivial cases,
but its impact was not completely explored.

I'm willing to experiment, but I'm not very familiar with the code itself
and sometimes I'm not sure about what I'm doing, so some help would
probably be welcome. What I'd like to do is to implement step 3 above,
which is to only send the RST upon receipt of an ACK on an orphan that
would switch a normal socket from FIN_WAIT1 to FIN_WAIT2.

Also, I'm not sure about what other OSes are doing. For instance, I tried
on Solaris and did not observe the issue at all, though I think that
Solaris simply does not implement the RFC2525 recommendation.

Have a nice sunday evening ;-)
Willy



* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 17:40   ` Willy Tarreau
@ 2010-09-26 18:35     ` Eric Dumazet
  2010-09-26 18:49       ` Willy Tarreau
  2010-09-26 19:16     ` Willy Tarreau
  1 sibling, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2010-09-26 18:35 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

Le dimanche 26 septembre 2010 à 19:40 +0200, Willy Tarreau a écrit :
> Hi Eric,
> 
> On Sun, Sep 26, 2010 at 07:02:47PM +0200, Eric Dumazet wrote:
> > How could we delay the close() ? We must either send a FIN or RST.
> 
> I don't mean to delay the close(), but I'm aware that my description
> was not very clear.
> 
> Here's what I would find normal :
> 
> 1) upon close(), we send a FIN, whether there are incoming pending
>    data or not (after all, the only difference is only a timing
>    issue, as the data in the rx buffer might very well come just
>    after the FIN, as it almost always does, BTW). The connection
>    then becomes FIN_WAIT1 just as now.
> 
> 2) mark the socket as orphaned
> 
> 3) when an ACK comes from the other side, either it's below our last
>    seq, and we simply ignore it, just as if we were in TIME_WAIT, or
>    it is equal to the last seq and indicates that it's now safe to
>    reset ; we would then just send the RST to notify the other side
>    that the data it sent were not read. The connection can then either
>    be destroyed or put in TIME_WAIT. It's the point where the connection
>    normally switches from FIN_WAIT1 to FIN_WAIT2, since the FIN has been
>    acked. The only difference is that we don't need a FIN_WAIT2 state
>    for an orphan.
> 
> > I would say, fix the program, so that RST is avoided ?
> 
> Not that easy, see below.
> 
> > The program does :
> > 
> > recv() // read the request
> > send() // queue the answer
> > close() // could work if world was perfect...
> > 
> > Change it to
> > 
> > recv()
> > send()
> > shutdown()
> > recv() // read & flush in excess data
> 
> New data arrives now, close() below will cause an RST again.
> 
> > close()
> > 
> > This for sure will send FIN after all queued data is sent.
> > I am not sure the final recv() is even needed, it's Sunday after all ;)
> 
> Currently the real code (ie: not the poc I posted) does :
> 
>    recv()
>    send()
>    shutdown()
>    close()
> 
> The extra CRLF almost always happens between the recv() and send(). What
> I intend to do as a workaround is exactly what you described above, but
> I'm well aware it's not enough. It will only reduce the rate at which this
> case happens. Well, in fact, in 10 years of production at many sites, it's
> the first time such an issue has been reported, and it could be tracked down to
> these two extra bytes. But the workaround will not prevent the two extra
> bytes from coming after the last recv().
> 
> Also, the issue remains when processing large POST requests. Let's suppose
> the application is receiving a massive POST (eg: 10 MB) but the request is
> not authenticated, so the application returns an HTTP 401 response to
> require the client to authenticate. There's no way for the application to
> be notified that the small response was completely read by the client and
> that it's safe to close().
> 
> For these reasons, I concluded that the application can't get everything
> right and needs help from the kernel (said differently, I think that the
> RFC2525 fix is causing harm in addition to good). In my opinion, this
> section in the RFC was added based on a few observations of trivial cases,
> but its impact was not completely explored.
> 
> I'm willing to experiment, but I'm not very familiar with the code itself
> and sometimes I'm not sure about what I'm doing, so some help would
> probably be welcome. What I'd like to do is to implement step 3 above,
> which is to only send the RST upon receipt of an ACK on an orphan that
> would switch a normal socket from FIN_WAIT1 to FIN_WAIT2.
> 
> Also, I'm not sure about what other OSes are doing. For instance, I tried
> on Solaris and did not observe the issue at all, though I think that
> Solaris simply does not implement the RFC2525 recommendation.
> 
> Have a nice sunday evening ;-)
> Willy
> 

I was referring to this code. It works well for me.

shutdown(fd, SHUT_RDWR);
while (recv(fd, buff, sizeof(buff), 0) > 0)
	;
close(fd);





* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 18:35     ` Eric Dumazet
@ 2010-09-26 18:49       ` Willy Tarreau
  2010-09-26 21:01         ` Eric Dumazet
  2010-09-26 22:10         ` David Miller
  0 siblings, 2 replies; 39+ messages in thread
From: Willy Tarreau @ 2010-09-26 18:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Sun, Sep 26, 2010 at 08:35:15PM +0200, Eric Dumazet wrote:
> I was referring to this code. It works well for me.
> 
> shutdown(fd, SHUT_RDWR);
> while (recv(fd, buff, sizeof(buff), 0) > 0)
> 	;
> close(fd);

Ah, this one yes, but it's overkill. We're actively pumping data from the
other side to drop it on the floor until it finally closes, while we only
need to know when it has ACKed the FIN. In practice, doing that on a POST
request which causes a 302 or 401 will result in the whole amount of data
being transferred twice. Not only is this bad for the bandwidth, it is
also bad for the user, as we're causing him to experience a complete upload
twice, just to be sure it has received the FIN, while it's pretty obvious
that it's not necessary in 99.9% of the cases.

Since this method is the least efficient one and clearly not acceptable
for practical cases, I wanted to dig at the root, where the information
is known. And the TCP recv code is precisely the place where we know
exactly when it's safe to reset.

Also, there's another issue in doing this. It requires polling of the
receive side for all requests, which adds one epoll_ctl() syscall and
one recv() call; these have a very noticeable negative performance
impact at high rates (at 100000 connections per second, every syscall
counts). For now I could very well consider that I do this only for
POST requests, which currently are the ones exhibiting the issue the
most, but since HTTP browsers will try to enable pipelining again
soon, the problem will generalize to all types of requests. Hence my
attempts to do it the optimal way.

Regards,
Willy



* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 17:40   ` Willy Tarreau
  2010-09-26 18:35     ` Eric Dumazet
@ 2010-09-26 19:16     ` Willy Tarreau
  2010-09-26 22:14       ` David Miller
  1 sibling, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2010-09-26 19:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Sun, Sep 26, 2010 at 07:40:15PM +0200, Willy Tarreau wrote:
> 3) when an ACK comes from the other side, either it's below our last
>    seq, and we simply ignore it, just as if we were in TIME_WAIT, or
>    it is equal to the last seq and indicates that it's now safe to
>    reset ; we would then just send the RST to notify the other side
>    that the data it sent were not read. The connection can then either
>    be destroyed or put in TIME_WAIT. It's the point where the connection
>    normally switches from FIN_WAIT1 to FIN_WAIT2, since the FIN has been
>    acked. The only difference is that we don't need a FIN_WAIT2 state
>    for an orphan.

In fact, the following patch seems to provide the expected result :

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f1813bc..ee5fa5d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1879,7 +1880,7 @@ void tcp_close(struct sock *sk, long timeout)
 	 * advertise a zero window, then kill -9 the FTP client, wheee...
 	 * Note: timeout is always zero in such a case.
 	 */
-	if (data_was_unread) {
+	if (data_was_unread && !tcp_sk(sk)->packets_out) {
 		/* Unread data was tossed, zap the connection. */
 		NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
 		tcp_set_state(sk, TCP_CLOSE);


Here's what I get if I call it that way :

$ (echo "req1";usleep 200000;echo "req2") | nc6 10.8.3.4 8000

(excuse me for the long lines, I've disabled timestamps for better readability
but they're still long).

21:11:43.379465 IP 10.8.3.1.59244 > 10.8.3.4.8000: Flags [S], seq 210901267, win 5840, options [mss 1460,sackOK,TS val 305301003 ecr 0,nop,wscale 7], length 0
21:11:43.379514 IP 10.8.3.4.8000 > 10.8.3.1.59244: Flags [S.], seq 3131672819, ack 210901268, win 5840, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
21:11:43.379866 IP 10.8.3.1.59244 > 10.8.3.4.8000: Flags [.], ack 3131672820, win 46, length 0
21:11:43.379922 IP 10.8.3.1.59244 > 10.8.3.4.8000: Flags [P.], ack 3131672820, win 46, length 5
21:11:43.379946 IP 10.8.3.4.8000 > 10.8.3.1.59244: Flags [.], ack 210901273, win 92, length 0
21:11:43.581305 IP 10.8.3.1.59244 > 10.8.3.4.8000: Flags [P.], ack 3131672820, win 46, length 5
21:11:43.581331 IP 10.8.3.4.8000 > 10.8.3.1.59244: Flags [.], ack 210901278, win 92, length 0
21:11:44.380442 IP 10.8.3.4.8000 > 10.8.3.1.59244: Flags [P.], ack 210901278, win 92, length 7
21:11:44.380506 IP 10.8.3.4.8000 > 10.8.3.1.59244: Flags [FP.], seq 3131672827:3131672834, ack 210901278, win 92, length 7
21:11:44.380649 IP 10.8.3.1.59244 > 10.8.3.4.8000: Flags [.], ack 3131672827, win 46, length 0
21:11:44.380672 IP 10.8.3.1.59244 > 10.8.3.4.8000: Flags [F.], seq 210901278, ack 3131672835, win 46, length 0
21:11:44.380685 IP 10.8.3.4.8000 > 10.8.3.1.59244: Flags [.], ack 210901279, win 92, length 0

Here we correctly got a FIN+PUSH for the last data instead of the reset.
The other side ACKs and after the last ACK, the socket is correctly in
TIME_WAIT state.

If I try to push more data after the connection closes, a reset is
correctly sent :

$ (echo "req1";usleep 1200000;echo "req2"; usleep 200000; echo "req3") | nc6 --half-close 10.8.3.4 8000

21:15:00.068376 IP 10.8.3.1.60849 > 10.8.3.4.8000: Flags [S], seq 3284326660, win 5840, options [mss 1460,sackOK,TS val 305350241 ecr 0,nop,wscale 7], length 0
21:15:00.068428 IP 10.8.3.4.8000 > 10.8.3.1.60849: Flags [S.], seq 1933854189, ack 3284326661, win 5840, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
21:15:00.068779 IP 10.8.3.1.60849 > 10.8.3.4.8000: Flags [.], ack 1933854190, win 46, length 0
21:15:00.068830 IP 10.8.3.1.60849 > 10.8.3.4.8000: Flags [P.], ack 1933854190, win 46, length 5
21:15:00.068854 IP 10.8.3.4.8000 > 10.8.3.1.60849: Flags [.], ack 3284326666, win 92, length 0
21:15:01.069238 IP 10.8.3.4.8000 > 10.8.3.1.60849: Flags [P.], ack 3284326666, win 92, length 7
21:15:01.069300 IP 10.8.3.4.8000 > 10.8.3.1.60849: Flags [FP.], seq 1933854197:1933854204, ack 3284326666, win 92, length 7
21:15:01.069448 IP 10.8.3.1.60849 > 10.8.3.4.8000: Flags [.], ack 1933854197, win 46, length 0
21:15:01.109938 IP 10.8.3.1.60849 > 10.8.3.4.8000: Flags [.], ack 1933854205, win 46, length 0
21:15:01.270509 IP 10.8.3.1.60849 > 10.8.3.4.8000: Flags [P.], ack 1933854205, win 46, length 5
21:15:01.270542 IP 10.8.3.4.8000 > 10.8.3.1.60849: Flags [R], seq 1933854205, win 0, length 0

I had already tested that previously but unfortunately the test was
combined with an attempt to zero the rx window, which had corrupted my
tests.

Would you have any problem with the proposed change above?

Thanks,
Willy


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 18:49       ` Willy Tarreau
@ 2010-09-26 21:01         ` Eric Dumazet
  2010-09-26 21:46           ` Willy Tarreau
  2010-09-26 22:10         ` David Miller
  1 sibling, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2010-09-26 21:01 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev

On Sunday, September 26, 2010 at 20:49 +0200, Willy Tarreau wrote:
> On Sun, Sep 26, 2010 at 08:35:15PM +0200, Eric Dumazet wrote:
> > I was referring to this code. It works well for me.
> > 
> > shutdown(fd, SHUT_RDWR);
> > while (recv(fd, buff, sizeof(buff), 0) > 0)
> > 	;
> > close(fd);
> 
> Ah, this one, yes, but it's overkill. We're actively pumping data from the
> other side to drop it on the floor until it finally closes, while we only
> need to know when it has ACKed the FIN. In practice, doing that on a POST
> request which causes a 302 or 401 will result in the whole amount of data
> being transferred twice. Not only is this bad for the bandwidth, it is
> also bad for the user, as we're causing him to experience a complete upload
> twice just to be sure it has received the FIN, while it's pretty obvious
> that it's not necessary in 99.9% of the cases.
> 

I don't understand how recv() could transfer data twice.

Once you issue shutdown(RDWR), the socket rx buffer is frozen and no more
incoming data should be accepted/queued.

You only read from the socket receive queue into a throwaway buffer, and
in most cases a single recv() call will be enough to drain the queue.


> Since this method is the least efficient one and clearly not acceptable
> for practical cases, I wanted to dig at the root, where the information
> is known. And the TCP recv code is precisely the place where we know
> exactly when it's safe to reset.
> 

And it's safe to reset exactly when the application does close(), if
unread data was not drained. Not only is it safe, it's required. A new RFC
might be needed?


> Also there's another issue in doing this. It requires polling of the
> receive side for all requests, which adds one epoll_ctl() syscall and
> one recv() call, which have a very noticeable negative performance
> impact at high rates (at 100000 connections per second, every syscall
> counts). For now I could very well consider that I do this only for
> POST requests, which currently are the ones exhibiting the issue the
> most, but since HTTP browsers will try to enable pipelining again
> soon, the problem will generalize to all types of requests. Hence my
> attempts to do it the optimal way.

This might be overkill, but it is a portable way of doing this, on all
versions of Linux.

Sure, you could patch the kernel to add a new system call

close_proper(fd);

As shutdown() only uses two bits, you could possibly add another bit to
flush the receive queue as well (to avoid copying it).

Another question is: why does the server close the socket, if you believe
more pipelining is coming soon?





* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 21:01         ` Eric Dumazet
@ 2010-09-26 21:46           ` Willy Tarreau
  2010-09-26 22:19             ` David Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2010-09-26 21:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Sun, Sep 26, 2010 at 11:01:11PM +0200, Eric Dumazet wrote:
> On Sunday, September 26, 2010 at 20:49 +0200, Willy Tarreau wrote:
> > On Sun, Sep 26, 2010 at 08:35:15PM +0200, Eric Dumazet wrote:
> > > I was referring to this code. It works well for me.
> > > 
> > > shutdown(fd, SHUT_RDWR);
> > > while (recv(fd, buff, sizeof(buff), 0) > 0)
> > > 	;
> > > close(fd);
> > 
> > Ah, this one, yes, but it's overkill. We're actively pumping data from the
> > other side to drop it on the floor until it finally closes, while we only
> > need to know when it has ACKed the FIN. In practice, doing that on a POST
> > request which causes a 302 or 401 will result in the whole amount of data
> > being transferred twice. Not only is this bad for the bandwidth, it is
> > also bad for the user, as we're causing him to experience a complete upload
> > twice just to be sure it has received the FIN, while it's pretty obvious
> > that it's not necessary in 99.9% of the cases.
> > 
> 
> I don't understand how recv() could transfer data twice.

That's not what I said: I said the client would have to retransfer. Here's
what typically happens, for instance with an authentication requirement:

Client                                Server
           SYN ----------->
               <----------- SYN/ACK
           ACK ----------->

POST /some_url HTTP/1.1
Host: xxx
Content-length: 10 meg
             ------ headers are sent ---->
xxxxxxxxxxxxx
xxxxxxxxxxxxx ----- data are being sent ->  HTTP/1.1 401 Unauthorized
xxxxxxxxxxxxx                               WWW-Authenticate: basic realm="xxx"
...                                         Connection: close
             <----------------------------
xxxxxxxxxxxxx
             <---------------------------- FIN
xxxxxxxxxxxxx
...
10 megs of data being sent and drained by the server
xxxxxxxxxxxxx
             FIN --------------->
                 <--------------  ACK


second attempt with credentials this time

           SYN ----------->
               <----------- SYN/ACK
           ACK ----------->

POST /some_url HTTP/1.1
Host: x
authorization: basic xyz
Content-length: 10 meg
             ------ headers are sent ---->
xxxxxxxxxxxxx
xxxxxxxxxxxxx
xxxxxxxxxxxxx
etc...


So in this case the data is effectively transmitted twice. With an
RST sent once the client acks the FIN, the first transfer would abort
very early instead, saving half of the bandwidth.

> You only read from the socket receive queue to null buffer, and in most
> cases a single recv() call will be enough to drain the queue.

Indeed, in *most* cases, just as right now in most cases there is no
problem. As I said, this is the first reported issue in 10 years and hundreds
of billions of connections accumulated on various sites. But instead of really
fixing the issue, it just reduces its occurrences. Also, it only works
for low-bandwidth clients (the most common case too). That's what I'm going to
implement anyway, but it is an unreliable workaround. All I know is that
it will probably divide by 10 the number of times this problem is encountered,
but it will not fix it.
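That workaround can be sketched with a bounded wait, for illustration (the helper name and the timeout value are my own choices, not haproxy's actual implementation):

```c
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Best-effort lingering close: send our FIN with SHUT_WR, then keep
 * draining whatever the peer sends for at most `timeout_ms`, so a
 * slow trailing packet (e.g. the extra CRLF after a POST body) does
 * not turn our pending output into an RST. Unreliable by design:
 * a packet arriving after the timeout still triggers the reset. */
static int close_lingering(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char buf[4096];

    shutdown(fd, SHUT_WR);              /* emit FIN, keep reading */
    while (poll(&pfd, 1, timeout_ms) > 0) {
        if (recv(fd, buf, sizeof(buf), 0) <= 0)
            break;                      /* EOF or error: peer is done */
    }
    return close(fd);
}
```

This is the extra epoll_ctl()/recv() cost discussed above: every connection now pays at least one more poll and one more read before it can be closed.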

> > Since this method is the least efficient one and clearly not acceptable
> > for practical cases, I wanted to dig at the root, where the information
> > is known. And the TCP recv code is precisely the place where we know
> > exactly when it's safe to reset.
> > 
> 
> And it's safe to reset exactly when the application does close(), if
> unread data was not drained. Not only is it safe, it's required. A new RFC
> might be needed?

I'm not requesting a new RFC, I'm just trying to make correct use
of orphans as implemented in the Linux stack, and I'm realizing that
since RFC 2525 was implemented, orphans cannot be relied on at all anymore.
We could simply delete all the orphans code and emit an RST immediately upon
close(); there is no safe use of them now. And that's my concern. In my
opinion, the code is there and was written precisely for that usage. Since
I'm seeing that it can't be used for what it was designed for, I'm naturally
interested in trying to get it usable again. And in fact, when I really
want an RST, I can already get one by disabling lingering before the
close(). This too shows that close() by default protects orphaned data.
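The lingering knob referred to here is the standard SO_LINGER socket option; a minimal sketch of that abortive close (plain sockets API, nothing haproxy-specific):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Abortive close: with l_onoff=1 and l_linger=0, close() discards any
 * unsent output and, on TCP, emits an immediate RST instead of the
 * normal FIN + lingering sequence. */
static int close_abortive(int fd)
{
    struct linger lin = { .l_onoff = 1, .l_linger = 0 };

    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));
    return close(fd);
}
```

The point being made is the contrast: this opt-in path already exists for an immediate RST, so the default close() was presumably meant to protect the orphaned data.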

> > Also there's another issue in doing this. It requires polling of the
> > receive side for all requests, which adds one epoll_ctl() syscall and
> > one recv() call, which have a very noticeable negative performance
> > impact at high rates (at 100000 connections per second, every syscall
> > counts). For now I could very well consider that I do this only for
> > POST requests, which currently are the ones exhibiting the issue the
> > most, but since HTTP browsers will try to enable pipelining again
> > soon, the problem will generalize to all types of requests. Hence my
> > attempts to do it the optimal way.
> 
> This might be overkill but is a portable way of doing this, on all
> versions of linux.

I'm not discussing portability at all. You see, right now Linux is
by some margin the fastest platform to run haproxy on, and the one I always
recommend for that. Some people experience good performance on FreeBSD
too, but the fine-grained control we have on Linux helps maintain a
high level of performance by avoiding many unnecessary steps when we
can trust the OS to do some things correctly. Having workarounds for
some versions that we know don't work as expected is not an issue, and
it's even a good reason to sometimes make people upgrade. But having to
cut performance in half under Linux on some workloads because userland
can't know something obvious that the OS knows is a bit of a waste.

> Sure, you could patch the kernel to add a new system call
> 
> close_proper(fd);

That would just be the normal close() (the one with lingering enabled)
in theory.

> As shutdown() only uses two bits, you could possibly add another bit to
> flush the receive queue as well (to avoid copying it)

This is a good idea, but it will still leave some incorrectly handled
cases where the other side has time to send a few packets between the
shutdown() and the close().

> Another question is: why does the server close the socket, if you believe
> more pipelining is coming soon?

There are quite a few situations in HTTP where you have no solution
other than closing. All responses that don't advertise a length must be
terminated by a close. Haproxy is a reverse proxy, so it sits between
the client and the server. When a server sends such responses, haproxy
must forward the close to the client, regardless of what's in the request
buffer. Also, some response codes require a close. A 400 Bad Request,
for instance, implies a mandatory close (as do many 4xx and 5xx).
All redirects (301/302/303/307) should lead to a close if the target is
another site.

Even though we optimize for the most common cases, that doesn't save us
from having to support the legacy cases.

Regards,
Willy



* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 18:49       ` Willy Tarreau
  2010-09-26 21:01         ` Eric Dumazet
@ 2010-09-26 22:10         ` David Miller
  1 sibling, 0 replies; 39+ messages in thread
From: David Miller @ 2010-09-26 22:10 UTC (permalink / raw)
  To: w; +Cc: eric.dumazet, netdev


TCP is a reliable transport.

If you choose to close before the data from the peer has been fully
received, you've subverted the reliability of the data transport.

Reset is the only reasonable action in this situation.


* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 13:17 TCP: orphans broken by RFC 2525 #2.17 Willy Tarreau
  2010-09-26 17:02 ` Eric Dumazet
@ 2010-09-26 22:13 ` David Miller
  2010-09-26 22:34   ` Willy Tarreau
  2010-09-27  8:02 ` Herbert Xu
  2 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2010-09-26 22:13 UTC (permalink / raw)
  To: w; +Cc: netdev

From: Willy Tarreau <w@1wt.eu>
Date: Sun, 26 Sep 2010 15:17:17 +0200

> I've read RFC 2525 #2.17 and it shows quite interesting examples of what
> it wanted to protect against. However, the recommendation did not consider
> the fact that there could be some unacked pending data in the outgoing
> buffers.

It doesn't matter whether there is any pending outgoing data when
we receive data after close().

The issue is that the reliable transport nature of TCP has been
violated, and as such the entire connection's reliability has
been compromised.

The only appropriate response is a full reset.

As Eric said, your only option is to fully sync the data coming
from the peer.


* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 19:16     ` Willy Tarreau
@ 2010-09-26 22:14       ` David Miller
  0 siblings, 0 replies; 39+ messages in thread
From: David Miller @ 2010-09-26 22:14 UTC (permalink / raw)
  To: w; +Cc: eric.dumazet, netdev

From: Willy Tarreau <w@1wt.eu>
Date: Sun, 26 Sep 2010 21:16:32 +0200

> On Sun, Sep 26, 2010 at 07:40:15PM +0200, Willy Tarreau wrote:
>> 3) when an ACK comes from the other side, either it's below our last
>>    seq, and we simply ignore it, just as if we were in TIME_WAIT, or
>>    it is equal to the last seq and indicates that it's now safe to
>>    reset ; we would then just send the RST to notify the other side
>>    that the data it sent were not read. The connection can then either
>>    be destroyed or put in TIME_WAIT. It's the point where the connection
>>    normally switches from FIN_WAIT1 to FIN_WAIT2, since the FIN has been
>>    acked. The only difference is that we don't need a FIN_WAIT2 state
>>    for an orphan.
> 
> In fact, the following patch seems to provide the expected result :

There is no way I'm applying this.

Pending outgoing data means nothing, and not sending out the RST is
wrong because there will be no notification to the peer that you
decided to close and ignore the data stream the peer sent, and thus
data was lost.


* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 21:46           ` Willy Tarreau
@ 2010-09-26 22:19             ` David Miller
  0 siblings, 0 replies; 39+ messages in thread
From: David Miller @ 2010-09-26 22:19 UTC (permalink / raw)
  To: w; +Cc: eric.dumazet, netdev

From: Willy Tarreau <w@1wt.eu>
Date: Sun, 26 Sep 2010 23:46:14 +0200

> Even though we optimize for the most common cases, that doesn't save us
> from having to support the legacy cases.

There is nothing "legacy" about performing a proper reset when data
was lost.

Otherwise the peer has every right to believe that the data it sent
was sinked all the way to the remote application.

Which it wasn't.

Reset is the only appropriate action in this situation.

If the application layer protocol you're dealing with is so broken
that a multi-megabyte transfer happens even when it gets an error
indication, that really isn't the kernel's problem, and it is very clear
that the "fix" lies in the application layer protocol and its handling.


* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 22:13 ` David Miller
@ 2010-09-26 22:34   ` Willy Tarreau
  2010-09-26 22:38     ` David Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2010-09-26 22:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Hi David,

On Sun, Sep 26, 2010 at 03:13:46PM -0700, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Sun, 26 Sep 2010 15:17:17 +0200
> 
> > I've read RFC 2525 #2.17 and it shows quite interesting examples of what
> > it wanted to protect against. However, the recommendation did not consider
> > the fact that there could be some unacked pending data in the outgoing
> > buffers.
> 
> It doesn't matter whether there is any pending outgoing data when
> we receive data after close().
> 
> The issue is that the reliable transport nature of TCP has been
> violated, and as such the entire connection's reliability has
> been compromised.

I don't see what is being violated nor what reliability has been
compromised. Some data has reliably been delivered to the kernel,
a shutdown() has reliably been performed, followed by a close()
which then makes use of the orphans mechanism to deliver the data
to the other side.

> The only appropriate response is a full reset.

I'm not against the full reset, I really want it, but I don't want
it to be sent before the previous data. The reset here is being sent
out of order and kills the data that were previously going to be
delivered. It's all about a timing issue BTW. If the other side sends
its data slightly later, it reliably receives its response followed by
a reset it understands and respects. So this is a valid working
mechanism. I'm just trying to ensure that the reset does not break
ordering and is delivered after the pending data.

> As Eric said, your only option is to fully sync the data coming
> from the peer.

Then why are we keeping the orphans feature and not getting rid of it?
It becomes completely useless, and we might as well disable all the
lingering options, which have no use beyond that.

The lingering controls how long unacked data remains in an orphaned
socket, which is precisely my case. If the application has no way to
know whether it can close, all this becomes useless.

I can easily admit I'm doing something wrong, but here we have a
feature that I think I'm correctly using and it does not always
work, and I can't know when I can use it, so I should not use it.
Obviously, either I'm missing something or there's something wrong.
And that's what I'd like to find out.

Thanks,
Willy



* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 22:34   ` Willy Tarreau
@ 2010-09-26 22:38     ` David Miller
  2010-09-26 22:54       ` Willy Tarreau
  0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2010-09-26 22:38 UTC (permalink / raw)
  To: w; +Cc: netdev

From: Willy Tarreau <w@1wt.eu>
Date: Mon, 27 Sep 2010 00:34:48 +0200

> I don't see what is being violated nor what reliability has been
> compromised.

The TCP protocol's obligation to reliably deliver data between
two applications, that is what has been violated.

You don't have to like this, but you certainly must cope with it
in your applications :-)


* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 22:38     ` David Miller
@ 2010-09-26 22:54       ` Willy Tarreau
  2010-09-26 23:08         ` David Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2010-09-26 22:54 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Sun, Sep 26, 2010 at 03:38:32PM -0700, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 27 Sep 2010 00:34:48 +0200
> 
> > I don't see what is being violated nor what reliability has been
> > compromised.
> 
> The TCP protocol's obligation to reliably deliver data between
> two applications, that is what has been violated.

Once again, I don't see why, due to the orphans mechanism. Please
consider for a minute that the application-level close() is distinct
from the protocol-level close. The application-level close() just
instructs the lower layer to turn the connection into an orphan. Any
pending data is sent. Incoming data may accumulate until the receive
buffer is full, but at least the other end regularly acks what is
being sent. Then once the other end has acked the orphaned data, we
perform the protocol-level close, which consists in either switching
the connection to TIME_WAIT if there are no pending data, or to send
an RST if there are pending data.

> You don't have to like this, but you certainly must cope with it
> in your applications :-)

It's not a matter of liking it or not, but of using something reliable.
Orphans are supposed to be (as long as there isn't any memory shortage).
But they're not. And my concern is that the information my app
could use for a reliable close exists in the kernel but is not
usable from the application level.

In fact, the end result is that we're observing exactly the opposite
of what RFC 2525 wanted to achieve: its goal was to perform early
resets on aborts in order not to have to transfer too much data, but
the way we have it makes it mandatory to read everything because some
information is lost between the kernel and the application.

Willy



* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 22:54       ` Willy Tarreau
@ 2010-09-26 23:08         ` David Miller
  2010-09-26 23:25           ` Willy Tarreau
  0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2010-09-26 23:08 UTC (permalink / raw)
  To: w; +Cc: netdev

From: Willy Tarreau <w@1wt.eu>
Date: Mon, 27 Sep 2010 00:54:40 +0200

> On Sun, Sep 26, 2010 at 03:38:32PM -0700, David Miller wrote:
>> From: Willy Tarreau <w@1wt.eu>
>> Date: Mon, 27 Sep 2010 00:34:48 +0200
>> 
>> > I don't see what is being violated nor what reliability has been
>> > compromised.
>> 
>> The TCP protocol's obligation to reliably deliver data between
>> two applications, that is what has been violated.
> 
> Once again, I don't see why, due to the orphans mechanism. Please
> consider for a minute that the application-level close() is distinct
> from the protocol-level close. The application-level close() just
> instructs the lower layer to turn the connection into an orphan.

A close() is equivalent to a shutdown() with both the send and
receive masks set.

You are telling TCP that you expect no more data to be received.


* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 23:08         ` David Miller
@ 2010-09-26 23:25           ` Willy Tarreau
  2010-09-27  1:12             ` David Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2010-09-26 23:25 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Sun, Sep 26, 2010 at 04:08:38PM -0700, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 27 Sep 2010 00:54:40 +0200
> 
> > On Sun, Sep 26, 2010 at 03:38:32PM -0700, David Miller wrote:
> >> From: Willy Tarreau <w@1wt.eu>
> >> Date: Mon, 27 Sep 2010 00:34:48 +0200
> >> 
> >> > I don't see what is being violated nor what reliability has been
> >> > compromised.
> >> 
> >> The TCP protocol's obligation to reliably deliver data between
> >> two applications, that is what has been violated.
> > 
> > Once again, I don't see why, due to the orphans mechanism. Please
> > consider for a minute that the application-level close() is distinct
> > from the protocol-level close. The application-level close() just
> > instructs the lower layer to turn the connection into an orphan.
> 
> A close() is equivalent to a shutdown() with both the send and
> receive masks set.
> 
> You are telling TCP that you expect no more data to be received.

Agreed. But that's not a reason for killing outgoing data that is
being sent when there are some data left in the rcv buffer.

Honestly David, after some thinking, can you still find a valid use
for the orphans as they are now? I personally fail to. And what
drove me to the kernel on this issue is that I found the behaviour
inconsistent with the principle of the orphan itself.

Willy



* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 23:25           ` Willy Tarreau
@ 2010-09-27  1:12             ` David Miller
  2010-09-27  5:39               ` Willy Tarreau
  0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2010-09-27  1:12 UTC (permalink / raw)
  To: w; +Cc: netdev

From: Willy Tarreau <w@1wt.eu>
Date: Mon, 27 Sep 2010 01:25:30 +0200

> Agreed. But that's not a reason for killing outgoing data that is
> being sent when there are some data left in the rcv buffer.

What alternative notification to the peer do you suggest other than a
reset, then?  TCP gives us no other.

That's the thing, data integrity is full duplex, thus once it has been
compromised in one direction everything currently in flight must be
zapped.


* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  1:12             ` David Miller
@ 2010-09-27  5:39               ` Willy Tarreau
  2010-09-27  5:48                 ` Eric Dumazet
  2010-09-27  6:42                 ` David Miller
  0 siblings, 2 replies; 39+ messages in thread
From: Willy Tarreau @ 2010-09-27  5:39 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Sun, Sep 26, 2010 at 06:12:02PM -0700, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 27 Sep 2010 01:25:30 +0200
> 
> > Agreed. But that's not a reason for killing outgoing data that is
> > being sent when there are some data left in the rcv buffer.
> 
> What alternative notification to the peer do you suggest other than a
> reset, then?  TCP gives us no other.

I know, and I agree to send the reset, but only after the data are correctly
transferred. This reset's purpose is only to inform the other side that
the data it sent were destroyed. There is no requirement to tell it
whether they were destroyed earlier or later. What matters is that it
is informed they were destroyed.

That's why I think that it is perfectly reasonable to either destroy them
after the ACK or simply notify about their destruction after the ACK.

Instead of having :

        A                                               B

       --->     <SEQ=100><ACK=300>                     --->
       <---     <SEQ=300><ACK=100><DATA=10>            <---
       --->     <SEQ=100><ACK=310>                     --->
       send(100)
       shutdown()
       close()
       --->     <SEQ=100><CTL=RST>                     --->

We would just have :

        A                                               B

       --->     <SEQ=100><ACK=300>                     --->
       <---     <SEQ=300><ACK=100><DATA=10>            <---
       --->     <SEQ=100><ACK=310>                     --->
       send(100)
       shutdown()
       close()
       --->     <SEQ=100><ACK=310><DATA=100><CTL=FIN>  --->
       <---     <SEQ=300><ACK=111>                     <---
       --->     <SEQ=111><CTL=RST>                     --->

Note that the notification is exactly the same as if we wanted
to notify B about the destruction of data that were sent just
after the close, because the RST only carries a SEQ field and
no ACK indicating what it destroyed :

        A                                               B

       --->     <SEQ=100><ACK=300>                     --->
       send(100)
       shutdown()
       --->     <SEQ=100><ACK=310><DATA=100><CTL=FIN>  --->
       <---     <SEQ=300><ACK=111><DATA=10>            <---
       close()
       --->     <SEQ=111><CTL=RST>                     --->

In my opinion, the last two examples are perfectly valid; they just mean
"after this, I close and don't want to hear from you again".

> That's the thing, data integrity is full duplex, thus once it has been
> compromised in one direction everything currently in flight must be
> zapped.

I'm well aware of that, and even though it's an annoying method, we
must live with it; it's probably one of the things that contribute to TCP's
well-known reliability. But I think that RFC 2525 over-constrained TCP
based on traces showing one bad behaviour, and overlooked other impacts
(nothing there talks about the case of data being sent or in flight at
the moment of the close).

Regards,
Willy



* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  5:39               ` Willy Tarreau
@ 2010-09-27  5:48                 ` Eric Dumazet
  2010-09-27  6:04                   ` Willy Tarreau
  2010-09-27  6:44                   ` David Miller
  2010-09-27  6:42                 ` David Miller
  1 sibling, 2 replies; 39+ messages in thread
From: Eric Dumazet @ 2010-09-27  5:48 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: David Miller, netdev

On Monday, September 27, 2010 at 07:39 +0200, Willy Tarreau wrote:
> On Sun, Sep 26, 2010 at 06:12:02PM -0700, David Miller wrote:
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Mon, 27 Sep 2010 01:25:30 +0200
> > 
> > > Agreed. But that's not a reason for killing outgoing data that is
> > > being sent when there are some data left in the rcv buffer.
> > 
> > What alternative notification to the peer do you suggest other than a
> > reset, then?  TCP gives us no other.
> 
> I know, and I agree to send the reset, but after the data are correctly
> transferred. This reset's purpose is only to inform the other side that
> the data it sent were destroyed. It is not a requirement to tell it they
> were destroyed earlier or later. What matters is that it's informed they
> were destroyed.
> 
> That's why I think that it is perfectly reasonable to either destroy them
> after the ACK or simply notify about their destruction after the ACK.
> 
> Instead of having :
> 
>         A                                               B
> 
>        --->     <SEQ=100><ACK=300>                     --->
>        <---     <SEQ=300><ACK=100><DATA=10>            <---
>        --->     <SEQ=100><ACK=310>                     --->
>        send(100)
>        shutdown()
>        close()
>        --->     <SEQ=100><CTL=RST>                     --->
> 
> We would just have :
> 
>         A                                               B
> 
>        --->     <SEQ=100><ACK=300>                     --->
>        <---     <SEQ=300><ACK=100><DATA=10>            <---
>        --->     <SEQ=100><ACK=310>                     --->
>        send(100)
>        shutdown()
>        close()
>        --->     <SEQ=100><ACK=310><DATA=100><CTL=FIN>  --->
>        <---     <SEQ=300><ACK=111>                     <---
>        --->     <SEQ=111><CTL=RST>                     --->
> 
> Note that the notification is exactly the same as if we wanted
> to notify B about the destruction of data that were sent just
> after the close, because the RST only carries a SEQ field and
> no ACK indicating what it destroyed :
> 
>         A                                               B
> 
>        --->     <SEQ=100><ACK=300>                     --->
>        send(100)
>        shutdown()
>        --->     <SEQ=100><ACK=310><DATA=100><CTL=FIN>  --->
>        <---     <SEQ=300><ACK=111><DATA=10>            <---
>        close()
>        --->     <SEQ=111><CTL=RST>                     --->
> 
> In my opinion, last two examples are perfectly valid, they just mean
> "after that, I close and don't want to hear about you again".
> 
> > That's the thing, data integrity is full duplex, thus once it has been
> > compromised in one direction everything currently in flight must be
> > zapped.
> 
> I'm well aware of that, and even though that's an annoying method, we
> must live with it, it's probably one of the things that contribute TCP
> its well known reliability. But I think that RFC 2525 abused the TCP
> use based on traces showing a bad behaviour and overlooked all impacts
> (nothing there talks about the case of data being sent or in flight at
> the moment of the close).

If you can cook up a patch that makes sure the RST is sent, just do so.

Your previous attempt was wrong, since the RST was sent only if the client
sent "req3".

If it sent only "req1" and "req2", req2 was unread and still no RST was sent.

This is an RFC violation.

It's a bit tricky, because you cannot send the FIN flag on the last
segment; you have to wait for the final ACK coming from the client, and
only then send an RST.





* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  5:48                 ` Eric Dumazet
@ 2010-09-27  6:04                   ` Willy Tarreau
  2010-09-27  6:44                   ` David Miller
  1 sibling, 0 replies; 39+ messages in thread
From: Willy Tarreau @ 2010-09-27  6:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Mon, Sep 27, 2010 at 07:48:24AM +0200, Eric Dumazet wrote:
> On Monday, September 27, 2010 at 07:39 +0200, Willy Tarreau wrote:
> > On Sun, Sep 26, 2010 at 06:12:02PM -0700, David Miller wrote:
> > > From: Willy Tarreau <w@1wt.eu>
> > > Date: Mon, 27 Sep 2010 01:25:30 +0200
> > > 
> > > > Agreed. But that's not a reason for killing outgoing data that is
> > > > being sent when there are some data left in the rcv buffer.
> > > 
> > > What alternative notification to the peer do you suggest other than a
> > > reset, then?  TCP gives us no other.
> > 
> > I know, and I agree to send the reset, but after the data are correctly
> > transferred. This reset's purpose is only to inform the other side that
> > the data it sent were destroyed. It is not a requirement to tell it they
> > were destroyed earlier or later. What matters is that it's informed they
> > were destroyed.
> > 
> > That's why I think that it is perfectly reasonable to either destroy them
> > after the ACK or simply notify about their destruction after the ACK.
> > 
> > Instead of having :
> > 
> >         A                                               B
> > 
> >        --->     <SEQ=100><ACK=300>                     --->
> >        <---     <SEQ=300><ACK=100><DATA=10>            <---
> >        --->     <SEQ=100><ACK=310>                     --->
> >        send(100)
> >        shutdown()
> >        close()
> >        --->     <SEQ=100><CTL=RST>                     --->
> > 
> > We would just have :
> > 
> >         A                                               B
> > 
> >        --->     <SEQ=100><ACK=300>                     --->
> >        <---     <SEQ=300><ACK=100><DATA=10>            <---
> >        --->     <SEQ=100><ACK=310>                     --->
> >        send(100)
> >        shutdown()
> >        close()
> >        --->     <SEQ=100><ACK=310><DATA=100><CTL=FIN>  --->
> >        <---     <SEQ=300><ACK=111>                     <---
> >        --->     <SEQ=111><CTL=RST>                     --->
> > 
> > Note that the notification is exactly the same as if we wanted
> > to notify B about the destruction of data that were sent just
> > after the close, because the RST only carries a SEQ field and
> > no ACK indicating what it destroyed :
> > 
> >         A                                               B
> > 
> >        --->     <SEQ=100><ACK=300>                     --->
> >        send(100)
> >        shutdown()
> >        --->     <SEQ=100><ACK=310><DATA=100><CTL=FIN>  --->
> >        <---     <SEQ=300><ACK=111><DATA=10>            <---
> >        close()
> >        --->     <SEQ=111><CTL=RST>                     --->
> > 
> > In my opinion, the last two examples are perfectly valid; they just mean
> > "after that, I close and don't want to hear from you again".
> > 
> > > That's the thing, data integrity is full duplex, thus once it has been
> > > compromised in one direction everything currently in flight must be
> > > zapped.
> > 
> > I'm well aware of that, and even though that's an annoying method, we
> > must live with it; it's probably one of the things that give TCP its
> > well-known reliability. But I think that RFC 2525 overreached, basing
> > its rule on traces showing bad behaviour, and overlooked the impacts
> > (nothing there talks about the case of data being sent or in flight at
> > the moment of the close).
> 
> If you can cook a patch that makes sure the RST is sent, just do so.
> 
> Your previous attempt was wrong, since the RST was sent only if the
> client sent "req3".
> 
> If it sent only "req1" and "req2", req2 was unread and still no RST
> was sent.
> 
> This is an RFC violation.

OK, now I see your point and you're right. However, req3 is not required
in my tests. The simple fact of acknowledging the response causes the
RST to be emitted. But if the client sends the FIN first, then it's
true that there won't be an RST.

> It's a bit tricky, because you cannot send the FIN flag on the last
> segment; you have to wait for the final ACK coming from the client
> before finally sending an RST.

Yes, that's what I was initially looking for; then I thought it was
OK to send the FIN too. But you're right, we don't want to send it
if the client has already sent one, otherwise it won't be informed
about the error.

So basically that means not sending the FIN when in CLOSE_WAIT or
LAST_ACK with unread data, so that we can send the RST to the client
once it ACKs our data.

I'll think about it, thanks for the brainstorming.

Willy


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  5:39               ` Willy Tarreau
  2010-09-27  5:48                 ` Eric Dumazet
@ 2010-09-27  6:42                 ` David Miller
  2010-09-27  7:34                   ` Willy Tarreau
  1 sibling, 1 reply; 39+ messages in thread
From: David Miller @ 2010-09-27  6:42 UTC (permalink / raw)
  To: w; +Cc: netdev

From: Willy Tarreau <w@1wt.eu>
Date: Mon, 27 Sep 2010 07:39:01 +0200

> On Sun, Sep 26, 2010 at 06:12:02PM -0700, David Miller wrote:
>> From: Willy Tarreau <w@1wt.eu>
>> Date: Mon, 27 Sep 2010 01:25:30 +0200
>> 
>> > Agreed. But that's not a reason for killing outgoing data that is
>> > being sent when there are some data left in the rcv buffer.
>> 
>> What alternative notification to the peer do you suggest other than a
>> reset, then?  TCP gives us no other.
> 
> I know, and I agree to send the reset, but after the data are correctly
> transferred. This reset's purpose is only to inform the other side that
> the data it sent were destroyed. It is not a requirement to tell it they
> were destroyed earlier or later. What matters is that it's informed they
> were destroyed.

So you want us to hold onto the full connection state for however
long it takes to send the pending data, just because your application
doesn't want to wait around to sink a pending newline character?

Is that what this boils down to?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  5:48                 ` Eric Dumazet
  2010-09-27  6:04                   ` Willy Tarreau
@ 2010-09-27  6:44                   ` David Miller
  1 sibling, 0 replies; 39+ messages in thread
From: David Miller @ 2010-09-27  6:44 UTC (permalink / raw)
  To: eric.dumazet; +Cc: w, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Sep 2010 07:48:24 +0200

> It's a bit tricky, because you cannot send the FIN flag on the last
> segment; you have to wait for the final ACK coming from the client
> before finally sending an RST.

This is going to be incredibly complicated, for something in which I
see next to zero value.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  6:42                 ` David Miller
@ 2010-09-27  7:34                   ` Willy Tarreau
  2010-09-27  7:42                     ` David Miller
  2010-09-27  9:12                     ` Julian Anastasov
  0 siblings, 2 replies; 39+ messages in thread
From: Willy Tarreau @ 2010-09-27  7:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Sun, Sep 26, 2010 at 11:42:02PM -0700, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 27 Sep 2010 07:39:01 +0200
> 
> > On Sun, Sep 26, 2010 at 06:12:02PM -0700, David Miller wrote:
> >> From: Willy Tarreau <w@1wt.eu>
> >> Date: Mon, 27 Sep 2010 01:25:30 +0200
> >> 
> >> > Agreed. But that's not a reason for killing outgoing data that is
> >> > being sent when there are some data left in the rcv buffer.
> >> 
> >> What alternative notification to the peer do you suggest other than a
> >> reset, then?  TCP gives us no other.
> > 
> > I know, and I agree to send the reset, but after the data are correctly
> > transferred. This reset's purpose is only to inform the other side that
> > the data it sent were destroyed. It is not a requirement to tell it they
> > were destroyed earlier or later. What matters is that it's informed they
> > were destroyed.
> 
> So you want us to hold onto the full connection state for however
> long it takes to send the pending data

Not for however long it takes, just as we do right now with orphans, nothing
more, nothing less.

> just because your application
> doesn't want to wait around to sink a pending newline character?

it's not that it *doesn't want* to wait for the pending newline character,
it's that this character has no reason to be there and cannot be predicted,
and even when you find it, nothing tells the application that it's the last
one.

> Is that what this boils down to?

No, it's the opposite in fact: the goal is to ensure we can reliably
release the whole connection ASAP instead of being forced to sink any
possible data that may come from it, data that will neither be consumed
nor lead to a reset. Look:

case A (current one) :
   we send the response to the client from an orphaned connection.
   Most of the time, the client won't have any issue and will get the
   response. In some rare circumstances, some data sent by the client
   after the response causes an RST to be emitted, which may destroy
   in-flight data. Those issues are extremely rare, but they still
   happen.

case B (my proposal, and was the case before the RFC2525 fix) :
   we send the response to the client.
   it acks it
   we send an RST. End of the transfer. Total time: 50ms (avg RTT over ADSL).

case C (alternative) :
   we send the response to the client.
   the application can't know the client has acked it, and must keep
   the connection open for however long is necessary to get the only
   form of ACK the application can detect: the FIN from the client,
   which takes 6 minutes on my ADSL line for 10 megs.

In case C, not only the state remains *a lot* longer, but the bandwidth
usage is much worse, and in the end the client does not even get the reset
that we're trying to ensure it gets to indicate that the data were dropped.

So while case C is a reliable workaround, it's the least efficient method
and the most expensive one in terms of memory, CPU, network bandwidth,
socket usage, file descriptor usage and perceived time.

You see, I'm not trying to do dirty, dangerous things to save a few
lines of code. I'm even OK with having a lot of Linux-specific code to
make use of the features the Linux stack provides that make it more
efficient than other implementations. I'm just seeking reliability.

Willy


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  7:34                   ` Willy Tarreau
@ 2010-09-27  7:42                     ` David Miller
  2010-09-27 19:21                       ` Willy Tarreau
  2010-09-27  9:12                     ` Julian Anastasov
  1 sibling, 1 reply; 39+ messages in thread
From: David Miller @ 2010-09-27  7:42 UTC (permalink / raw)
  To: w; +Cc: netdev

From: Willy Tarreau <w@1wt.eu>
Date: Mon, 27 Sep 2010 09:34:43 +0200

> On Sun, Sep 26, 2010 at 11:42:02PM -0700, David Miller wrote:
>> just because your application
>> doesn't want to wait around to sink a pending newline character?
> 
> it's not that it *doesn't want* to wait for the pending newline character,
> it's that this character has no reason to be there and cannot be predicted,
> and even when you find it, nothing tells the application that it's the last
> one.
> 
>> Is that what this boils down to?
> 
> No, it's the opposite in fact: the goal is to ensure we can reliably
> release the whole connection ASAP instead of being forced to sink any
> possible data that may come from it, data that will neither be consumed
> nor lead to a reset. Look:

I still think it's completely broken that you want to close a
connection for which data is still going to arrive.

And I really can't think of why this can't be solved at the
application level.

Either there is an application level ACK that you have to wait for
anyways, or there isn't and you receive the entire request packet
before you start sending the data.

If there is some kind of unpredictable "dribbling" after the request
arrives, you really have to fix that.

I honestly have no sympathy for an application level protocol that
works this way, and I don't think the kernel is the place where this
should be handled.

Please don't exhaust me any further on this issue, thank you.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-26 13:17 TCP: orphans broken by RFC 2525 #2.17 Willy Tarreau
  2010-09-26 17:02 ` Eric Dumazet
  2010-09-26 22:13 ` David Miller
@ 2010-09-27  8:02 ` Herbert Xu
  2010-09-27 20:00   ` Willy Tarreau
  2 siblings, 1 reply; 39+ messages in thread
From: Herbert Xu @ 2010-09-27  8:02 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev, eric.dumazet, David S. Miller

Willy Tarreau <w@1wt.eu> wrote:
>
> Looking more closely, I noticed that in traces showing the issue,
> the client was sending an additional CRLF after the data in a
> separate packet (permitted eventhough not recommended).

Where is this permitted? RFC2616 says:

	Certain buggy HTTP/1.0 client implementations generate
	extra CRLF's after a POST request. To restate what is
	explicitly forbidden by the BNF, an HTTP/1.1 client MUST
	NOT preface or follow a request with an extra CRLF. 

Now if you want to support these broken clients, it should be
as simple as doing the read that Eric suggested, but with the
proviso that you only have to read one CRLF before closing.

This workaround for broken HTTP clients definitely does not belong
in the TCP stack.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  7:34                   ` Willy Tarreau
  2010-09-27  7:42                     ` David Miller
@ 2010-09-27  9:12                     ` Julian Anastasov
  2010-09-27 19:24                       ` Willy Tarreau
  1 sibling, 1 reply; 39+ messages in thread
From: Julian Anastasov @ 2010-09-27  9:12 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: David Miller, netdev


 	Hello,

On Mon, 27 Sep 2010, Willy Tarreau wrote:

> case A (current one) :
>   we send the response to the client from an orphaned connection.
>   Most of the time, the client won't have any issue and will get the
>   response. In some rare circumstances, some data sent by the client
>   after the response causes an RST to be emitted, which may destroy
>   in-flight data. Those issues are extremely rare, but they still
>   happen.
>
> case B (my proposal, and was the case before the RFC2525 fix) :
>   we send the response to the client.
>   it acks it
>   we send an RST. End of the transfer. Total time: 50ms (avg RTT over ADSL).
>
> case C (alternative) :
>   we send the response to the client.
>   the application can't know the client has acked it, and must keep
>   the connection open for however long is necessary to get the only
>   form of ACK the application can detect: the FIN from the client,
>   which takes 6 minutes on my ADSL line for 10 megs.

 	If it is not already mentioned: the application can
know whether sent data is acked. I think ioctl SIOCOUTQ is for
this purpose. Maybe an application that wants to reliably send
an HTTP error response before closing should do something
like:

- add this FD to some list for monitoring instead of keeping
large connection state
- use shutdown SHUT_WR to add a FIN after the response
- use setsockopt SO_RCVBUF with some low value to close the
RX window; we do not want the body
- wait for POLLHUP (FIN), not for POLLIN, because we want to
ignore data, not to read it. Still, data can be read and
dropped if needed to release the socket memory
- use a timer to limit the time we wait for our data to be acked
- use SIOCOUTQ to know if everything has been received by the peer
and then close the fd

> In case C, not only the state remains *a lot* longer, but the bandwidth
> usage is much worse, and in the end the client does not even get the reset
> that we're trying to ensure it gets to indicate that the data were dropped.
>
> So while case C is a reliable workaround, it's the least efficient method
> and the most expensive one in terms of memory, CPU, network bandwidth,
> socket usage, file descriptor usage and perceived time.
>
> You see, I'm not trying to make dirty dangerous things to save a few
> lines of code. I'm even OK to have a lot of linux-specific code to make
> use of the features the linux stack provides that makes it more efficient
> than other implementations. I'm just seeking reliability.
>
> Willy

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  7:42                     ` David Miller
@ 2010-09-27 19:21                       ` Willy Tarreau
  2010-09-27 23:28                         ` Herbert Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2010-09-27 19:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Mon, Sep 27, 2010 at 12:42:39AM -0700, David Miller wrote:
> I still think it's completely broken that you want to close a
> connection for which data is still going to arrive.
> 
> And I really can't think of why this can't be solved at the
> application level.
> 
> Either there is an application level ACK that you have to wait for
> anyways, or there isn't and you receive the entire request packet
> before you start sending the data.
> 
> If there is some kind of unpredictable "dribbling" after the request
> arrives, you really have to fix that.
> 
> I honestly have no sympathy for an application level protocol that
> works this way, and I don't think the kernel is the place where this
> should be handled.
> 
> Please don't exhaust me any further on this issue, thank you.

David,

I don't want to exhaust you on the issue and I really understand that
you quickly read my explanations and don't get the issues.

Two quick facts :
  - HTTP allows the client to send whatever it wants whenever it wants
    and allows the server to close after whatever response it wants.
    Thus the server cannot predict whether the client will talk.

  - orphans don't work anymore, period. Why not remove that whole code
    if you claim it must not be used? You still did not reply to this
    point, and I suspect you still won't, probably because you've
    realized that there is indeed a bug which is not easy to solve,
    given the limited use this code has.

Please wait for a moment when you have a bit more spare time and check
what orphans can still be used for with this issue present => nothing.
That's why I'm trying to discuss a possible fix.

Thanks,
Willy


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  9:12                     ` Julian Anastasov
@ 2010-09-27 19:24                       ` Willy Tarreau
  2010-09-27 20:00                         ` Eric Dumazet
  2010-09-28  9:01                         ` Julian Anastasov
  0 siblings, 2 replies; 39+ messages in thread
From: Willy Tarreau @ 2010-09-27 19:24 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev

Hi Julian

[removed Davem from the CC upon his request, not to pollute him]

On Mon, Sep 27, 2010 at 12:12:24PM +0300, Julian Anastasov wrote:
> 	If it is not already mentioned: the application can
> know whether sent data is acked. I think ioctl SIOCOUTQ is for
> this purpose. Maybe an application that wants to reliably send
> an HTTP error response before closing should do something
> like:
> 
> - add this FD to some list for monitoring instead of keeping
> large connection state
> - use shutdown SHUT_WR to add a FIN after the response
> - use setsockopt SO_RCVBUF with some low value to close the
> RX window; we do not want the body
> - wait for POLLHUP (FIN), not for POLLIN, because we want to
> ignore data, not to read it. Still, data can be read and
> dropped if needed to release the socket memory
> - use a timer to limit the time we wait for our data to be acked
> - use SIOCOUTQ to know if everything has been received by the peer
> and then close the fd

Thanks very much for this suggestion. I was looking for something
like this and even looked at the tcp_info struct, but it did not
look very easy to use.

Still, I think that polling on POLLIN and checking with SIOCOUTQ
on every read to see if the out queue is now empty would do the
trick, without forcing us to read huge amounts of unnecessary data.

I'll simply enclose that inside an #ifdef LINUX and that should be
OK. It kinda sucks to have to work around low-level issues at the
application level, but at least this workaround is acceptable.

Regards,
Willy


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27  8:02 ` Herbert Xu
@ 2010-09-27 20:00   ` Willy Tarreau
  2010-09-27 20:08     ` Rick Jones
  0 siblings, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2010-09-27 20:00 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

Hi Herbert,

On Mon, Sep 27, 2010 at 04:02:22PM +0800, Herbert Xu wrote:
> Willy Tarreau <w@1wt.eu> wrote:
> >
> > Looking more closely, I noticed that in traces showing the issue,
> > the client was sending an additional CRLF after the data in a
> > separate packet (permitted eventhough not recommended).
> 
> Where is this permitted? RFC2616 says:
> 
> 	Certain buggy HTTP/1.0 client implementations generate
> 	extra CRLF's after a POST request. To restate what is
> 	explicitly forbidden by the BNF, an HTTP/1.1 client MUST
> 	NOT preface or follow a request with an extra CRLF. 

And the paragraph just before says :

   In the interest of robustness, servers SHOULD ignore any empty
   line(s) received where a Request-Line is expected. In other words, if
   the server is reading the protocol stream at the beginning of a
   message and receives a CRLF first, it should ignore the CRLF.

That's the usual principle: be strict in what you send and liberal
in what you accept. Also, clients are encouraged to pipeline requests
over a connection and may quite legitimately send a new request before
the current response is completely received.

> This workaround for broken HTTP clients definitely does not belong
> in the TCP stack.

I'm not trying to work around broken HTTP clients using the TCP stack;
that's contrary to my principles. I want to ensure that the orphans code,
which should legitimately be usable, can be used. And if orphans work
again as advertised, then the HTTP issue that revealed the problem
automatically gets fixed.

Regards,
Willy


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27 19:24                       ` Willy Tarreau
@ 2010-09-27 20:00                         ` Eric Dumazet
  2010-09-28  9:01                         ` Julian Anastasov
  1 sibling, 0 replies; 39+ messages in thread
From: Eric Dumazet @ 2010-09-27 20:00 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Julian Anastasov, netdev

Le lundi 27 septembre 2010 à 21:24 +0200, Willy Tarreau a écrit :
> Hi Julian
> 
> [removed Davem from the CC upon his request, not to pollute him]
> 
> On Mon, Sep 27, 2010 at 12:12:24PM +0300, Julian Anastasov wrote:
> > 	If it is not already mentioned: the application can
> > know whether sent data is acked. I think ioctl SIOCOUTQ is for
> > this purpose. Maybe an application that wants to reliably send
> > an HTTP error response before closing should do something
> > like:
> > 
> > - add this FD to some list for monitoring instead of keeping
> > large connection state
> > - use shutdown SHUT_WR to add a FIN after the response
> > - use setsockopt SO_RCVBUF with some low value to close the
> > RX window; we do not want the body
> > - wait for POLLHUP (FIN), not for POLLIN, because we want to
> > ignore data, not to read it. Still, data can be read and
> > dropped if needed to release the socket memory
> > - use a timer to limit the time we wait for our data to be acked
> > - use SIOCOUTQ to know if everything has been received by the peer
> > and then close the fd
> 
> Thanks very much for this suggestion. I was looking for something
> like this and even looked at the tcp_info struct, but it did not
> look very easy to use.
> 
> Still, I think that polling on POLLIN and checking with SIOCOUTQ
> on every read to see if the out queue is now empty would do the
> trick, without forcing to read huge amounts of unnecessary data.
> 
> I'll simply enclose that inside a #ifdef LINUX and that should be
> OK. It kinda sucks to be able to workaround low level issues at the
> application level but at least this workaround is acceptable.

Just a point :

RFC1122 :

	A host MAY implement a "half-duplex" TCP close sequence, so
        that an application that has called CLOSE cannot continue to
        read data from the connection.  If such a host issues a
        CLOSE call while received data is still pending in TCP, or
        if new data is received after CLOSE is called, its TCP
        SHOULD send a RST to show that data was lost.


Maybe only Linux respects the RFC? ;)




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27 20:00   ` Willy Tarreau
@ 2010-09-27 20:08     ` Rick Jones
  2010-09-27 20:20       ` Willy Tarreau
  0 siblings, 1 reply; 39+ messages in thread
From: Rick Jones @ 2010-09-27 20:08 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Herbert Xu, netdev

Willy Tarreau wrote:
> Hi Herbert,
> 
> On Mon, Sep 27, 2010 at 04:02:22PM +0800, Herbert Xu wrote:
> 
>>Willy Tarreau <w@1wt.eu> wrote:
>>
>>>Looking more closely, I noticed that in traces showing the issue,
>>>the client was sending an additional CRLF after the data in a
>>>separate packet (permitted eventhough not recommended).
>>
>>Where is this permitted? RFC2616 says:
>>
>>	Certain buggy HTTP/1.0 client implementations generate
>>	extra CRLF's after a POST request. To restate what is
>>	explicitly forbidden by the BNF, an HTTP/1.1 client MUST
>>	NOT preface or follow a request with an extra CRLF. 
> 
> 
> And the paragraph just before says :
> 
>    In the interest of robustness, servers SHOULD ignore any empty
>    line(s) received where a Request-Line is expected. In other words, if
>    the server is reading the protocol stream at the beginning of a
>    message and receives a CRLF first, it should ignore the CRLF.

It is the HTTP server code being addressed there, not the underlying TCP
stack, is it not?


rick jones

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27 20:08     ` Rick Jones
@ 2010-09-27 20:20       ` Willy Tarreau
  0 siblings, 0 replies; 39+ messages in thread
From: Willy Tarreau @ 2010-09-27 20:20 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, netdev

On Mon, Sep 27, 2010 at 01:08:13PM -0700, Rick Jones wrote:
> Willy Tarreau wrote:
> >Hi Herbert,
> >
> >On Mon, Sep 27, 2010 at 04:02:22PM +0800, Herbert Xu wrote:
> >
> >>Willy Tarreau <w@1wt.eu> wrote:
> >>
> >>>Looking more closely, I noticed that in traces showing the issue,
> >>>the client was sending an additional CRLF after the data in a
> >>>separate packet (permitted eventhough not recommended).
> >>
> >>Where is this permitted? RFC2616 says:
> >>
> >>	Certain buggy HTTP/1.0 client implementations generate
> >>	extra CRLF's after a POST request. To restate what is
> >>	explicitly forbidden by the BNF, an HTTP/1.1 client MUST
> >>	NOT preface or follow a request with an extra CRLF. 
> >
> >
> >And the paragraph just before says :
> >
> >   In the interest of robustness, servers SHOULD ignore any empty
> >   line(s) received where a Request-Line is expected. In other words, if
> >   the server is reading the protocol stream at the beginning of a
> >   message and receives a CRLF first, it should ignore the CRLF.
> 
> It is the HTTP server code being addressed there, not the underlying TCP 
> stack, is it not?

Yes, precisely.

regards,
Willy

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27 19:21                       ` Willy Tarreau
@ 2010-09-27 23:28                         ` Herbert Xu
  2010-09-28  5:12                           ` Willy Tarreau
  0 siblings, 1 reply; 39+ messages in thread
From: Herbert Xu @ 2010-09-27 23:28 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: davem, netdev

Willy Tarreau <w@1wt.eu> wrote:
> 
> Two quick facts :
>  - HTTP allows the client to send whatever it wants whenever it wants
>    and allows the server to close after whatever response it wants.
>    Thus the server cannot predict whether the client will talk.

No, it does not.  Only buggy HTTP clients do that.  Also, have
you ever observed any buggy HTTP client that sends more than
one CRLF?  If not, then you only have to deal with the case of
a single extra CRLF.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27 23:28                         ` Herbert Xu
@ 2010-09-28  5:12                           ` Willy Tarreau
  2010-09-28  5:32                             ` David Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2010-09-28  5:12 UTC (permalink / raw)
  To: Herbert Xu; +Cc: davem, netdev

On Tue, Sep 28, 2010 at 08:28:57AM +0900, Herbert Xu wrote:
> Willy Tarreau <w@1wt.eu> wrote:
> > 
> > Two quick facts :
> >  - HTTP allows the client to send whatever it wants whenever it wants
> >    and allows the server to close after whatever response it wants.
> >    Thus the server cannot predict whether the client will talk.
> 
> No it does not.  Only buggy HTTP clients do that.

I don't agree, Herbert. Pipelining consists of exactly that, and is an
encouraged practice. If you say that only buggy clients send an extra
CRLF, then yes, I agree. But you might very well get a new request you
can't predict at all, and the arrival of this new request must not
destroy the end of the last response's data.

Once again, for now, I will only handle this precise case of the extra
CRLF, but this will not fix the issue with pipelined requests.

Regards,
Willy


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-28  5:12                           ` Willy Tarreau
@ 2010-09-28  5:32                             ` David Miller
  2010-09-28  5:37                               ` Willy Tarreau
  0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2010-09-28  5:32 UTC (permalink / raw)
  To: w; +Cc: herbert, netdev

From: Willy Tarreau <w@1wt.eu>
Date: Tue, 28 Sep 2010 07:12:58 +0200

> Once again, for now, I will only handle this precise case of the extra
> CRLF, but this will not fix the issue with pipelined requests.

You might want to read server/connection.c in the apache httpd
sources.

It does exactly what Eric Dumazet and I have suggested to you.

But, I guess the entire world is wrong and you're right Willy.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-28  5:32                             ` David Miller
@ 2010-09-28  5:37                               ` Willy Tarreau
  0 siblings, 0 replies; 39+ messages in thread
From: Willy Tarreau @ 2010-09-28  5:37 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev

On Mon, Sep 27, 2010 at 10:32:31PM -0700, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Tue, 28 Sep 2010 07:12:58 +0200
> 
> > Once again, for now, I will only handle this precise case of the extra
> > CRLF, but this will not fix the issue with pipelined requests.
> 
> You might want to read server/connection.c in the apache httpd
> sources.
>
> It does exactly what Eric Dumazet and I have suggested to you.

I've seen that. However, nginx, squid, thttpd, haproxy and lighttpd all
make use of the standard orphans and are all affected by the same issue.

> But, I guess the entire world is wrong and you're right, Willy.

Given the figures above, that's not how I analyze it, I'm sorry.

Willy


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-27 19:24                       ` Willy Tarreau
  2010-09-27 20:00                         ` Eric Dumazet
@ 2010-09-28  9:01                         ` Julian Anastasov
  2010-09-28  9:26                           ` Willy Tarreau
  1 sibling, 1 reply; 39+ messages in thread
From: Julian Anastasov @ 2010-09-28  9:01 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev


 	Hello,

On Mon, 27 Sep 2010, Willy Tarreau wrote:

>> - add this FD in some list for monitoring instead of keeping
>> large connection state
>> - use shutdown SHUT_WR to add FIN after response
>> - use setsockopt SO_RCVBUF with some low value to close the
>> RX window, we do not want the body
>> - wait for POLLHUP (FIN), not for POLLIN because we want to
>> ignore data, not to read it. Still, data can be read and
>> dropped if needed to release the socket memory
>> - use timer to limit the time we wait our data to be acked
>> - use SIOCOUTQ to know if everything is received in peer and
>> then close the fd
>
> Thanks very much for this suggestion. I was looking for something
> like this and even looked at the tcp_info struct, but it did not
> look very easy to use.
>
> Still, I think that polling on POLLIN and checking with SIOCOUTQ
> on every read to see if the out queue is now empty would do the
> trick, without forcing to read huge amounts of unnecessary data.
>
> I'll simply enclose that inside a #ifdef LINUX and that should be
> OK. It kinda sucks to be able to workaround low level issues at the
> application level but at least this workaround is acceptable.

 	I'm thinking of another option, but I don't know the TCP
details well:

- If SO_RCVBUF=0 really closes RX window

and

- while (read() > 0) {} gets all unread data

- just call close() to convert socket to orphan without a risk
of RST

 	Now that the socket is an orphan, I'm not sure what takes
priority:

- if new DATA is in flight to us, is it considered out of window?
Do we send an RST in FIN_WAIT1/CLOSING/LAST_ACK state in this case?
If we do not send an RST, then our goal is achieved: send everything
reliably without accepting data that needs an RST. And we do not
need to keep the fd in user space.
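Taken together, the recipe above could be sketched as the following C
helper. The function name, the flat per-poll timeout and the exact
ordering are assumptions for illustration only, not code from haproxy
or any other product discussed in this thread:

```c
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>      /* SIOCOUTQ */
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: send FIN, shrink the RX window, then wait until
 * the peer has acked all of our data (SIOCOUTQ == 0), a timer expires,
 * or the peer's FIN arrives; only then close() the socket. */
int lingering_close(int fd, int timeout_ms)
{
    char junk[1024];
    int unsent = 0;
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    shutdown(fd, SHUT_WR);                      /* FIN after the response */

    int rcvbuf = 0;                             /* ask for a minimal RX window */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

    for (;;) {
        if (ioctl(fd, SIOCOUTQ, &unsent) < 0 || unsent == 0)
            break;                              /* all acked, or can't tell */
        /* note: per-poll timeout; a real implementation would track
         * an absolute deadline instead */
        if (poll(&pfd, 1, timeout_ms) <= 0)
            break;                              /* timer expired or error */
        if ((pfd.revents & (POLLIN | POLLHUP)) &&
            read(fd, junk, sizeof(junk)) <= 0)
            break;                              /* peer FIN, or error */
    }
    return close(fd);
}
```

The key point is the exit condition: once SIOCOUTQ reports zero, every
byte we sent has been acked, so the subsequent close() can no longer
cause the stack to destroy response data with an RST.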

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: TCP: orphans broken by RFC 2525 #2.17
  2010-09-28  9:01                         ` Julian Anastasov
@ 2010-09-28  9:26                           ` Willy Tarreau
  0 siblings, 0 replies; 39+ messages in thread
From: Willy Tarreau @ 2010-09-28  9:26 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev

Hello Julian,

On Tue, Sep 28, 2010 at 12:01:42PM +0300, Julian Anastasov wrote:
> 	I'm thinking of another option, but I don't know the TCP
> details well:
> 
> - If SO_RCVBUF=0 really closes RX window

I've been wondering what happens to incoming data after the RX window
is closed. Is the packet just dropped, or may it still be queued? In fact,
my concern is this: if SO_RCVBUF=0 just advertises a zero window, we can
only be sure about the flush after an RTT. However, if it really prevents
the receive side from accepting new data, then I agree we're safe.

> and
> 
> - while (read() > 0) {} gets all unread data

During my tests on other products, I found that squid does that without
the zero window, possibly resulting in an infinite loop when the sender
is faster than the receiver. In our case, if the zero window works, it
should not be an issue.
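A minimal sketch of such a drain loop, with the SIOCOUTQ check (rather
than unbounded reading) as the loop's exit condition; the helper name
is hypothetical, and this is not actual squid or haproxy code:

```c
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>      /* SIOCOUTQ */
#include <poll.h>
#include <unistd.h>

/* Read and discard incoming data, but stop as soon as the kernel reports
 * our output queue empty: at that point close() cannot destroy anything.
 * With SO_RCVBUF=0 the zero window also bounds how fast we can be fed,
 * and the poll timeout prevents waiting forever on a dead peer. */
int drain_until_acked(int fd, int timeout_ms)
{
    char junk[1024];
    int unsent;

    for (;;) {
        if (ioctl(fd, SIOCOUTQ, &unsent) < 0)
            return -1;
        if (unsent == 0)
            return 0;                   /* peer acked everything: safe */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, timeout_ms) <= 0)
            return -1;                  /* give up after the timeout */
        if (read(fd, junk, sizeof(junk)) <= 0)
            return -1;                  /* peer closed, or error */
    }
}
```

Unlike a bare while (read() > 0) loop, this terminates as soon as the
last byte we sent is acked, regardless of how much the sender still
wants to transmit.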

> - just call close() to convert socket to orphan without a risk
> of RST
> 
> 	Now that the socket is an orphan, I'm not sure what takes
> priority:
> 
> - if new DATA is in flight to us, is it considered out of window?
> Do we send an RST in FIN_WAIT1/CLOSING/LAST_ACK state in this case?
> If we do not send an RST, then our goal is achieved: send everything
> reliably without accepting data that needs an RST. And we do not
> need to keep the fd in user space.

From what I have observed, the issue is only for data present before
the close. Upon FIN_WAIT1/CLOSING/LAST_ACK, we get the expected reset
once the ACK acknowledges the last data sent, but I've not observed
it for new data.

Best regards,
Willy


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2010-09-28  9:26 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-26 13:17 TCP: orphans broken by RFC 2525 #2.17 Willy Tarreau
2010-09-26 17:02 ` Eric Dumazet
2010-09-26 17:40   ` Willy Tarreau
2010-09-26 18:35     ` Eric Dumazet
2010-09-26 18:49       ` Willy Tarreau
2010-09-26 21:01         ` Eric Dumazet
2010-09-26 21:46           ` Willy Tarreau
2010-09-26 22:19             ` David Miller
2010-09-26 22:10         ` David Miller
2010-09-26 19:16     ` Willy Tarreau
2010-09-26 22:14       ` David Miller
2010-09-26 22:13 ` David Miller
2010-09-26 22:34   ` Willy Tarreau
2010-09-26 22:38     ` David Miller
2010-09-26 22:54       ` Willy Tarreau
2010-09-26 23:08         ` David Miller
2010-09-26 23:25           ` Willy Tarreau
2010-09-27  1:12             ` David Miller
2010-09-27  5:39               ` Willy Tarreau
2010-09-27  5:48                 ` Eric Dumazet
2010-09-27  6:04                   ` Willy Tarreau
2010-09-27  6:44                   ` David Miller
2010-09-27  6:42                 ` David Miller
2010-09-27  7:34                   ` Willy Tarreau
2010-09-27  7:42                     ` David Miller
2010-09-27 19:21                       ` Willy Tarreau
2010-09-27 23:28                         ` Herbert Xu
2010-09-28  5:12                           ` Willy Tarreau
2010-09-28  5:32                             ` David Miller
2010-09-28  5:37                               ` Willy Tarreau
2010-09-27  9:12                     ` Julian Anastasov
2010-09-27 19:24                       ` Willy Tarreau
2010-09-27 20:00                         ` Eric Dumazet
2010-09-28  9:01                         ` Julian Anastasov
2010-09-28  9:26                           ` Willy Tarreau
2010-09-27  8:02 ` Herbert Xu
2010-09-27 20:00   ` Willy Tarreau
2010-09-27 20:08     ` Rick Jones
2010-09-27 20:20       ` Willy Tarreau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).