* [PATCH RFC net-next 0/1] tcp: close socket without reset on incoming data
From: Debabrata Banerjee @ 2018-05-18 19:01 UTC
To: David S. Miller, netdev; +Cc: Alexey Kuznetsov, Hideaki YOSHIFUJI, dbanerje
There is a basic problem with TCP sockets: sending data and then closing
the socket is unreliable. One good example is a web server that wants to
send an error back on an HTTP POST and close the socket. However, if the
POST is of any significant size, what really happens is that the browser
gets a broken socket while it is still posting, never reads the error, and
possibly retries the whole POST a number of times. This has been well
documented by other people, for example in this blog post:
https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable
Without this patch, our server application has to hang on to the socket
and sink all of the POST data, eating up memory and CPU. With this patch
the task is offloaded to the kernel, which uses only a timewait socket to
efficiently ack and discard any incoming data. We've been using a similar
patch internally for years, and I think it has applications for everyone.
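As a usage illustration (not part of the patch), here is a minimal sketch
of how a server might set the proposed option before close(); the
TCP_LINGER2 value and the omitted error handling are assumptions for
brevity:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef TCP_CLOSE_NORST
    #define TCP_CLOSE_NORST 37 /* value from the uapi hunk in this series */
    #endif

    /* Send an early error response, then let the kernel ack and discard
     * the rest of the POST body instead of replying with a RST.
     */
    static void send_error_and_close(int fd, const char *resp)
    {
            int one = 1;
            int fin_wait2_secs = 30; /* how long to keep sinking data */

            write(fd, resp, strlen(resp));
            setsockopt(fd, IPPROTO_TCP, TCP_LINGER2, &fin_wait2_secs,
                       sizeof(fin_wait2_secs));
            setsockopt(fd, IPPROTO_TCP, TCP_CLOSE_NORST, &one, sizeof(one));
            close(fd); /* FIN_WAIT1 -> FIN_WAIT2; new data acked, not reset */
    }

On a kernel without this patch the TCP_CLOSE_NORST setsockopt() simply
fails with ENOPROTOOPT and close() behaves as before.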
Debabrata Banerjee (1):
tcp: close socket without reset on incoming data
include/linux/tcp.h | 4 +++-
include/uapi/linux/tcp.h | 2 +-
net/ipv4/tcp.c | 23 +++++++++++++++++++++--
net/ipv4/tcp_input.c | 16 ++++++++++++----
net/ipv4/tcp_minisocks.c | 15 +++++++++++++++
5 files changed, 52 insertions(+), 8 deletions(-)
--
2.17.0
* [PATCH RFC net-next 1/1] tcp: close socket without reset on incoming data
From: Debabrata Banerjee @ 2018-05-18 19:01 UTC
To: David S. Miller, netdev; +Cc: Alexey Kuznetsov, Hideaki YOSHIFUJI, dbanerje
When TCP_CLOSE_NORST is set before a close(), offload sinking of
unwanted data to the kernel with low resource usage, with a timeout of
TCP_LINGER2. The socket will transition to FIN_WAIT1 and then FIN_WAIT2
where it will ack data until either the timeout is hit, or a RST or FIN
is received.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
include/linux/tcp.h | 4 +++-
include/uapi/linux/tcp.h | 2 +-
net/ipv4/tcp.c | 23 +++++++++++++++++++++--
net/ipv4/tcp_input.c | 16 ++++++++++++----
net/ipv4/tcp_minisocks.c | 15 +++++++++++++++
5 files changed, 52 insertions(+), 8 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 72705eaf4b84..bd44bc99b480 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -226,7 +226,8 @@ struct tcp_sock {
fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
is_sack_reneg:1, /* in recovery from loss with SACK reneg? */
- unused:2;
+ norst:1, /* Don't send RST on shutdown() socket */
+ unused:1;
u8 nonagle : 4,/* Disable Nagle algorithm? */
thin_lto : 1,/* Use linear timeouts for thin streams */
recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
@@ -429,6 +430,7 @@ struct tcp_timewait_sock {
#ifdef CONFIG_TCP_MD5SIG
struct tcp_md5sig_key *tw_md5_key;
#endif
+ int tw_norst;
};
static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 29eb659aa77a..369f3402b669 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -124,8 +124,8 @@ enum {
#define TCP_FASTOPEN_NO_COOKIE 34 /* Enable TFO without a TFO cookie */
#define TCP_ZEROCOPY_RECEIVE 35
#define TCP_INQ 36 /* Notify bytes available to read as a cmsg on read */
-
#define TCP_CM_INQ TCP_INQ
+#define TCP_CLOSE_NORST 37 /* Don't send RST on close()'d socket */
struct tcp_repair_opt {
__u32 opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0a2ea0bbf867..29fe763002e5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2318,8 +2318,10 @@ void tcp_close(struct sock *sk, long timeout)
struct sk_buff *skb;
int data_was_unread = 0;
int state;
+ struct tcp_sock *tp;
lock_sock(sk);
+ tp = tcp_sk(sk);
sk->sk_shutdown = SHUTDOWN_MASK;
if (sk->sk_state == TCP_LISTEN) {
@@ -2362,8 +2364,19 @@ void tcp_close(struct sock *sk, long timeout)
} else if (data_was_unread) {
/* Unread data was tossed, zap the connection. */
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
- tcp_set_state(sk, TCP_CLOSE);
- tcp_send_active_reset(sk, sk->sk_allocation);
+
+ if (unlikely(tp->norst)) {
+ if (tcp_close_state(sk)) {
+ /* We will discard all new incoming data
+ * set window to max of current or init.
+ */
+ tp->rcv_wnd = max(tp->rcv_wnd, MAX_TCP_WINDOW);
+ tcp_send_fin(sk);
+ }
+ } else {
+ tcp_set_state(sk, TCP_CLOSE);
+ tcp_send_active_reset(sk, sk->sk_allocation);
+ }
} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
/* Check zero linger _after_ checking for unread data. */
sk->sk_prot->disconnect(sk, 0);
@@ -3040,6 +3053,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
else
tp->recvmsg_inq = val;
break;
+ case TCP_CLOSE_NORST:
+ tp->norst = !!val;
+ break;
default:
err = -ENOPROTOOPT;
break;
@@ -3523,6 +3539,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
return err;
}
#endif
+ case TCP_CLOSE_NORST:
+ val = tp->norst;
+ break;
default:
return -ENOPROTOOPT;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index aebb29ab2fdf..e0aa6e126700 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6054,7 +6054,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
break;
}
- if (tp->linger2 < 0) {
+ if (likely(!tp->norst) && tp->linger2 < 0) {
tcp_done(sk);
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
return 1;
@@ -6064,9 +6064,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
/* Receive out of order FIN after close() */
if (tp->syn_fastopen && th->fin)
tcp_fastopen_active_disable(sk);
- tcp_done(sk);
- NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
- return 1;
+
+ if (likely(!tp->norst)) {
+ tcp_done(sk);
+ NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
+ return 1;
+ }
}
tmo = tcp_fin_time(sk);
@@ -6123,6 +6126,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
if (sk->sk_shutdown & RCV_SHUTDOWN) {
if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
+ if (unlikely(tp->norst)) {
+ tcp_send_ack(sk);
+ goto discard;
+ }
+
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
tcp_reset(sk);
return 1;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f867658b4b30..48a9d5351478 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -133,6 +133,20 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
return TCP_TW_SUCCESS;
}
+ if (tcptw->tw_norst) {
+ /* ack and discard new data */
+ tcptw->tw_rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+ if (tmp_opt.saw_tstamp) {
+ tcptw->tw_ts_recent_stamp = get_seconds();
+ tcptw->tw_ts_recent = tmp_opt.rcv_tsval;
+ }
+
+ if (th->fin) /* active remote close, we can die now */
+ inet_twsk_deschedule_put(tw);
+
+ return TCP_TW_ACK;
+ }
+
/* New data or FIN. If new data arrive after half-duplex close,
* reset.
*/
@@ -272,6 +286,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
tcptw->tw_ts_offset = tp->tsoffset;
tcptw->tw_last_oow_ack_time = 0;
+ tcptw->tw_norst = tp->norst;
#if IS_ENABLED(CONFIG_IPV6)
if (tw->tw_family == PF_INET6) {
--
2.17.0
* Re: [PATCH RFC net-next 1/1] tcp: close socket without reset on incoming data
From: David Miller @ 2018-05-19 21:00 UTC
To: dbanerje; +Cc: netdev, kuznet, yoshfuji
From: Debabrata Banerjee <dbanerje@akamai.com>
Date: Fri, 18 May 2018 15:01:41 -0400
> When TCP_CLOSE_NORST is set before a close(), offload sinking of
> unwanted data to the kernel with low resource usage, with a timeout of
> TCP_LINGER2. The socket will transition to FIN_WAIT1 and then FIN_WAIT2
> where it will ack data until either the timeout is hit, or a RST or FIN
> is received.
>
> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
This is a very serious protocol violation.
You're telling the remote end that you received the data even though
the socket was closed and nothing actually "sunk" the bytes.
This doesn't even go into the issues of sending cumulative ACKs in
response to data which is arbitrarily out-of-order.
The whole problem is that the post data is sent before the client
looks to see if the server is willing to accept the post data or not.
A: I'd like to send you 200MB of crap
[ 200MB of craaaa...
B: Sorry I won't be accepting that, please don't send it.
CLOSE, send reset since some of crap is queued up and
was never read
A: aaaaapp... received RESET
A: Why didn't B accept my 200MB of crap?
Sorry, you'll need to deal with this issue in another way.
Thanks.
* RE: [PATCH RFC net-next 1/1] tcp: close socket without reset on incoming data
From: Banerjee, Debabrata @ 2018-05-19 22:57 UTC
To: 'David Miller'
Cc: netdev@vger.kernel.org, kuznet@ms2.inr.ac.ru,
yoshfuji@linux-ipv6.org
> From: David Miller [mailto:davem@davemloft.net]
> From: Debabrata Banerjee <dbanerje@akamai.com>
>
> > When TCP_CLOSE_NORST is set before a close(), offload sinking of
> > unwanted data to the kernel with low resource usage, with a timeout of
> > TCP_LINGER2. The socket will transition to FIN_WAIT1 and then
> > FIN_WAIT2 where it will ack data until either the timeout is hit, or a
> > RST or FIN is received.
> >
>
> This is a very serious protocol violation.
Actually, I don't believe it violates the protocol. RFC 1122, section 4.2.2.13 reads:
"A host MAY implement a "half-duplex" TCP close sequence, so
that an application that has called CLOSE cannot continue to
read data from the connection. If such a host issues a
CLOSE call while received data is still pending in TCP, or
if new data is received after CLOSE is called, its TCP
SHOULD send a RST to show that data was lost."
Per RFC 2119, the keyword SHOULD means recommended but optional, as opposed to MUST.
> You're telling the remote end that you received the data even though the
> socket was closed and nothing actually "sunk" the bytes.
It does the same thing the application would do, but with much less overhead. The application called close() because it no longer cares about new data, but it still expected send() prior to close() to actually send.
> This doesn't even go into the issues of sending cumulative ACKs in response
> to data which is arbitrarily out-of-order.
Not sure what issues we'd run into here; out-of-order and duplicate ACKs can happen normally.
> The whole problem is that the post data is sent before the client looks to see
> if the server is willing to accept the post data or not.
>
> A: I'd like to send you 200MB of crap
> [ 200MB of craaaa...
> B: Sorry I won't be accepting that, please don't send it.
>
> CLOSE, send reset since some of crap is queued up and
> was never read
>
> A: aaaaapp... received RESET
> A: Why didn't B accept my 200MB of crap?
That's correct, but it's not limited to POST; it could be any data transfer over TCP sockets. Of course, in the 200MB POST case, that's a lot of resources and copying to userspace.
> Sorry, you'll need to deal with this issue in another way.
Well, if the intersection with the definition of close() spooks you, something similar could be implemented as a setsockopt(TCP_SINK_DATA) around shutdown(), instructing the socket to immediately dump incoming data, but with higher resource usage. However, as above, I don't currently believe this patch violates the protocol.
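For illustration only, a rough sketch of what that alternative could look
like from userspace; TCP_SINK_DATA and its value are hypothetical here,
nothing like it exists upstream:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef TCP_SINK_DATA
    #define TCP_SINK_DATA 38 /* hypothetical option, chosen for the sketch */
    #endif

    static void reject_and_drain(int fd)
    {
            int one = 1;

            /* Tell the socket to immediately discard incoming data. */
            setsockopt(fd, IPPROTO_TCP, TCP_SINK_DATA, &one, sizeof(one));

            /* Half-close: our FIN goes out while the full socket stays
             * around to sink the peer's data -- higher resource usage
             * than the timewait-based TCP_CLOSE_NORST approach.
             */
            shutdown(fd, SHUT_WR);
            close(fd);
    }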
* Re: [PATCH RFC net-next 1/1] tcp: close socket without reset on incoming data
From: David Miller @ 2018-05-20 2:52 UTC
To: dbanerje; +Cc: netdev, kuznet, yoshfuji
From: "Banerjee, Debabrata" <dbanerje@akamai.com>
Date: Sat, 19 May 2018 22:57:48 +0000
> It does the same thing the application would do, but with much less
> overhead. The application called close() because it no longer cares
> about new data, but it still expected send() prior to close() to
> actually send.
It's not the same.
If you just sink the data in the protocol stack, the sender has no
way whatsoever to know that the application did not see the data.
The sender must have a way to know that the application at the other
end received the data, whether they used it or not. And what breaks
the ambiguity is that reset.
This is critically important.
> Well if the intersection with the definition of the close() spooks
> you something similar could be implemented as a
> setsockopt(TCP_SINK_DATA) around shutdown(), to instruct the socket
> to immediately dump data, but with higher resource usage. However as
> above, I don't currently believe this patch violates the protocol.
The SHOULD you quoted in RFC 1122 is explicitly listed in another RFC as a
recommended behavior. I know, because that text is what led me to
implement the current behavior.
Please see RFC 2525, section 2.16: Failure to send a RST after Half
Duplex Close.