* [PATCH RFC net-next 0/1] tcp: close socket without reset on incoming data

From: Debabrata Banerjee <dbanerje@akamai.com>
Date: 2018-05-18 19:01 UTC
To: David S. Miller, netdev
Cc: Alexey Kuznetsov, Hideaki YOSHIFUJI, dbanerje

There is a basic problem with TCP sockets: sending data and then closing
the socket is unreliable. A good example is a web server that wants to
send an error back on an HTTP POST and close the socket. Assuming the
POST is of any significant size, what really happens is that the browser
sees a broken socket while it is still posting, never reads the error,
and possibly retries the whole POST a number of times. This has been
well documented by other people, for example in this blog post:

https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable

Without this patch, our server application has to hang on to the socket
and sink all of the POST data itself, eating up memory and CPU. With
this patch the task is offloaded to the kernel, which uses only a
timewait socket to efficiently ack and discard any incoming data. We've
been using a similar patch internally for years; I think it has
applications for everyone.

Debabrata Banerjee (1):
  tcp: close socket without reset on incoming data

 include/linux/tcp.h      |  4 +++-
 include/uapi/linux/tcp.h |  2 +-
 net/ipv4/tcp.c           | 23 +++++++++++++++++++++--
 net/ipv4/tcp_input.c     | 16 ++++++++++++----
 net/ipv4/tcp_minisocks.c | 15 +++++++++++++++
 5 files changed, 52 insertions(+), 8 deletions(-)

-- 
2.17.0
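To make the userspace cost concrete, here is a minimal sketch (illustrative
only, not part of this patch) of the drain-and-close loop a server has to run
today; the function name, buffer size, and per-call timeout handling are all
invented for illustration:

/* Illustrative userspace drain loop, not from this patch: hold the
 * socket and sink peer data until FIN, error, or timeout so that
 * close() does not generate a RST. Simplified: timeout_ms is applied
 * per poll() call rather than as an overall deadline.
 */
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

static void drain_and_close(int fd, int timeout_ms)
{
	char buf[4096];
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	shutdown(fd, SHUT_WR);			/* send our FIN */
	for (;;) {
		int rc = poll(&pfd, 1, timeout_ms);

		if (rc <= 0)			/* timeout or error */
			break;
		if (read(fd, buf, sizeof(buf)) <= 0)
			break;			/* peer FIN or error */
		/* data read and discarded; the stack has acked it */
	}
	close(fd);
}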
* [PATCH RFC net-next 1/1] tcp: close socket without reset on incoming data

From: Debabrata Banerjee <dbanerje@akamai.com>
Date: 2018-05-18 19:01 UTC
To: David S. Miller, netdev
Cc: Alexey Kuznetsov, Hideaki YOSHIFUJI, dbanerje

When TCP_CLOSE_NORST is set before a close(), offload sinking of
unwanted data to the kernel with low resource usage, with a timeout of
TCP_LINGER2. The socket will transition to FIN_WAIT1 and then FIN_WAIT2,
where it will ack data until either the timeout is hit or a RST or FIN
is received.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
 include/linux/tcp.h      |  4 +++-
 include/uapi/linux/tcp.h |  2 +-
 net/ipv4/tcp.c           | 23 +++++++++++++++++++++--
 net/ipv4/tcp_input.c     | 16 ++++++++++++----
 net/ipv4/tcp_minisocks.c | 15 +++++++++++++++
 5 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 72705eaf4b84..bd44bc99b480 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -226,7 +226,8 @@ struct tcp_sock {
 		fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
 		fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
 		is_sack_reneg:1,    /* in recovery from loss with SACK reneg? */
-		unused:2;
+		norst:1,	/* Don't send RST on shutdown() socket */
+		unused:1;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
@@ -429,6 +430,7 @@ struct tcp_timewait_sock {
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_md5sig_key	  *tw_md5_key;
 #endif
+	int			   tw_norst;
 };
 
 static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 29eb659aa77a..369f3402b669 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -124,8 +124,8 @@
 #define TCP_FASTOPEN_NO_COOKIE	34	/* Enable TFO without a TFO cookie */
 #define TCP_ZEROCOPY_RECEIVE	35
 #define TCP_INQ			36	/* Notify bytes available to read as a cmsg on read */
-
 #define TCP_CM_INQ		TCP_INQ
+#define TCP_CLOSE_NORST		37	/* Don't send RST on close()'d socket */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0a2ea0bbf867..29fe763002e5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2318,8 +2318,10 @@ void tcp_close(struct sock *sk, long timeout)
 	struct sk_buff *skb;
 	int data_was_unread = 0;
 	int state;
+	struct tcp_sock *tp;
 
 	lock_sock(sk);
+	tp = tcp_sk(sk);
 	sk->sk_shutdown = SHUTDOWN_MASK;
 
 	if (sk->sk_state == TCP_LISTEN) {
@@ -2362,8 +2364,19 @@ void tcp_close(struct sock *sk, long timeout)
 	} else if (data_was_unread) {
 		/* Unread data was tossed, zap the connection. */
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
-		tcp_set_state(sk, TCP_CLOSE);
-		tcp_send_active_reset(sk, sk->sk_allocation);
+
+		if (unlikely(tp->norst)) {
+			if (tcp_close_state(sk)) {
+				/* We will discard all new incoming data;
+				 * set window to max of current or init.
+				 */
+				tp->rcv_wnd = max(tp->rcv_wnd, MAX_TCP_WINDOW);
+				tcp_send_fin(sk);
+			}
+		} else {
+			tcp_set_state(sk, TCP_CLOSE);
+			tcp_send_active_reset(sk, sk->sk_allocation);
+		}
 	} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
 		/* Check zero linger _after_ checking for unread data. */
 		sk->sk_prot->disconnect(sk, 0);
@@ -3040,6 +3053,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		else
 			tp->recvmsg_inq = val;
 		break;
+	case TCP_CLOSE_NORST:
+		tp->norst = !!val;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -3523,6 +3539,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		return err;
 	}
 #endif
+	case TCP_CLOSE_NORST:
+		val = tp->norst;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index aebb29ab2fdf..e0aa6e126700 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6054,7 +6054,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 			break;
 		}
 
-		if (tp->linger2 < 0) {
+		if (likely(!tp->norst) && tp->linger2 < 0) {
 			tcp_done(sk);
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
 			return 1;
@@ -6064,9 +6064,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 			/* Receive out of order FIN after close() */
 			if (tp->syn_fastopen && th->fin)
 				tcp_fastopen_active_disable(sk);
-			tcp_done(sk);
-			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
-			return 1;
+
+			if (likely(!tp->norst)) {
+				tcp_done(sk);
+				NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
+				return 1;
+			}
 		}
 
 		tmo = tcp_fin_time(sk);
@@ -6123,6 +6126,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 	if (sk->sk_shutdown & RCV_SHUTDOWN) {
 		if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
 		    after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
+			if (unlikely(tp->norst)) {
+				tcp_send_ack(sk);
+				goto discard;
+			}
+
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
 			tcp_reset(sk);
 			return 1;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f867658b4b30..48a9d5351478 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -133,6 +133,20 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 			return TCP_TW_SUCCESS;
 		}
 
+		if (tcptw->tw_norst) {
+			/* ack and discard new data */
+			tcptw->tw_rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+			if (tmp_opt.saw_tstamp) {
+				tcptw->tw_ts_recent_stamp = get_seconds();
+				tcptw->tw_ts_recent	  = tmp_opt.rcv_tsval;
+			}
+
+			if (th->fin) /* active remote close, we can die now */
+				inet_twsk_deschedule_put(tw);
+
+			return TCP_TW_ACK;
+		}
+
 		/* New data or FIN. If new data arrive after half-duplex close,
 		 * reset.
 		 */
@@ -272,6 +286,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 		tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
 		tcptw->tw_ts_offset	= tp->tsoffset;
 		tcptw->tw_last_oow_ack_time = 0;
+		tcptw->tw_norst		= tp->norst;
 
 #if IS_ENABLED(CONFIG_IPV6)
 		if (tw->tw_family == PF_INET6) {
-- 
2.17.0
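For context, here is a minimal sketch of how an application would use the
option proposed by this patch. It assumes the TCP_CLOSE_NORST value (37)
from the uapi change above, which exists only in this RFC and not in
mainline headers; the function name and timeout value are illustrative:

/* Sketch of intended usage, assuming this RFC's uapi definition
 * (TCP_CLOSE_NORST == 37); not a mainline interface.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_CLOSE_NORST
#define TCP_CLOSE_NORST 37	/* value proposed by this patch */
#endif

static void send_error_and_close(int fd, const char *resp, size_t len)
{
	int one = 1;
	int linger2 = 30;	/* bound FIN_WAIT2 sink time, in seconds */

	/* The patch reuses TCP_LINGER2 as the sink timeout. */
	setsockopt(fd, IPPROTO_TCP, TCP_LINGER2, &linger2, sizeof(linger2));

	write(fd, resp, len);	/* e.g. the HTTP error response */
	setsockopt(fd, IPPROTO_TCP, TCP_CLOSE_NORST, &one, sizeof(one));
	close(fd);		/* kernel acks and discards peer data */
}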
* Re: [PATCH RFC net-next 1/1] tcp: close socket without reset on incoming data

From: David Miller <davem@davemloft.net>
Date: 2018-05-19 21:00 UTC
To: dbanerje
Cc: netdev, kuznet, yoshfuji

From: Debabrata Banerjee <dbanerje@akamai.com>
Date: Fri, 18 May 2018 15:01:41 -0400

> When TCP_CLOSE_NORST is set before a close(), offload sinking of
> unwanted data to the kernel with low resource usage, with a timeout of
> TCP_LINGER2. The socket will transition to FIN_WAIT1 and then FIN_WAIT2,
> where it will ack data until either the timeout is hit or a RST or FIN
> is received.
>
> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>

This is a very serious protocol violation.

You're telling the remote end that you received the data even though the
socket was closed and nothing actually "sunk" the bytes.

This doesn't even go into the issues of sending cumulative ACKs in
response to data which is arbitrarily out-of-order.

The whole problem is that the POST data is sent before the client looks
to see if the server is willing to accept the POST data or not.

	A: I'd like to send you 200MB of crap
	   [ 200MB of craaaa...
	B: Sorry I won't be accepting that, please don't send it.
	   CLOSE, send reset since some of crap is queued up and
	   was never read
	A: aaaaapp... received RESET
	A: Why didn't B accept my 200MB of crap?

Sorry, you'll need to deal with this issue in another way.

Thanks.
* RE: [PATCH RFC net-next 1/1] tcp: close socket without reset on incoming data

From: Banerjee, Debabrata <dbanerje@akamai.com>
Date: 2018-05-19 22:57 UTC
To: 'David Miller'
Cc: netdev@vger.kernel.org, kuznet@ms2.inr.ac.ru, yoshfuji@linux-ipv6.org

> From: David Miller [mailto:davem@davemloft.net]
> From: Debabrata Banerjee <dbanerje@akamai.com>
>
> > When TCP_CLOSE_NORST is set before a close(), offload sinking of
> > unwanted data to the kernel with low resource usage, with a timeout of
> > TCP_LINGER2. The socket will transition to FIN_WAIT1 and then
> > FIN_WAIT2, where it will ack data until either the timeout is hit or a
> > RST or FIN is received.
>
> This is a very serious protocol violation.

Actually, I don't believe it violates the protocol. RFC 1122,
section 4.2.2.13, reads:

  "A host MAY implement a 'half-duplex' TCP close sequence, so that an
  application that has called CLOSE cannot continue to read data from
  the connection. If such a host issues a CLOSE call while received
  data is still pending in TCP, or if new data is received after CLOSE
  is called, its TCP SHOULD send a RST to show that data was lost."

The keyword SHOULD, per RFC 2119, means recommended but optional, as
opposed to MUST.

> You're telling the remote end that you received the data even though
> the socket was closed and nothing actually "sunk" the bytes.

It does the same thing the application would do, but with much less
overhead. The application called close() because it no longer cares
about new data, but it still expected send() prior to close() to
actually send.

> This doesn't even go into the issues of sending cumulative ACKs in
> response to data which is arbitrarily out-of-order.

Not sure what issues we'd run into here; out-of-order and duplicate
acks can happen normally.

> The whole problem is that the POST data is sent before the client
> looks to see if the server is willing to accept the POST data or not.
>
> 	A: I'd like to send you 200MB of crap
> 	   [ 200MB of craaaa...
> 	B: Sorry I won't be accepting that, please don't send it.
> 	   CLOSE, send reset since some of crap is queued up and
> 	   was never read
> 	A: aaaaapp... received RESET
> 	A: Why didn't B accept my 200MB of crap?

That's correct. But it's not just limited to POST; it could be any data
transfer over TCP sockets. Of course, in the 200MB POST case, that's a
lot of resources and copying to userspace.

> Sorry, you'll need to deal with this issue in another way.

Well, if the intersection with the definition of close() spooks you,
something similar could be implemented as a setsockopt(TCP_SINK_DATA)
around shutdown(), to instruct the socket to immediately dump data, but
with higher resource usage. However, as above, I don't currently
believe this patch violates the protocol.
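To make that alternative concrete, a hypothetical sketch follows.
TCP_SINK_DATA does not exist in any kernel; the option name, its value,
and the function below are all invented for illustration:

/* Hypothetical only: neither TCP_SINK_DATA nor its value exists in
 * any kernel; this just illustrates the shape of the alternative.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#define TCP_SINK_DATA 38	/* invented value for illustration */

static void shutdown_and_sink(int fd)
{
	int one = 1;

	/* Ask the kernel to discard incoming data from here on. */
	setsockopt(fd, IPPROTO_TCP, TCP_SINK_DATA, &one, sizeof(one));
	shutdown(fd, SHUT_WR);	/* send our FIN; later data is dumped */

	/* The fd stays open until the application calls close(), so
	 * this costs a full socket rather than a timewait socket.
	 */
	close(fd);
}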
* Re: [PATCH RFC net-next 1/1] tcp: close socket without reset on incoming data

From: David Miller <davem@davemloft.net>
Date: 2018-05-20 2:52 UTC
To: dbanerje
Cc: netdev, kuznet, yoshfuji

From: "Banerjee, Debabrata" <dbanerje@akamai.com>
Date: Sat, 19 May 2018 22:57:48 +0000

> It does the same thing the application would do, but with much less
> overhead. The application called close() because it no longer cares
> about new data, but it still expected send() prior to close() to
> actually send.

It's not the same. If you just sink the data in the protocol stack, the
sender has no way whatsoever to know that the application did not see
the data.

The sender must have a way to know that the application at the other
end received the data, whether it used it or not. And what breaks the
ambiguity is that reset. This is critically important.

> Well, if the intersection with the definition of close() spooks you,
> something similar could be implemented as a setsockopt(TCP_SINK_DATA)
> around shutdown(), to instruct the socket to immediately dump data,
> but with higher resource usage. However, as above, I don't currently
> believe this patch violates the protocol.

The SHOULD you quoted from RFC 1122 is explicitly listed in another RFC
as a recommended behavior. I know, because that text is what led me to
implement the current behavior.

Please see RFC 2525, section 2.16: "Failure to send a RST after Half
Duplex Close."