From mboxrd@z Thu Jan 1 00:00:00 1970 From: Alexei Starovoitov Subject: Re: [PATCH v2 net-next 7/7] tcp: make tcp_sendmsg() aware of socket backlog Date: Thu, 28 Apr 2016 21:43:15 -0700 Message-ID: <5722E663.8080304@fb.com> References: <1461899449-8096-1-git-send-email-edumazet@google.com> <1461899449-8096-8-git-send-email-edumazet@google.com> Mime-Version: 1.0 Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Cc: netdev , Soheil Hassas Yeganeh , Marcelo Ricardo Leitner , Eric Dumazet To: Eric Dumazet , "David S . Miller" Return-path: Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:30297 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750940AbcD2EnU (ORCPT ); Fri, 29 Apr 2016 00:43:20 -0400 In-Reply-To: <1461899449-8096-8-git-send-email-edumazet@google.com> Sender: netdev-owner@vger.kernel.org List-ID: On 4/28/16 8:10 PM, Eric Dumazet wrote: > Large sendmsg()/write() hold socket lock for the duration of the call, > unless sk->sk_sndbuf limit is hit. This is bad because incoming packets > are parked into socket backlog for a long time. > Critical decisions like fast retransmit might be delayed. > Receivers have to maintain a big out of order queue with additional cpu > overhead, and also possible stalls in TX once windows are full. > > Bidirectional flows are particularly hurt since the backlog can become > quite big if the copy from user space triggers IO (page faults) > > Some applications learnt to use sendmsg() (or sendmmsg()) with small > chunks to avoid this issue. > > Kernel should know better, right ? > > Add a generic sk_flush_backlog() helper and use it right > before a new skb is allocated. Typically we put 64KB of payload > per skb (unless MSG_EOR is requested) and checking socket backlog > every 64KB gives good results. > > As a matter of fact, tests with TSO/GSO disabled give very nice > results, as we manage to keep a small write queue and smaller > perceived rtt. > > Note that sk_flush_backlog() maintains socket ownership, > so is not equivalent to a {release_sock(sk); lock_sock(sk);}, > to ensure implicit atomicity rules that sendmsg() was > giving to (possibly buggy) applications. > > In this simple implementation, I chose to not call tcp_release_cb(), > but we might consider this later. > > Signed-off-by: Eric Dumazet > Cc: Soheil Hassas Yeganeh > Cc: Alexei Starovoitov > Cc: Marcelo Ricardo Leitner > --- > include/net/sock.h | 11 +++++++++++ > net/core/sock.c | 7 +++++++ > net/ipv4/tcp.c | 8 ++++++-- > 3 files changed, 24 insertions(+), 2 deletions(-) > > diff --git a/include/net/sock.h b/include/net/sock.h > index 3df778ccaa82..1dbb1f9f7c1b 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -926,6 +926,17 @@ void sk_stream_kill_queues(struct sock *sk); > void sk_set_memalloc(struct sock *sk); > void sk_clear_memalloc(struct sock *sk); > > +void __sk_flush_backlog(struct sock *sk); > + > +static inline bool sk_flush_backlog(struct sock *sk) > +{ > + if (unlikely(READ_ONCE(sk->sk_backlog.tail))) { > + __sk_flush_backlog(sk); > + return true; > + } > + return false; > +} > + > int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb); > > struct request_sock_ops; > diff --git a/net/core/sock.c b/net/core/sock.c > index 70744dbb6c3f..f615e9391170 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -2048,6 +2048,13 @@ static void __release_sock(struct sock *sk) > sk->sk_backlog.len = 0; > } > > +void __sk_flush_backlog(struct sock *sk) > +{ > + spin_lock_bh(&sk->sk_lock.slock); > + __release_sock(sk); > + spin_unlock_bh(&sk->sk_lock.slock); > +} > + > /** > * sk_wait_data - wait for data to arrive at sk_receive_queue > * @sk: sock to wait on > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 4787f86ae64c..b945c2b046c5 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -1136,11 +1136,12 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) > /* This should be in poll */ > sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); > > - mss_now = tcp_send_mss(sk, &size_goal, flags); > - > /* Ok commence sending. */ > copied = 0; > > +restart: > + mss_now = tcp_send_mss(sk, &size_goal, flags); > + > err = -EPIPE; > if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) > goto out_err; > @@ -1166,6 +1167,9 @@ new_segment: > if (!sk_stream_memory_free(sk)) > goto wait_for_sndbuf; > > + if (sk_flush_backlog(sk)) > + goto restart; I don't understand the logic completely, but isn't it safer to do 'goto wait_for_memory;' here if we happened to hit this in the middle of the loop? Also does it make sense to rename __release_sock to something like _ _ _sk_flush_backlog, since that's what it's doing and not doing any 'release' ? Ack for patches 2 and 6. Great improvement!