From: Alexei Starovoitov <ast@fb.com>
To: Eric Dumazet <edumazet@google.com>,
"David S . Miller" <davem@davemloft.net>
Cc: netdev <netdev@vger.kernel.org>,
Soheil Hassas Yeganeh <soheil@google.com>,
Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>,
Eric Dumazet <eric.dumazet@gmail.com>
Subject: Re: [PATCH v2 net-next 7/7] tcp: make tcp_sendmsg() aware of socket backlog
Date: Thu, 28 Apr 2016 21:43:15 -0700
Message-ID: <5722E663.8080304@fb.com>
In-Reply-To: <1461899449-8096-8-git-send-email-edumazet@google.com>
On 4/28/16 8:10 PM, Eric Dumazet wrote:
> Large sendmsg()/write() hold socket lock for the duration of the call,
> unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
> are parked into socket backlog for a long time.
> Critical decisions like fast retransmit might be delayed.
> Receivers have to maintain a big out of order queue with additional cpu
> overhead, and also possible stalls in TX once windows are full.
>
> Bidirectional flows are particularly hurt since the backlog can become
> quite big if the copy from user space triggers IO (page faults)
>
> Some applications learnt to use sendmsg() (or sendmmsg()) with small
> chunks to avoid this issue.
>
> Kernel should know better, right ?
>
> Add a generic sk_flush_backlog() helper and use it right
> before a new skb is allocated. Typically we put 64KB of payload
> per skb (unless MSG_EOR is requested) and checking socket backlog
> every 64KB gives good results.
>
> As a matter of fact, tests with TSO/GSO disabled give very nice
> results, as we manage to keep a small write queue and smaller
> perceived rtt.
>
> Note that sk_flush_backlog() maintains socket ownership,
> so is not equivalent to a {release_sock(sk); lock_sock(sk);},
> to ensure implicit atomicity rules that sendmsg() was
> giving to (possibly buggy) applications.
>
> In this simple implementation, I chose to not call tcp_release_cb(),
> but we might consider this later.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Alexei Starovoitov <ast@fb.com>
> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
> include/net/sock.h | 11 +++++++++++
> net/core/sock.c | 7 +++++++
> net/ipv4/tcp.c | 8 ++++++--
> 3 files changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 3df778ccaa82..1dbb1f9f7c1b 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -926,6 +926,17 @@ void sk_stream_kill_queues(struct sock *sk);
> void sk_set_memalloc(struct sock *sk);
> void sk_clear_memalloc(struct sock *sk);
>
> +void __sk_flush_backlog(struct sock *sk);
> +
> +static inline bool sk_flush_backlog(struct sock *sk)
> +{
> +	if (unlikely(READ_ONCE(sk->sk_backlog.tail))) {
> +		__sk_flush_backlog(sk);
> +		return true;
> +	}
> +	return false;
> +}
> +
> int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb);
>
> struct request_sock_ops;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 70744dbb6c3f..f615e9391170 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2048,6 +2048,13 @@ static void __release_sock(struct sock *sk)
> sk->sk_backlog.len = 0;
> }
>
> +void __sk_flush_backlog(struct sock *sk)
> +{
> +	spin_lock_bh(&sk->sk_lock.slock);
> +	__release_sock(sk);
> +	spin_unlock_bh(&sk->sk_lock.slock);
> +}
> +
> /**
> * sk_wait_data - wait for data to arrive at sk_receive_queue
> * @sk: sock to wait on
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4787f86ae64c..b945c2b046c5 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1136,11 +1136,12 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>  	/* This should be in poll */
>  	sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
>
> -	mss_now = tcp_send_mss(sk, &size_goal, flags);
> -
>  	/* Ok commence sending. */
>  	copied = 0;
>
> +restart:
> +	mss_now = tcp_send_mss(sk, &size_goal, flags);
> +
>  	err = -EPIPE;
>  	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
>  		goto out_err;
> @@ -1166,6 +1167,9 @@ new_segment:
>  			if (!sk_stream_memory_free(sk))
>  				goto wait_for_sndbuf;
>
> +			if (sk_flush_backlog(sk))
> +				goto restart;

I don't understand the logic completely, but isn't it
safer to do 'goto wait_for_memory;' here if we happened
to hit this in the middle of the loop?

Also does it make sense to rename __release_sock to
something like ___sk_flush_backlog, since that's
what it's doing and not doing any 'release'?

Ack for patches 2 and 6. Great improvement!
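
For readers following the thread, the "flush, then restart" control flow
in the quoted hunk can be modelled in plain userspace C. This is only a
sketch under stated assumptions, not kernel code: struct toy_sock,
toy_flush_backlog() and toy_sendmsg() are hypothetical names, the backlog
is a simple counter, and "packets arriving" is simulated by adding a fixed
number per chunk copied. It shows why draining the backlog before each new
chunk keeps its depth bounded, and why the sender re-reads its state after
a drain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a socket; NOT the kernel's struct sock. */
struct toy_sock {
	int backlog;      /* packets parked while the sender owns the socket */
	int processed;    /* packets drained from the backlog so far */
	int max_backlog;  /* deepest backlog observed */
	int state_reads;  /* times the sender (re)read its send state */
};

/* Hypothetical analogue of sk_flush_backlog(): drain only if non-empty. */
static bool toy_flush_backlog(struct toy_sock *sk)
{
	if (sk->backlog == 0)
		return false;
	sk->processed += sk->backlog;
	sk->backlog = 0;
	return true;
}

/*
 * Copy `size` bytes in pieces of at most `chunk` bytes. `arrivals`
 * packets are assumed to land in the backlog during each copy.
 */
static size_t toy_sendmsg(struct toy_sock *sk, size_t size, size_t chunk,
			  int arrivals)
{
	size_t copied = 0;

	assert(chunk > 0);

restart:
	/* Stands in for recomputing mss_now/size_goal after a drain. */
	sk->state_reads++;

	while (copied < size) {
		size_t n;

		/* Drained packets may have changed state: start over. */
		if (toy_flush_backlog(sk))
			goto restart;

		n = size - copied < chunk ? size - copied : chunk;
		copied += n;

		sk->backlog += arrivals;
		if (sk->backlog > sk->max_backlog)
			sk->max_backlog = sk->backlog;
	}
	return copied;
}
```

Calling toy_sendmsg() on a zeroed struct toy_sock with size 256 KB, chunk
64 KB and two arrivals per chunk leaves max_backlog at 2. Without the
flush the same call would end with eight packets parked. The packets that
arrive during the last chunk stay in the backlog, which in this model
corresponds to what the final release of the socket would process.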
Thread overview: 22+ messages
2016-04-29 3:10 [PATCH v2 net-next 0/7] net: make TCP preemptible Eric Dumazet
2016-04-29 3:10 ` [PATCH v2 net-next 1/7] tcp: do not assume TCP code is non preemptible Eric Dumazet
2016-04-29 13:18 ` Soheil Hassas Yeganeh
2016-04-29 14:37 ` Eric Dumazet
2016-04-29 14:41 ` Soheil Hassas Yeganeh
2016-04-29 3:10 ` [PATCH v2 net-next 2/7] tcp: do not block bh during prequeue processing Eric Dumazet
2016-04-29 13:20 ` Soheil Hassas Yeganeh
2016-04-29 3:10 ` [PATCH v2 net-next 3/7] dccp: do not assume DCCP code is non preemptible Eric Dumazet
2016-04-29 13:21 ` Soheil Hassas Yeganeh
2016-04-29 3:10 ` [PATCH v2 net-next 4/7] udp: prepare for non BH masking at backlog processing Eric Dumazet
2016-04-29 13:23 ` Soheil Hassas Yeganeh
2016-04-29 3:10 ` [PATCH v2 net-next 5/7] sctp: prepare for socket backlog behavior change Eric Dumazet
2016-04-29 3:10 ` [PATCH v2 net-next 6/7] net: do not block BH while processing socket backlog Eric Dumazet
2016-04-29 13:37 ` Soheil Hassas Yeganeh
2016-04-29 3:10 ` [PATCH v2 net-next 7/7] tcp: make tcp_sendmsg() aware of " Eric Dumazet
2016-04-29 4:43 ` Alexei Starovoitov [this message]
2016-04-29 5:05 ` Eric Dumazet
2016-04-29 5:19 ` Alexei Starovoitov
2016-04-29 13:13 ` Soheil Hassas Yeganeh
2016-04-29 20:39 ` [PATCH v2 net-next 0/7] net: make TCP preemptible David Miller
2016-04-29 20:53 ` Eric Dumazet
2016-04-30 9:57 ` Julian Anastasov