From: Alexei Starovoitov <ast@fb.com>
To: Eric Dumazet <edumazet@google.com>,
	"David S . Miller" <davem@davemloft.net>
Cc: netdev <netdev@vger.kernel.org>,
	Soheil Hassas Yeganeh <soheil@google.com>,
	Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>,
	Eric Dumazet <eric.dumazet@gmail.com>
Subject: Re: [PATCH v2 net-next 7/7] tcp: make tcp_sendmsg() aware of socket backlog
Date: Thu, 28 Apr 2016 21:43:15 -0700
Message-ID: <5722E663.8080304@fb.com>
In-Reply-To: <1461899449-8096-8-git-send-email-edumazet@google.com>

On 4/28/16 8:10 PM, Eric Dumazet wrote:
> Large sendmsg()/write() hold socket lock for the duration of the call,
> unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
> are parked into socket backlog for a long time.
> Critical decisions like fast retransmit might be delayed.
> Receivers have to maintain a big out of order queue with additional cpu
> overhead, and also possible stalls in TX once windows are full.
>
> Bidirectional flows are particularly hurt since the backlog can become
> quite big if the copy from user space triggers IO (page faults)
>
> Some applications learnt to use sendmsg() (or sendmmsg()) with small
> chunks to avoid this issue.
>
> Kernel should know better, right ?
>
> Add a generic sk_flush_backlog() helper and use it right
> before a new skb is allocated. Typically we put 64KB of payload
> per skb (unless MSG_EOR is requested) and checking socket backlog
> every 64KB gives good results.
>
> As a matter of fact, tests with TSO/GSO disabled give very nice
> results, as we manage to keep a small write queue and smaller
> perceived rtt.
>
> Note that sk_flush_backlog() maintains socket ownership,
> so is not equivalent to a {release_sock(sk); lock_sock(sk);},
> to ensure implicit atomicity rules that sendmsg() was
> giving to (possibly buggy) applications.
>
> In this simple implementation, I chose to not call tcp_release_cb(),
> but we might consider this later.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Alexei Starovoitov <ast@fb.com>
> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>   include/net/sock.h | 11 +++++++++++
>   net/core/sock.c    |  7 +++++++
>   net/ipv4/tcp.c     |  8 ++++++--
>   3 files changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 3df778ccaa82..1dbb1f9f7c1b 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -926,6 +926,17 @@ void sk_stream_kill_queues(struct sock *sk);
>   void sk_set_memalloc(struct sock *sk);
>   void sk_clear_memalloc(struct sock *sk);
>
> +void __sk_flush_backlog(struct sock *sk);
> +
> +static inline bool sk_flush_backlog(struct sock *sk)
> +{
> +	if (unlikely(READ_ONCE(sk->sk_backlog.tail))) {
> +		__sk_flush_backlog(sk);
> +		return true;
> +	}
> +	return false;
> +}
> +
>   int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb);
>
>   struct request_sock_ops;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 70744dbb6c3f..f615e9391170 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2048,6 +2048,13 @@ static void __release_sock(struct sock *sk)
>   	sk->sk_backlog.len = 0;
>   }
>
> +void __sk_flush_backlog(struct sock *sk)
> +{
> +	spin_lock_bh(&sk->sk_lock.slock);
> +	__release_sock(sk);
> +	spin_unlock_bh(&sk->sk_lock.slock);
> +}
> +
>   /**
>    * sk_wait_data - wait for data to arrive at sk_receive_queue
>    * @sk:    sock to wait on
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4787f86ae64c..b945c2b046c5 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1136,11 +1136,12 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>   	/* This should be in poll */
>   	sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
>
> -	mss_now = tcp_send_mss(sk, &size_goal, flags);
> -
>   	/* Ok commence sending. */
>   	copied = 0;
>
> +restart:
> +	mss_now = tcp_send_mss(sk, &size_goal, flags);
> +
>   	err = -EPIPE;
>   	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
>   		goto out_err;
> @@ -1166,6 +1167,9 @@ new_segment:
>   			if (!sk_stream_memory_free(sk))
>   				goto wait_for_sndbuf;
>
> +			if (sk_flush_backlog(sk))
> +				goto restart;

I don't understand the logic completely, but isn't it
safer to do 'goto wait_for_memory;' here if we happened
to hit this in the middle of the loop?
Also, does it make sense to rename __release_sock to
something like ___sk_flush_backlog, since that's
what it's doing, and it isn't doing any 'release'?
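
For readers following the question above, here is a minimal single-threaded
userspace model of the pattern the quoted patch describes: packets that
arrive while the socket is owned are parked on a backlog list, and the owner
drains that list at a convenient point without giving up ownership. All names
(model_sock, model_pkt, model_backlog_add, model_flush_backlog) are
hypothetical and are not kernel APIs. The real helper takes
sk->sk_lock.slock around __release_sock(); this sketch leaves the locking out
because nothing here runs concurrently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's types. */
struct model_pkt {
    struct model_pkt *next;
    int seq;
};

struct model_sock {
    bool owned;              /* stands in for sk->sk_lock.owned */
    struct model_pkt *head;  /* stands in for sk->sk_backlog.head */
    struct model_pkt *tail;  /* stands in for sk->sk_backlog.tail */
    int processed;           /* number of backlog packets handled so far */
    int last_seq;            /* seq of the most recently handled packet */
};

/* Receive side: the socket is owned, so park the packet on the backlog. */
static void model_backlog_add(struct model_sock *sk, struct model_pkt *p)
{
    p->next = NULL;
    if (sk->tail)
        sk->tail->next = p;
    else
        sk->head = p;
    sk->tail = p;
}

/*
 * Owner side: drain the backlog while keeping ownership.
 * Returns true if anything was processed, which is the signal the
 * caller in the patch uses to re-evaluate its state.
 */
static bool model_flush_backlog(struct model_sock *sk)
{
    assert(sk->owned);       /* only the owner may flush */
    if (!sk->tail)
        return false;        /* fast path: nothing parked */

    while (sk->head) {
        struct model_pkt *p = sk->head;

        sk->head = p->next;
        sk->processed++;
        sk->last_seq = p->seq;
    }
    sk->tail = NULL;
    return true;
}
```

The boolean return mirrors the quoted sk_flush_backlog(): a true result means
incoming packets were processed, so values computed earlier by the sender may
be stale and are worth recomputing.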

Ack for patches 2 and 6. Great improvement!
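
As context for the quoted commit message, which notes that some applications
split large writes into small chunks to avoid holding the socket lock for a
long time, a userspace sketch of that workaround might look like the
following. The helper name send_in_chunks and the chunk parameter are
assumptions made for illustration; the thread does not show any application
code or recommend a particular chunk size.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: send len bytes from buf on a blocking socket fd,
 * passing at most chunk bytes to each send() call so that no single
 * call covers the whole buffer.
 * Returns the number of bytes sent, or -1 if nothing could be sent.
 */
static ssize_t send_in_chunks(int fd, const void *buf, size_t len, size_t chunk)
{
    const char *p = buf;
    size_t sent = 0;

    assert(chunk > 0);
    while (sent < len) {
        size_t want = len - sent < chunk ? len - sent : chunk;
        ssize_t n = send(fd, p + sent, want, 0);

        if (n < 0) {
            if (errno == EINTR)
                continue;    /* interrupted before sending; retry */
            return sent ? (ssize_t)sent : -1;
        }
        sent += (size_t)n;   /* short writes are handled by looping */
    }
    return (ssize_t)sent;
}
```

With the change proposed in the quoted patch, the kernel checks the backlog
itself between skb allocations, which is the motivation given there for not
leaving this to applications.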

Thread overview: 22+ messages
2016-04-29  3:10 [PATCH v2 net-next 0/7] net: make TCP preemptible Eric Dumazet
2016-04-29  3:10 ` [PATCH v2 net-next 1/7] tcp: do not assume TCP code is non preemptible Eric Dumazet
2016-04-29 13:18   ` Soheil Hassas Yeganeh
2016-04-29 14:37     ` Eric Dumazet
2016-04-29 14:41       ` Soheil Hassas Yeganeh
2016-04-29  3:10 ` [PATCH v2 net-next 2/7] tcp: do not block bh during prequeue processing Eric Dumazet
2016-04-29 13:20   ` Soheil Hassas Yeganeh
2016-04-29  3:10 ` [PATCH v2 net-next 3/7] dccp: do not assume DCCP code is non preemptible Eric Dumazet
2016-04-29 13:21   ` Soheil Hassas Yeganeh
2016-04-29  3:10 ` [PATCH v2 net-next 4/7] udp: prepare for non BH masking at backlog processing Eric Dumazet
2016-04-29 13:23   ` Soheil Hassas Yeganeh
2016-04-29  3:10 ` [PATCH v2 net-next 5/7] sctp: prepare for socket backlog behavior change Eric Dumazet
2016-04-29  3:10 ` [PATCH v2 net-next 6/7] net: do not block BH while processing socket backlog Eric Dumazet
2016-04-29 13:37   ` Soheil Hassas Yeganeh
2016-04-29  3:10 ` [PATCH v2 net-next 7/7] tcp: make tcp_sendmsg() aware of " Eric Dumazet
2016-04-29  4:43   ` Alexei Starovoitov [this message]
2016-04-29  5:05     ` Eric Dumazet
2016-04-29  5:19       ` Alexei Starovoitov
2016-04-29 13:13   ` Soheil Hassas Yeganeh
2016-04-29 20:39 ` [PATCH v2 net-next 0/7] net: make TCP preemptible David Miller
2016-04-29 20:53   ` Eric Dumazet
2016-04-30  9:57     ` Julian Anastasov
