From: Eric Dumazet <eric.dumazet@gmail.com>
To: Kieran Mansley <kmansley@solarflare.com>
Cc: netdev@vger.kernel.org
Subject: Re: TCPBacklogDrops during aggressive bursts of traffic
Date: Tue, 15 May 2012 16:56:16 +0200
Message-ID: <1337093776.8512.1089.camel@edumazet-glaptop>
In-Reply-To: <1337092718.1689.45.camel@kjm-desktop.uk.level5networks.com>

On Tue, 2012-05-15 at 15:38 +0100, Kieran Mansley wrote:
> I've been investigating an issue with TCPBacklogDrops being reported
> (and relatively poor performance as a result). The problem is most
> easily observed on slightly older kernels (e.g. 3.0.13) but is still
> present in 3.3.6, although harder to reproduce. I've also seen it in
> 2.6 series kernels, so it's not a recent issue.
>
> The problem occurs at the receiver when a TCP sender with a large
> congestion window is sending at a high rate and the receiving
> application has blocked in a recv() or similar call. During the stream,
> ACKs are being returned to the sender, keeping the receive window open
> and so allowing it to carry on sending. The local socket receive buffer
> gets dynamically increased, and the advertised receive window increases
> similarly.
>
> [As an aside, it appears as though the total bytes that the receiver
> commits to receiving - i.e. the point at which it stops advertising new
> sequence space - is around double the receive socket buffer. I'm
> guessing it is committing to receiving the current socket buffer
> (perhaps as there is a pending recv() it knows it will be able to
> immediately empty this) and the next one, but I've not looked into this
> in detail]
>
> As the socket buffer is approaching full the kernel decides to satisfy
> the recv() call and wake the application. It will have to copy the data
> to application address space etc. At this point there is a switch in
> tcp_v4_rcv():
>
> http://lxr.linux.no/#linux+v3.3.6/net/ipv4/tcp_ipv4.c#L1726
>
> Before this point, the "if (!sock_owned_by_user(sk))" test will evaluate
> to true, but once it has decided to wake the application I think it will
> evaluate to false and it will drop through to:
>
>     else if (unlikely(sk_add_backlog(sk, skb))) {
>         bh_unlock_sock(sk);
>         NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
>         goto discard_and_relse;
>     }
>
> In sk_add_backlog() there is a test to see if the socket's receive
> buffer is full, and if it is, the kernel drops the packets, reporting
> them through netstat as TCPBacklogDrop. This is despite there being
> potentially megabytes of unused advertised receive window space at this
> point.
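
The accept-or-drop decision described above can be sketched as a small
user-space model. This is only an illustration under stated assumptions, not
the kernel code: the names model_sock, model_rcv and the verdict enum are
hypothetical, the direct path simply accounts the bytes, and the "full" test
assumes that receive-queue bytes and backlog bytes are counted together
against the receive buffer size.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the receive path; not kernel code. */
struct model_sock {
    unsigned int rcvbuf;         /* receive buffer limit, in bytes */
    unsigned int rmem_alloc;     /* bytes held in the receive queue */
    unsigned int backlog_len;    /* bytes held on the backlog */
    bool owned_by_user;          /* true while the application holds the socket */
    unsigned long backlog_drops; /* models the TCPBacklogDrop counter */
};

enum verdict { PROCESSED, BACKLOGGED, DROPPED };

/* Decide what happens to one arriving packet of 'truesize' bytes. */
static enum verdict model_rcv(struct model_sock *sk, unsigned int truesize)
{
    assert(truesize > 0);

    if (!sk->owned_by_user) {
        /* Socket is free: the packet is handled immediately (simplified). */
        sk->rmem_alloc += truesize;
        return PROCESSED;
    }

    /* Socket is busy: the packet may only go on the backlog, and only if
     * the receive queue plus backlog still fit in the receive buffer
     * (assumed form of the test). */
    if (sk->rmem_alloc + sk->backlog_len + truesize > sk->rcvbuf) {
        sk->backlog_drops++;
        return DROPPED;
    }

    sk->backlog_len += truesize;
    return BACKLOGGED;
}
```

In this model, a socket whose receive queue is already nearly full when the
application takes ownership can backlog only a few packets before every
further arrival is dropped, regardless of how much window was advertised.
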
>
> Very shortly afterwards the socket buffer will be empty again (as its
> contents will have been transferred to the user) so this is essentially
> a race and depends on a fast sender to demonstrate it. It shows up as
> an acute period of drops that are quickly retransmitted and then
> accepted.
>
> There are two ways of thinking about this problem: either the receiver
> should be more conservative about the receive window it advertises
> (limiting it to the available receive socket buffer size); or the
> receiver should be more generous with what it will accept on to the
> backlog (matching it to the advertised receive window). It is the
> discrepancy between advertised receive window and what can be put on the
> backlog that is the root of the problem. I would be tempted by the
> latter and say that as the backlog is likely to soon make it into the
> receive buffer, it should be allowed to contain a full receive buffer of
> bytes on top of what is currently being removed from the receive buffer
> into the application.
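
To make the second option concrete, here is a minimal sketch comparing two
backlog limits. It is one possible reading of the suggestion, not an existing
kernel interface: rx_state, backlog_full_shared and backlog_full_separate are
hypothetical names, and the "shared" variant assumes the current test counts
receive-queue and backlog bytes against a single receive buffer budget.

```c
#include <stdbool.h>

/* Hypothetical snapshot of receive-side accounting, all values in bytes. */
struct rx_state {
    unsigned int rcvbuf;      /* receive buffer limit */
    unsigned int rmem_alloc;  /* bytes in the receive queue */
    unsigned int backlog_len; /* bytes on the backlog */
};

/* Assumed current behaviour: receive queue and backlog share one budget. */
static bool backlog_full_shared(const struct rx_state *s, unsigned int truesize)
{
    return s->rmem_alloc + s->backlog_len + truesize > s->rcvbuf;
}

/* Suggested behaviour: the backlog may hold a full receive buffer of bytes
 * on top of whatever is being drained from the receive queue. */
static bool backlog_full_separate(const struct rx_state *s, unsigned int truesize)
{
    return s->backlog_len + truesize > s->rcvbuf;
}
```

With a nearly full receive queue, the shared test rejects a packet that the
separate test would still accept, which is the discrepancy between the
advertised window and the backlog capacity described above.
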
>
> It is harder to reproduce on recent kernels because the pending recv()
> call gets satisfied very close to the start of a burst, and at this time
> the receive buffer will be mostly empty and so it is less likely that
> any packets in flight will overflow the backlog. On earlier kernels it
> is easier to reproduce because the pending recv() call didn't return
> until the socket's receive buffer was nearly full, and so it would only
> take a few extra packets to overflow the backlog.
>
> I have a packet capture to illustrate the problem (taken on 3.0.13) if
> that would be of help. As I can easily reproduce it I'm also happy to
> make changes and test to see if they improve matters.

Please try the latest kernels; this is probably 'fixed'.

What network driver are you using?