public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Kieran Mansley <kmansley@solarflare.com>
Cc: netdev@vger.kernel.org
Subject: Re: TCPBacklogDrops during aggressive bursts of traffic
Date: Tue, 15 May 2012 16:56:16 +0200	[thread overview]
Message-ID: <1337093776.8512.1089.camel@edumazet-glaptop> (raw)
In-Reply-To: <1337092718.1689.45.camel@kjm-desktop.uk.level5networks.com>

On Tue, 2012-05-15 at 15:38 +0100, Kieran Mansley wrote:
> I've been investigating an issue with TCPBacklogDrops being reported
> (and relatively poor performance as a result).  The problem is most
> easily observed on slightly older kernels (e.g 3.0.13) but is still
> present in 3.3.6, although harder to reproduce.  I've also seen it in
> 2.6 series kernels, so it's not a recent issue.
> 
> The problem occurs at the receiver when a TCP sender with a large
> congestion window is sending at a high rate and the receiving
> application has blocked in a recv() or similar call.  During the stream
> ACKs are being returned to the sender keeping the receive window open
> and so allowing it to carry on sending.  The local socket receive buffer
> gets dynamically increased, and the advertised receive window increases
> similarly.
> 
> [As an aside, it appears as though the total bytes that the receiver
> commits to receiving - i.e. the point at which it stops advertising new
> sequence space - is around double the receive socket buffer.  I'm
> guessing it is committing to receiving the current socket buffer
> (perhaps as there is a pending recv() it knows it will be able to
> immediately empty this) and the next one, but I've not looked into this
> in detail]
> 
> As the socket buffer is approaching full the kernel decides to satisfy
> the recv() call and wake the application.  It will have to copy the data
> to application address space etc.  At this point there is a switch in
> tcp_v4_rcv():
> 
> http://lxr.linux.no/#linux+v3.3.6/net/ipv4/tcp_ipv4.c#L1726
> 
> Before this point, the "if (!sock_owned_by_user(sk)) " will evaluate to
> true, but once it has decided to wake the application I think it will
> evaluate to false and it will drop through to:
> 
> 1739        else if (unlikely(sk_add_backlog(sk, skb))) {
> 1740                bh_unlock_sock(sk);
> 1741                NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
> 1742                goto discard_and_relse;
> 1743        }
> 
> In sk_add_backlog() there is a test to see if the socket's receive
> buffer is full, and if there is the kernel drops the packets, reporting
> them through netstat as TCPBacklogDrop.  This is despite there being
> potentially megabytes of unused advertised receive window space at this
> point.
> 
> Very shortly afterwards the socket buffer will be empty again (as its
> contents will have been transferred to the user) so this is essentially
> a race and depends on a fast sender to demonstrate it.  It shows up as a
> acute period of drops that are quickly retransmitted and then
> accepted.  
> 
> There are two ways of thinking about this problem: either the receiver
> should be more conservative about the receive window it advertises
> (limiting it to the available receive socket buffer size); or the
> receiver should be more generous with what it will accept on to the
> backlog (matching it to the advertised receive window).  It is the
> discrepancy between advertised receive window and what can be put on the
> backlog that is the root of the problem.  I would be tempted by the
> latter and say that as the backlog is likely to soon make it into the
> receive buffer, it should be allowed to contain a full receive buffer of
> bytes on top of what is currently being removed from the receive buffer
> into the application.
> 
> It is harder to reproduce on recent kernels because the pending recv()
> call gets satisfied very close to the start of a burst, and at this time
> the receive buffer will be mostly empty and so it is less likely that
> any packets in flight will overflow the backlog.  On earlier kernels it
> is easier to reproduce because the pending recv() call didn't return
> until the socket's receive buffer was nearly full, and so it would only
> take a few extra packets to overflow the backlog.
> 
> I have a packet capture to illustrate the problem (taken on 3.0.13) if
> that would be of help.  As I can easily reproduce it I'm also happy to
> make changes and test to see if they improve matters.


Please try latest kernels, this is probably 'fixed'

What network driver are you using ?

  reply	other threads:[~2012-05-15 14:56 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-05-15 14:38 TCPBacklogDrops during aggressive bursts of traffic Kieran Mansley
2012-05-15 14:56 ` Eric Dumazet [this message]
2012-05-15 15:00   ` Eric Dumazet
2012-05-15 16:29   ` Kieran Mansley
2012-05-15 16:34     ` Eric Dumazet
2012-05-15 16:47       ` Ben Hutchings
2012-05-15 17:01         ` Eric Dumazet
2012-05-15 17:23           ` Eric Dumazet
2012-05-17 16:31           ` Kieran Mansley
2012-05-17 16:37             ` Eric Dumazet
2012-05-18 15:45               ` Kieran Mansley
2012-05-18 15:49                 ` Eric Dumazet
2012-05-18 15:53                   ` Kieran Mansley
2012-05-18 18:40                 ` Eric Dumazet
2012-05-22  8:20               ` Kieran Mansley
2012-05-22  9:25                 ` Eric Dumazet
2012-05-22  9:30                   ` Eric Dumazet
2012-05-22 15:09                     ` Kieran Mansley
2012-05-22 16:12                       ` Eric Dumazet
2012-05-22 16:32                         ` Kieran Mansley
2012-05-22 16:45                           ` Eric Dumazet
2012-05-22 20:54                             ` Eric Dumazet
2012-05-23  9:44                               ` Eric Dumazet
2012-05-23 12:09                                 ` Eric Dumazet
2012-05-23 16:04                                   ` Alexander Duyck
2012-05-23 16:12                                     ` Eric Dumazet
2012-05-23 16:39                                       ` Eric Dumazet
2012-05-23 17:10                                         ` Alexander Duyck
2012-05-23 21:19                                           ` Alexander Duyck
2012-05-23 21:37                                             ` Eric Dumazet
2012-05-23 22:03                                               ` Alexander Duyck
2012-05-23 16:58                                       ` Alexander Duyck
2012-05-23 17:24                                         ` Eric Dumazet
2012-05-23 17:57                                           ` Alexander Duyck
2012-05-23 17:34                                 ` David Miller
2012-05-23 17:46                                   ` Eric Dumazet
2012-05-23 17:57                                     ` David Miller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1337093776.8512.1089.camel@edumazet-glaptop \
    --to=eric.dumazet@gmail.com \
    --cc=kmansley@solarflare.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox