From: Kieran Mansley
Subject: TCPBacklogDrops during aggressive bursts of traffic
Date: Tue, 15 May 2012 15:38:34 +0100
Message-ID: <1337092718.1689.45.camel@kjm-desktop.uk.level5networks.com>
List-ID: netdev-owner@vger.kernel.org

I've been investigating an issue with TCPBacklogDrops being reported (and relatively poor performance as a result). The problem is most easily observed on slightly older kernels (e.g. 3.0.13) but is still present in 3.3.6, although harder to reproduce. I've also seen it in 2.6 series kernels, so it's not a recent issue.

The problem occurs at the receiver when a TCP sender with a large congestion window is sending at a high rate and the receiving application has blocked in a recv() or similar call. During the stream, ACKs are being returned to the sender, keeping the receive window open and so allowing it to carry on sending. The local socket receive buffer gets dynamically increased, and the advertised receive window increases similarly.

[As an aside, it appears as though the total bytes the receiver commits to receiving - i.e. the point at which it stops advertising new sequence space - is around double the receive socket buffer. I'm guessing it is committing to receiving the current socket buffer (perhaps because there is a pending recv() it knows it will be able to empty immediately) plus the next one, but I've not looked into this in detail.]

As the socket buffer approaches full, the kernel decides to satisfy the recv() call and wake the application; it will then have to copy the data to application address space, etc.
At this point there is a switch in tcp_v4_rcv(): http://lxr.linux.no/#linux+v3.3.6/net/ipv4/tcp_ipv4.c#L1726

Before this point, the "if (!sock_owned_by_user(sk))" test will evaluate to true, but once the kernel has decided to wake the application I think it will evaluate to false and control will drop through to:

1739         else if (unlikely(sk_add_backlog(sk, skb))) {
1740                 bh_unlock_sock(sk);
1741                 NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
1742                 goto discard_and_relse;
1743         }

In sk_add_backlog() there is a test to see if the socket's receive buffer is full; if it is, the kernel drops the packet, reporting it through netstat as TCPBacklogDrop. This is despite there potentially being megabytes of unused advertised receive window space at this point. Very shortly afterwards the socket buffer will be empty again (as its contents will have been transferred to the user), so this is essentially a race and depends on a fast sender to demonstrate it. It shows up as an acute period of drops that are quickly retransmitted and then accepted.

There are two ways of thinking about this problem: either the receiver should be more conservative about the receive window it advertises (limiting it to the available receive socket buffer size), or the receiver should be more generous about what it will accept onto the backlog (matching it to the advertised receive window). It is the discrepancy between the advertised receive window and what can be put on the backlog that is the root of the problem. I would be tempted by the latter: as the backlog is likely to soon make it into the receive buffer, it should be allowed to contain a full receive buffer of bytes on top of what is currently being removed from the receive buffer into the application.
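To make the discrepancy concrete, here is a userspace model (my own simplified sketch, not kernel code - the field names are modelled loosely on the kernel's) of the limit sk_add_backlog() enforces, alongside the relaxed limit suggested above:

```c
#include <assert.h>

/* Simplified model of the socket state relevant to the backlog check. */
struct sock_model {
    unsigned rmem_alloc;   /* bytes charged to the receive queue */
    unsigned backlog_len;  /* bytes already queued on the backlog */
    unsigned rcvbuf;       /* the socket receive buffer limit */
    unsigned adv_window;   /* window advertised (~2x rcvbuf, per the aside) */
};

/* Roughly what sk_add_backlog() checks in 3.3: the receive queue plus the
 * backlog is bounded by the receive buffer alone.
 * Returns 1 for drop (counted as TCPBacklogDrop), 0 for queued. */
int backlog_drops(const struct sock_model *sk, unsigned truesize)
{
    return sk->rmem_alloc + sk->backlog_len + truesize > sk->rcvbuf;
}

/* The suggested alternative: bound the backlog by what was actually
 * advertised, since the receive queue is about to be drained into the
 * application anyway. */
int backlog_drops_relaxed(const struct sock_model *sk, unsigned truesize)
{
    return sk->rmem_alloc + sk->backlog_len + truesize > sk->adv_window;
}
```

With the receive queue nearly full at wakeup time (say 60000 of a 65536-byte rcvbuf used, 8000 bytes already backlogged, a 131072-byte advertised window), the current check drops the next segment while the relaxed check accepts it - which is exactly the race described.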
It is harder to reproduce on recent kernels because the pending recv() call gets satisfied very close to the start of a burst, at which point the receive buffer is mostly empty, so it is less likely that the packets in flight will overflow the backlog. On earlier kernels it is easier to reproduce because the pending recv() call didn't return until the socket's receive buffer was nearly full, so it would only take a few extra packets to overflow the backlog.

I have a packet capture illustrating the problem (taken on 3.0.13) if that would be of help. As I can easily reproduce it, I'm also happy to make changes and test whether they improve matters.

Thanks,

Kieran