netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alexander Duyck <alexander.h.duyck@intel.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Kieran Mansley <kmansley@solarflare.com>,
	Jeff Kirsher <jeffrey.t.kirsher@intel.com>,
	Ben Hutchings <bhutchings@solarflare.com>,
	netdev@vger.kernel.org
Subject: Re: TCPBacklogDrops during aggressive bursts of traffic
Date: Wed, 23 May 2012 09:58:40 -0700	[thread overview]
Message-ID: <4FBD1740.1020304@intel.com> (raw)
In-Reply-To: <1337789530.3361.2992.camel@edumazet-glaptop>

On 05/23/2012 09:12 AM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 09:04 -0700, Alexander Duyck wrote:
>> On 05/23/2012 05:09 AM, Eric Dumazet wrote:
>>> On Wed, 2012-05-23 at 11:44 +0200, Eric Dumazet wrote:
>>>
>>>> I believe that as soon as ixgbe can use build_skb() and avoid the 1024
>>>> bytes overhead per skb, it should go away.
>>> Here is the patch for ixgbe, for reference.
>> I'm confused as to what this is trying to accomplish.
>>
>> Currently the way the ixgbe driver works is that we allocate the
>> skb->head using netdev_alloc_skb, which after your recent changes should
>> be using a head frag.  If the buffer is less than 256 bytes we have
>> pushed the entire buffer into the head frag, and if it is more we only
>> pull everything up to the end of the TCP header.  In either case if we
>> are merging TCP flows we should be able to drop one page or the other
>> along with the sk_buff giving us a total truesize addition after merge
>> of ~1K for less than 256 bytes or 2K for a full sized frame.
>>
>> I'll try to take a look at this today as it is in our interest to have
>> TCP performing as well as possible on ixgbe.
> With current driver, a MTU=1500 frame uses :
>
> sk_buff (256 bytes)
> skb->head : 1024 bytes  (or more exaclty now : 512 + 384)
> one fragment of 2048 bytes
>
> At skb free time,  one kfree(sk_buff) and two put_page().
>
> After this patch :
>
> sk_buff (256 bytes)
> skb->head : 2048 bytes 
>
> At skb free time, one kfree(sk_buff) and only one put_page().
>
> Note that my patch doesnt change the 256 bytes threshold: Small frames
> wont have one fragment and their use is :
>
> sk_buff (256 bytes)
> skb->head : 512 + 384 bytes 
>
>
Right, but the problem is that in order to make this work the we are
dropping the padding for head and hoping to have room for shared info. 
This is going to kill performance for things like routing workloads
since the entire head is going to have to be copied over to make space
for NET_SKB_PAD.  Also this assumes no RSC being enabled.  RSC is
normally enabled by default.  If it is turned on we are going to start
receiving full 2K buffers which will cause even more issues since there
wouldn't be any room for shared info in the 2K frame.

The way the driver is currently written probably provides the optimal
setup for truesize given the circumstances.  In order to support
receiving at least 1 full 1500 byte frame per fragment, and supporting
RSC I have to support receiving up to 2K of data.  If we try to make it
all part of one paged receive we would then have to either reduce the
receive buffer size to 1K in hardware and span multiple fragments for a
1.5K frame or allocate a 3K buffer so we would have room to add
NET_SKB_PAD and the shared info on the end.  At which point we are back
to the extra 1K again, only in that case we cannot trim it off later via
skb_try_coalesce.  In the 3K buffer case we would be over a 1/2 page
which means we can only get one buffer per page instead of 2 in which
case we might as well just round it up to 4K and be honest.

The reason I am confused is that I thought the skb_try_coalesce function
was supposed to be what addressed these types of issues.  If these
packets go through that function they should be stripping the sk_buff
and possibly even the skb->head if we used the fragment since the only
thing that is going to end up in the head would be the TCP header which
should have been pulled prior to trying to coalesce.

I will need to investigate this further to understand what is going on. 
I realize that dealing with 3K of memory for buffer storage is not
ideal, but all of the alternatives lean more toward 4K when fully
implemented.  I'll try and see what alternative solutions we might have
available.

Thanks,

Alex

  parent reply	other threads:[~2012-05-23 16:58 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-05-15 14:38 TCPBacklogDrops during aggressive bursts of traffic Kieran Mansley
2012-05-15 14:56 ` Eric Dumazet
2012-05-15 15:00   ` Eric Dumazet
2012-05-15 16:29   ` Kieran Mansley
2012-05-15 16:34     ` Eric Dumazet
2012-05-15 16:47       ` Ben Hutchings
2012-05-15 17:01         ` Eric Dumazet
2012-05-15 17:23           ` Eric Dumazet
2012-05-17 16:31           ` Kieran Mansley
2012-05-17 16:37             ` Eric Dumazet
2012-05-18 15:45               ` Kieran Mansley
2012-05-18 15:49                 ` Eric Dumazet
2012-05-18 15:53                   ` Kieran Mansley
2012-05-18 18:40                 ` Eric Dumazet
2012-05-22  8:20               ` Kieran Mansley
2012-05-22  9:25                 ` Eric Dumazet
2012-05-22  9:30                   ` Eric Dumazet
2012-05-22 15:09                     ` Kieran Mansley
2012-05-22 16:12                       ` Eric Dumazet
2012-05-22 16:32                         ` Kieran Mansley
2012-05-22 16:45                           ` Eric Dumazet
2012-05-22 20:54                             ` Eric Dumazet
2012-05-23  9:44                               ` Eric Dumazet
2012-05-23 12:09                                 ` Eric Dumazet
2012-05-23 16:04                                   ` Alexander Duyck
2012-05-23 16:12                                     ` Eric Dumazet
2012-05-23 16:39                                       ` Eric Dumazet
2012-05-23 17:10                                         ` Alexander Duyck
2012-05-23 21:19                                           ` Alexander Duyck
2012-05-23 21:37                                             ` Eric Dumazet
2012-05-23 22:03                                               ` Alexander Duyck
2012-05-23 16:58                                       ` Alexander Duyck [this message]
2012-05-23 17:24                                         ` Eric Dumazet
2012-05-23 17:57                                           ` Alexander Duyck
2012-05-23 17:34                                 ` David Miller
2012-05-23 17:46                                   ` Eric Dumazet
2012-05-23 17:57                                     ` David Miller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FBD1740.1020304@intel.com \
    --to=alexander.h.duyck@intel.com \
    --cc=bhutchings@solarflare.com \
    --cc=eric.dumazet@gmail.com \
    --cc=jeffrey.t.kirsher@intel.com \
    --cc=kmansley@solarflare.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).