From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Florian Westphal <fw@strlen.de>,
	netdev@vger.kernel.org, Pablo Neira Ayuso <pablo@netfilter.org>,
	Thomas Graf <tgraf@suug.ch>, Cong Wang <amwang@redhat.com>,
	Patrick McHardy <kaber@trash.net>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Herbert Xu <herbert@gondor.hengli.com.au>
Subject: Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
Date: Sun, 25 Nov 2012 09:53:47 +0100
Message-ID: <1353833627.11754.134.camel@localhost>
In-Reply-To: <1353810665.2590.4774.camel@edumazet-glaptop>

On Sat, 2012-11-24 at 18:31 -0800, Eric Dumazet wrote:
> On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> > This patchset implements significant performance improvements for
> > fragmentation handling in the kernel, with a focus on NUMA and SMP
> > based systems.
> > 
> > Review:
> > 
> >  Please review these patches.  I have purposely added comments in the
> >  code using the "//" comment style.  These comments are to be removed
> >  before applying.  They serve as questions to you, the reviewer.
> > 
> > The fragmentation code today:
> > 
> >  The fragmentation code "protects" kernel resources by implementing
> >  some memory resource limitation code.  This is centered around a
> >  global readers-writer lock, and (per network namespace) an atomic mem
> >  counter and an LRU (Least-Recently-Used) list.  (Although separate
> >  global variables and namespace resources are kept for IPv4, IPv6
> >  and Netfilter reassembly.)
> > 
> >  The code tries to keep the memory usage between a high and low
> >  threshold (see: /proc/sys/net/ipv4/ipfrag_{high,low}_thresh).  The
> >  "evictor" code cleans up fragments when the high threshold is
> >  exceeded, and stops only when the low threshold is reached.
> > 
> > The scalability problem:
> > 
> >  Having a global/central variable for a resource limit is obviously a
> >  scalability issue on SMP systems, and the problem is amplified on
> >  NUMA-based systems.
> > 
> 
> 
> But ..., what practical workload even uses fragments?

(1) DNS (default for Bind) will use up to 3 UDP fragments before
switching to TCP.  This is getting more and more relevant after the
introduction of DNSSEC.  That's why I'm explicitly testing the 3xUDP
fragments case so heavily.

(2) For IPVS (load-balancing) we have recently allowed fragmentation in
tunnel mode, towards the realservers (to hide the MTU reduction for the
clients).  Thus, we need better frag performance in this case.

(3) I also have a customer that has a usage scenario/application (at
4x10G) that needs this... but I'm trying to convince them to fix/change
their application...

Scenario (1) is the real reason I want to fix this scalability issue in
the code.


> Sure, netperf -t UDP_STREAM uses frags, but it's a benchmark.

Yes, for the default large 64k packet size, it's just a "fake"
benchmark.  And notice that with my fixes, we are even faster than the
non-frag/single-UDP-packet case... but that's because we are getting a
GSO/GRO effect.

That's why I'm adjusting the UDP "frag" packet size to get a more
realistic use case... to simulate the DNS use-case (1).
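
As a rough illustration of how a payload size maps to a fragment count,
here is a minimal sketch.  It assumes IPv4 without options (20-byte
header), an 8-byte UDP header carried in the first fragment, and a
1500-byte MTU.  The helper names are hypothetical and only serve the
example; this is not how the test setup is actually scripted.

```python
import math

IPV4_HEADER = 20   # bytes, assuming no IP options
UDP_HEADER = 8     # bytes, present only in the first fragment


def per_fragment_payload(mtu=1500):
    """IP payload bytes per fragment, rounded down to a multiple of 8."""
    # Fragment offsets are expressed in 8-byte units, so every fragment
    # except the last must carry a multiple of 8 bytes.
    return (mtu - IPV4_HEADER) // 8 * 8


def udp_fragment_count(udp_payload, mtu=1500):
    """Number of IPv4 fragments needed for one UDP datagram."""
    return math.ceil((udp_payload + UDP_HEADER) / per_fragment_payload(mtu))


def payload_range(fragments, mtu=1500):
    """Smallest and largest UDP payload that yields exactly `fragments`."""
    per_frag = per_fragment_payload(mtu)
    low = (fragments - 1) * per_frag + 1 - UDP_HEADER
    high = fragments * per_frag - UDP_HEADER
    return max(low, 0), high


if __name__ == "__main__":
    print(payload_range(3))  # UDP payload sizes that produce 3 fragments
```

Under these assumptions, any UDP payload inside the range returned by
payload_range(3) ends up as exactly three fragments on the wire.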


> The only heavy user was NFS in the days it was using UDP, a very long
> time ago.
> 
> A single lost fragment means the whole packet is lost.

That is correct; that's why we need the fix in patch-01.

(It actually reminds me of the problem with ADSL/ATM, where (small) ATM
frames are used for carrying IP packets, and when some (more central) ATM
link gets overloaded, it starts to drop ATM frames without taking the AAL5
packets into account.)

> Another problem with fragments is the lack of 4-tuple hashing, as only
> the first frag contains the dst/src ports.
> 
> Also there is the sysctl_ipfrag_max_dist issue...
> 
> Hint: many NICs provide TSO (TCP offload), but none provide UFO,
> probably because there is no demand for it.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Thread overview: 19+ messages
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 2/9] net: frag cache line adjust inet_frag_queue.net Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 4/9] net: frag helper functions for mem limit tracking Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
2012-11-27  9:07   ` Jesper Dangaard Brouer
2012-11-27 15:00   ` Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 8/9] net: increase frag queue hash size and cache-line Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack) Jesper Dangaard Brouer
2012-11-26  6:03   ` Stephen Hemminger
2012-11-26  9:18   ` Florian Westphal
     [not found] ` <20121123130806.18764.41854.stgit@dragon>
2012-11-23 19:58   ` [RFC net-next PATCH V1 1/9] net: frag evictor, avoid killing warm frag queues Florian Westphal
2012-11-24 11:36     ` Jesper Dangaard Brouer
2012-11-25  2:31 ` [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Eric Dumazet
2012-11-25  8:53   ` Jesper Dangaard Brouer [this message]
2012-11-25 16:11     ` Eric Dumazet
2012-11-26 14:42       ` Jesper Dangaard Brouer
2012-11-26 15:15         ` Eric Dumazet
2012-11-26 15:29           ` Jesper Dangaard Brouer
     [not found] ` <20121123130826.18764.66507.stgit@dragon>
2012-11-26  2:54   ` [RFC net-next PATCH V1 5/9] net: frag per CPU mem limit and LRU list accounting Cong Wang
