From: Eric Dumazet <eric.dumazet@gmail.com>
To: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>,
	netfilter-devel@vger.kernel.org, netdev <netdev@vger.kernel.org>,
	Tom Herbert <therbert@google.com>,
	Patrick McHardy <kaber@trash.net>
Subject: Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock
Date: Fri, 24 May 2013 06:51:36 -0700
Message-ID: <1369403496.3301.401.camel@edumazet-glaptop>
In-Reply-To: <20130524151647.18388e27@redhat.com>

On Fri, 2013-05-24 at 15:16 +0200, Jesper Dangaard Brouer wrote:
> On Wed, 22 May 2013 10:47:48 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > nf_conntrack_lock is a monolithic lock and suffers from huge
> > contention on current generation servers (8 or more core/threads).
> > 
> [...]
> > Results on a 32 threads machine, 200 concurrent instances of "netperf
> > -t TCP_CRR" : 
> > 
> > ~390000 tps instead of ~300000 tps.
> 
> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> I gave the patch a quick run in my testlab, and the results are
> amazing, you are amazing Eric! :-)
> 
> Basic testlab setup:
>  I'm generating a 2700 Kpps SYN-flood against port 80 (with trafgen)
> 
> Baseline result from a 3.9.0-rc5 kernel:
> - With nf_conntrack my performance is 749 Kpps.
> 
> After removing all iptables and nf_conntrack modules:
> - the performance hits 1095 Kpps.
> But it looks like we are hitting a new spin_lock in ip_send_reply()
> 
> If I start a LISTEN process on the port, we hit the "old" SYN
> scalability issues again, and performance drops to 227 Kpps.
> 
> On a patched net-next (close to 3.10.0-rc1) kernel, with Eric's new
> locking scheme patch:
> - I measured an amazing 2431 Kpps.
> 
>  13.45%  [kernel]                [k] fib_table_lookup
>   9.07%  [nf_conntrack]          [k] __nf_conntrack_alloc
>   6.50%  [nf_conntrack]          [k] nf_conntrack_free
>   5.24%  [ip_tables]             [k] ipt_do_table
>   3.66%  [nf_conntrack]          [k] nf_conntrack_in
>   3.54%  [kernel]                [k] inet_getpeer
>   3.52%  [nf_conntrack]          [k] tcp_packet
>   2.44%  [ixgbe]                 [k] ixgbe_poll
>   2.30%  [kernel]                [k] __ip_route_output_key
>   2.04%  [nf_conntrack]          [k] nf_conntrack_tuple_taken
>   1.98%  [kernel]                [k] icmp_send
> 
> Then I realized that I didn't have any iptables rules that accepted
> port 80 on my testlab system, so this was basically a packet-drop
> test with an nf_conntrack lookup.
> 
> If I add a rule that accepts new connections to that port, e.g.:
>  iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT
> 
> New ruleset:
>  -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT 
>  -A INPUT -p icmp -j ACCEPT 
>  -A INPUT -i lo -j ACCEPT 
>  -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT 
>  -A INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT 
>  -A INPUT -j REJECT --reject-with icmp-host-prohibited 
> 
> Then, performance drops again:
> - to approx 883 Kpps.
> 
> I discovered that the NAT code is to blame:
> 
> -  17.71%        swapper  [kernel.kallsyms]       [k] _raw_spin_lock_bh
>    - _raw_spin_lock_bh
>       + 47.17% nf_nat_cleanup_conntrack
>       + 45.81% nf_nat_setup_info
>       + 6.43% nf_nat_get_offset
> 
> Removing the NAT modules improves the performance:
> - to 1182 Kpps (not listening on port 80)
> 
>  sudo iptables -t nat -F
>  sudo rmmod iptable_nat nf_nat_ipv4
> 
> And the perf output looks more like what I would expect:
> 
> -  14.85%       swapper  [kernel.kallsyms]        [k] _raw_spin_lock
>    - _raw_spin_lock
>       + 82.86% mod_timer
>       + 11.14% nf_conntrack_double_lock
>       + 2.50% nf_ct_del_from_dying_or_unconfirmed_list
>       + 1.48% nf_conntrack_in
>       + 1.30% nf_ct_delete_from_lists
> -  12.78%       swapper  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
>    - _raw_spin_lock_irqsave
>       - 99.44% lock_timer_base
>          + 99.07% del_timer
>          + 0.93% mod_timer
> +   2.69%       swapper  [ip_tables]              [k] ipt_do_table
> +   2.28%   ksoftirqd/0  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
> +   2.18%       swapper  [nf_conntrack]           [k] tcp_packet
> +   2.16%       swapper  [kernel.kallsyms]        [k] fib_table_lookup
> 
> 
> Again, if I start a LISTEN process on the port, performance drops to
> 169 Kpps, due to the LISTEN and SYN-cookie scalability issues.
> 
> I'm amazed, this patch will actually make it a viable choice to load
> the conntrack modules on a DDoS based filtering box, and use the
> conntracks to protect against ACK and SYN+ACK attacks.
> 
> This is done simply by not letting an ACK or SYN+ACK create a
> conntrack entry, via the command:
>  sysctl -w net/netfilter/nf_conntrack_tcp_loose=0
> 
> A quick test shows that I can now run a LISTEN process on the port and
> handle a SYN+ACK attack of approx 2580 Kpps (and the same for ACK
> attacks).
> 
> Thanks for the great work Eric!
> 
> ps. also tested resizing the hash tables, both:
>  /proc/sys/net/netfilter/nf_conntrack_max
> and resizing the buckets via:
>  /sys/module/nf_conntrack/parameters/hashsize
> 

Wow, this is very interesting!

Did you test this with expectations possible (say, with the ftp module
loaded)?

I think we should add RCU in the fast path, instead of having to take
the expectation lock. It's totally doable.
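
As a rough illustration of the idea (not kernel code, and not taken from
the patch), a lock-free read path can be sketched in userspace C.
Everything here is an assumption made for illustration: the struct expect
layout, expect_insert(), expect_find() and the insert-only list are
hypothetical, C11 atomics stand in for rcu_assign_pointer() and
rcu_dereference(), and removal with grace-period handling, which real RCU
needs, is left out entirely.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an expectation entry; not the kernel struct. */
struct expect {
	unsigned int port;
	struct expect *next;	/* never modified after the node is published */
};

static struct expect *_Atomic expect_head;	/* list head, read locklessly */
static pthread_mutex_t expect_lock = PTHREAD_MUTEX_INITIALIZER;

/* Slow path: writers serialize on a lock, then publish the new node. */
bool expect_insert(unsigned int port)
{
	struct expect *e = malloc(sizeof(*e));

	if (!e)
		return false;
	e->port = port;

	pthread_mutex_lock(&expect_lock);
	e->next = atomic_load_explicit(&expect_head, memory_order_relaxed);
	/* Release store: the node's fields are visible before the pointer is. */
	atomic_store_explicit(&expect_head, e, memory_order_release);
	pthread_mutex_unlock(&expect_lock);
	return true;
}

/* Fast path: readers walk the list without taking any lock. */
bool expect_find(unsigned int port)
{
	struct expect *e;

	/* Acquire load pairs with the release store in expect_insert(). */
	for (e = atomic_load_explicit(&expect_head, memory_order_acquire);
	     e != NULL; e = e->next) {
		if (e->port == port)
			return true;
	}
	return false;
}
```

With an empty list the reader costs a single atomic load, which is the
common case when no helper such as ftp has registered an expectation.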

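For reference, the general technique of replacing one central lock with
an array of hashed locks can be sketched in userspace as well. This is
only an analogy under stated assumptions, not the patch itself: NR_LOCKS,
locks_init(), double_lock(), double_unlock() and link_pair() are
hypothetical names, and pthread mutexes stand in for kernel spinlocks.
Taking the two locks in ascending index order is what keeps two threads
from deadlocking when they lock the same pair in opposite directions.

```c
#include <pthread.h>
#include <stddef.h>

#define NR_LOCKS 8	/* hypothetical size of the lock array */

static pthread_mutex_t locks[NR_LOCKS];
static unsigned long entries[NR_LOCKS];	/* per-lock counters, demo only */

void locks_init(void)
{
	for (int i = 0; i < NR_LOCKS; i++) {
		pthread_mutex_init(&locks[i], NULL);
		entries[i] = 0;
	}
}

/* Lock the slots covering both hashes, lowest index first. */
void double_lock(unsigned int h1, unsigned int h2)
{
	unsigned int a = h1 % NR_LOCKS, b = h2 % NR_LOCKS;

	if (a > b) {
		unsigned int t = a;

		a = b;
		b = t;
	}
	pthread_mutex_lock(&locks[a]);
	if (a != b)	/* both hashes may map to the same lock */
		pthread_mutex_lock(&locks[b]);
}

void double_unlock(unsigned int h1, unsigned int h2)
{
	unsigned int a = h1 % NR_LOCKS, b = h2 % NR_LOCKS;

	pthread_mutex_unlock(&locks[a]);
	if (a != b)
		pthread_mutex_unlock(&locks[b]);
}

/* Update the two slots an entry hashes to, as one atomic step. */
void link_pair(unsigned int h1, unsigned int h2)
{
	double_lock(h1, h2);
	entries[h1 % NR_LOCKS]++;
	entries[h2 % NR_LOCKS]++;
	double_unlock(h1, h2);
}
```

Threads touching unrelated slots no longer contend with each other; only
operations that hash to the same lock serialize.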


