* 32 core net-next stack/netfilter "scaling"
@ 2009-01-26 22:15 Rick Jones
  2009-01-26 23:10 ` Eric Dumazet
  0 siblings, 1 reply; 42+ messages in thread
From: Rick Jones @ 2009-01-26 22:15 UTC (permalink / raw)
  To: Linux Network Development list, Netfilter Developers

Folks -

Under:

ftp://ftp.netperf.org/iptable_scaling

can be found netperf results and Caliper profiles for three scenarios on a
32-core, 1.6 GHz 'Montecito' rx8640 system.  An rx8640 is what HP calls a
"cell based" system, in that it is comprised of "cell boards" on which reside
CPU and memory resources.  In this case there are four cell boards, each with
four dual-core Montecito processors and 1/4 of the overall RAM.  The system
was configured with a mix of cell-local and globally interleaved memory,
where the global interleave is on a cacheline (128 byte) boundary (IIRC).
Total RAM in the system is 256 GB.  The cells are joined via cross-bar
connections.  (numactl --hardware output is available under the URL above)

There was an "I/O expander" connected to the system.  This meant there were
as many distinct PCI-X domains as there were cells, and every cell had a
"local" set of PCI-X slots.

Into those slots I placed four HP AD385A PCI-X 10Gbit Ethernet NICs - aka
Neterion XFrame IIs.  These were then connected to an HP ProCurve 5806
switch, which was in turn connected to three 4P/16C, 2.3 GHz HP DL585 G5s,
each of which had a pair of HP AD386A PCIe 10Gbit Ethernet NICs (aka Chelsio
T3C-based).  They were running RHEL 5.2 I think.  Each NIC was in either a
PCI-X 2.0 266 MHz slot (rx8640) or a PCIe 1.mumble x8 slot (DL585 G5).

The kernel is from DaveM's net-next tree ca. last week, multiq enabled.  The
s2io driver is Neterion's out-of-tree version 2.0.36.15914, to get multiq
support.  It was loaded into the kernel via:

insmod ./s2io.ko tx_steering_type=3 tx_fifo_num=8

There were then 8 tx queues and 8 rx queues per interface in the rx8640.  The
"setaffinity.txt" script was used to set the IRQ affinities to cores
"closest" to the physical NIC.  In all three tests all 32 cores went to 100%
utilization.  At least for all incense and porpoises.  (there was some
occasional idle reported by top on the full_iptables run)

A set of 64 concurrent "burst mode" netperf omni RR tests (tcp) with a burst
mode of 17 were run (ie 17 "transactions" outstanding on a connection at one
time), with TCP_NODELAY set, and the results gathered along with a set of
Caliper profiles.  The script used to launch these can be found in
"runemomniagg2.sh.txt" under the URL above.

I picked an "RR" test to maximize the trips up and down the stack while
minimizing the bandwidth consumed.

I picked a burst size of 16 because that was sufficient to saturate a single
core on the rx8640.

I picked 64 concurrent netperfs because I wanted to make sure I had enough
concurrent connections to get spread across all the cores/queues by the
algorithms in place.

I picked the combination of 64 and 16 rather than, say, 1024 and 0 (one tran
at a time) because I didn't want to run a context switching benchmark :)

The rx8640 was picked because it was available and I was confident it was not
going to have any hardware scaling issues getting in the way.  I wanted to
see SW issues, not HW issues.  I am ass-u-me-ing the rx8640 is a reasonable
analog for any "decent or better scaling" 32 core hardware, and that while
there are ia64-specific routines present in the profiles, they are there for
platform-independent reasons.
The no_iptables/ data was run after a fresh boot, with no iptables commands
run and so no iptables related modules loaded into the kernel.

The empty_iptables/ data was run after an "iptables --list" command, which
loaded one or two modules into the kernel.

The full_iptables/ data was run after an "iptables-restore" command pointed
at full_iptables/iptables.txt, which was created from what RH creates by
default when one enables the firewall via their installer, with a port range
added by me to allow pretty much anything netperf would ask.  As such, while
it does exercise netfilter functionality, I cannot make any claims as to its
"real world" applicability.  (while the firewall settings came from an RH
setup, FWIW, the base bits running on the rx8640 are Debian Lenny, with the
net-next kernel on top)

The "cycles" profile is able to grab flat profile hits while interrupts are
disabled, so it can see stuff happening while interrupts are disabled.  The
"scgprof" profile is an attempt to get some call graphs - it does not have
visibility into code running with interrupts disabled.  The "cache" profile
is a profile that looks to get some cache miss information.

So, having said all that, details can be found under the previously mentioned
URL.  Some quick highlights:

no_iptables - ~22000 transactions/s/netperf.  Top of the cycles profile looks
like:

Function Summary
-----------------------------------------------------------------------
 % Total
    IP     Cumulat      IP
 Samples    % of     Samples
  (ETB)     Total     (ETB)    Function                                File
-----------------------------------------------------------------------
   5.70      5.70      37772   s2io.ko::tx_intr_handler
   5.14     10.84      34012   vmlinux::__ia64_readq
   4.88     15.72      32285   s2io.ko::s2io_msix_ring_handle
   4.63     20.34      30625   s2io.ko::rx_intr_handler
   4.60     24.94      30429   s2io.ko::s2io_xmit
   3.85     28.79      25488   s2io.ko::s2io_poll_msix
   2.87     31.65      18987   vmlinux::dev_queue_xmit
   2.51     34.16      16620   vmlinux::tcp_sendmsg
   2.51     36.67      16588   vmlinux::tcp_ack
   2.15     38.82      14221   vmlinux::__inet_lookup_established
   2.10     40.92      13937   vmlinux::ia64_spinlock_contention

empty_iptables - ~12000 transactions/s/netperf.  Top of the cycles profile
looks like:

Function Summary
-----------------------------------------------------------------------
 % Total
    IP     Cumulat      IP
 Samples    % of     Samples
  (ETB)     Total     (ETB)    Function                                File
-----------------------------------------------------------------------
  26.38     26.38     137458   vmlinux::_read_lock_bh
  10.63     37.01      55388   vmlinux::local_bh_enable_ip
   3.42     40.43      17812   s2io.ko::tx_intr_handler
   3.01     43.44      15691   ip_tables.ko::ipt_do_table
   2.90     46.34      15100   vmlinux::__ia64_readq
   2.72     49.06      14179   s2io.ko::rx_intr_handler
   2.55     51.61      13288   s2io.ko::s2io_xmit
   1.98     53.59      10329   s2io.ko::s2io_msix_ring_handle
   1.75     55.34       9104   vmlinux::dev_queue_xmit
   1.64     56.98       8546   s2io.ko::s2io_poll_msix
   1.52     58.50       7943   vmlinux::sock_wfree
   1.40     59.91       7302   vmlinux::tcp_ack

full_iptables - some test instances didn't complete, I think they got
starved.  Of those which did complete, their performance ranged all the way
from 330 to 3100 transactions/s/netperf.
Top of the cycles profile looks like:

Function Summary
-----------------------------------------------------------------------
 % Total
    IP     Cumulat      IP
 Samples    % of     Samples
  (ETB)     Total     (ETB)    Function                                File
-----------------------------------------------------------------------
  64.71     64.71     582171   vmlinux::_write_lock_bh
  18.43     83.14     165822   vmlinux::ia64_spinlock_contention
   2.86     85.99      25709   nf_conntrack.ko::init_module
   2.36     88.35      21194   nf_conntrack.ko::tcp_packet
   1.78     90.13      16009   vmlinux::_spin_lock_bh
   1.20     91.33      10810   nf_conntrack.ko::nf_conntrack_in
   1.20     92.52      10755   vmlinux::nf_iterate
   1.09     93.62       9833   vmlinux::default_idle
   0.26     93.88       2331   vmlinux::__ia64_readq
   0.25     94.12       2213   vmlinux::__interrupt
   0.24     94.37       2203   s2io.ko::tx_intr_handler

Suggestions as to things to look at/with and/or patches to try are welcome.
I should have the HW available to me for at least a little while, but not
indefinitely.

rick jones

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling"
  2009-01-26 22:15 32 core net-next stack/netfilter "scaling" Rick Jones
@ 2009-01-26 23:10 ` Eric Dumazet
  2009-01-26 23:14   ` Stephen Hemminger
                     ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Eric Dumazet @ 2009-01-26 23:10 UTC (permalink / raw)
  To: Rick Jones
  Cc: Linux Network Development list, Netfilter Developers,
      Stephen Hemminger, Patrick McHardy

Rick Jones a écrit :
> [...]

Hi Rick, nice hardware you have :)

Stephen had a patch to nuke read_lock() from iptables, using RCU and
seqlocks.  I hit this contention point even with low-cost hardware and a
quite standard application.

I pinged him a few days ago to try to finish the job with him, but it seems
Stephen is busy at the moment.

Then conntrack (tcp sessions) is awful, since it uses a single rwlock_t
tcp_lock that must be write_locked() for basically every handled tcp frame...

How long is "not indefinitely" ?

^ permalink raw reply	[flat|nested] 42+ messages in thread
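As a picture of the pattern Eric is objecting to (a userspace caricature, not
the kernel code itself): per-connection state is private to each connection,
yet every update goes through one global writer lock, so packet processing on
all cores serializes on it.  The struct and function names below are invented
for the illustration; the later patches in this thread replace exactly this
arrangement.

```c
#include <pthread.h>
#include <stdio.h>

/* One process-wide lock guarding the tracking state of every connection,
 * mirroring the single rwlock_t tcp_lock described above. */
static pthread_rwlock_t tcp_lock = PTHREAD_RWLOCK_INITIALIZER;

struct conn_state {
	int tcp_state;          /* hypothetical per-connection field */
};

/* Called for basically every handled TCP frame: although each frame only
 * touches its own connection, every thread funnels through the same writer
 * lock, so the updates serialize globally. */
static void track_packet(struct conn_state *ct, int new_state)
{
	pthread_rwlock_wrlock(&tcp_lock);
	ct->tcp_state = new_state;
	pthread_rwlock_unlock(&tcp_lock);
}

int main(void)
{
	struct conn_state a = { 0 }, b = { 0 };

	track_packet(&a, 1);    /* unrelated connections... */
	track_packet(&b, 2);    /* ...still contend on the one lock */
	printf("states: %d %d\n", a.tcp_state, b.tcp_state);
	return 0;
}
```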
* Re: 32 core net-next stack/netfilter "scaling"
  2009-01-26 23:10 ` Eric Dumazet
@ 2009-01-26 23:14   ` Stephen Hemminger
  2009-01-26 23:19   ` Rick Jones
  2009-02-10 18:44   ` Stephen Hemminger
  2 siblings, 0 replies; 42+ messages in thread
From: Stephen Hemminger @ 2009-01-26 23:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, Linux Network Development list, Netfilter Developers,
      Patrick McHardy

On Tue, 27 Jan 2009 00:10:44 +0100
Eric Dumazet <dada1@cosmosbay.com> wrote:

> Rick Jones a écrit :
> > [...]
>
> Hi Rick, nice hardware you have :)
>
> Stephen had a patch to nuke read_lock() from iptables, using RCU and
> seqlocks.  I hit this contention point even with low-cost hardware and a
> quite standard application.
>
> I pinged him a few days ago to try to finish the job with him, but it seems
> Stephen is busy at the moment.
>
> Then conntrack (tcp sessions) is awful, since it uses a single rwlock_t
> tcp_lock that must be write_locked() for basically every handled tcp frame...
>
> How long is "not indefinitely" ?

Hey, I just got back from Linux Conf Au, haven't had time to catch up yet.
It is on my list, after dealing with the other work related stuff.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-26 23:10 ` Eric Dumazet 2009-01-26 23:14 ` Stephen Hemminger @ 2009-01-26 23:19 ` Rick Jones 2009-01-27 9:10 ` Eric Dumazet 2009-02-10 18:44 ` Stephen Hemminger 2 siblings, 1 reply; 42+ messages in thread From: Rick Jones @ 2009-01-26 23:19 UTC (permalink / raw) To: Eric Dumazet Cc: Linux Network Development list, Netfilter Developers, Stephen Hemminger, Patrick McHardy > > Hi Rick, nice hardware you have :) Every once in a while the cobbler's children get some shoes to wear :) > > Stephen had a patch to nuke read_lock() from iptables, using RCU and seqlocks. > I hit this contention point even with low cost hardware, and quite standard application. > > I pinged him few days ago to try to finish the job with him, but it seems Stephen > is busy at the moment. > > Then conntrack (tcp sessions) is awfull, since it uses a single rwlock_t tcp_lock > that must be write_locked() for basically every handled tcp frame... > > How long is "not indefinitely" ? The system I am using is being borrowed under an open ended loan. However, my use of it can be thought of as being the "null process" in VMS - once anyone else wants it I have to get off of it. That said, I would guess that the chances of someone else trying to get that system are pretty small for the next four+ weeks. I had a similar system (PCIe I/O rather than PCI-X) for quite a few weeks before it got pulled-out from under. rick ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-26 23:19 ` Rick Jones @ 2009-01-27 9:10 ` Eric Dumazet 2009-01-27 9:15 ` Patrick McHardy 0 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-27 9:10 UTC (permalink / raw) To: Rick Jones Cc: Linux Network Development list, Netfilter Developers, Stephen Hemminger, Patrick McHardy Rick Jones a écrit : >> >> Hi Rick, nice hardware you have :) > > Every once in a while the cobbler's children get some shoes to wear :) > >> >> Stephen had a patch to nuke read_lock() from iptables, using RCU and >> seqlocks. >> I hit this contention point even with low cost hardware, and quite >> standard application. >> >> I pinged him few days ago to try to finish the job with him, but it >> seems Stephen >> is busy at the moment. >> >> Then conntrack (tcp sessions) is awfull, since it uses a single >> rwlock_t tcp_lock >> that must be write_locked() for basically every handled tcp frame... >> >> How long is "not indefinitely" ? > > The system I am using is being borrowed under an open ended loan. > However, my use of it can be thought of as being the "null process" in > VMS - once anyone else wants it I have to get off of it. > > That said, I would guess that the chances of someone else trying to get > that system are pretty small for the next four+ weeks. I had a similar > system (PCIe I/O rather than PCI-X) for quite a few weeks before it got > pulled-out from under. > As Stephen will probably send the iptable part, I can do the tcp conntrack improvment. Let me know how it helps your last test (connection tracking enabled) Thanks [PATCH] netfilter: Get rid of central rwlock in tcp conntracking TCP connection tracking suffers of huge contention on a global rwlock, used to protect tcp conntracking state. As each tcp conntrack state have no relations between each others, we can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp" tcp_print_conntrack() dont need to lock anything to read ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Reported-by: Rick Jones <rick.jones2@hp.com> --- diff --git a/include/linux/netfilter/nf_conntrack_tcp.h b/include/linux/netfilter/nf_conntrack_tcp.h index a049df4..3dc72c6 100644 --- a/include/linux/netfilter/nf_conntrack_tcp.h +++ b/include/linux/netfilter/nf_conntrack_tcp.h @@ -50,6 +50,7 @@ struct ip_ct_tcp_state { struct ip_ct_tcp { + spinlock_t lock; struct ip_ct_tcp_state seen[2]; /* connection parameters per direction */ u_int8_t state; /* state of the connection (enum tcp_conntrack) */ /* For detecting stale connections */ diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c index a1edb9c..1bc8fc1 100644 --- a/net/netfilter/nf_conntrack_proto_tcp.c +++ b/net/netfilter/nf_conntrack_proto_tcp.c @@ -26,9 +26,6 @@ #include <net/netfilter/nf_conntrack_ecache.h> #include <net/netfilter/nf_log.h> -/* Protects ct->proto.tcp */ -static DEFINE_RWLOCK(tcp_lock); - /* "Be conservative in what you do, be liberal in what you accept from others." If it's non-zero, we mark only out of window RST segments as INVALID. 
*/ @@ -297,9 +294,7 @@ static int tcp_print_conntrack(struct seq_file *s, const struct nf_conn *ct) { enum tcp_conntrack state; - read_lock_bh(&tcp_lock); state = ct->proto.tcp.state; - read_unlock_bh(&tcp_lock); return seq_printf(s, "%s ", tcp_conntrack_names[state]); } @@ -705,14 +700,14 @@ void nf_conntrack_tcp_update(const struct sk_buff *skb, end = segment_seq_plus_len(ntohl(tcph->seq), skb->len, dataoff, tcph); - write_lock_bh(&tcp_lock); + spin_lock_bh(&ct->proto.lock); /* * We have to worry for the ack in the reply packet only... */ if (after(end, ct->proto.tcp.seen[dir].td_end)) ct->proto.tcp.seen[dir].td_end = end; ct->proto.tcp.last_end = end; - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.lock); pr_debug("tcp_update: sender end=%u maxend=%u maxwin=%u scale=%i " "receiver end=%u maxend=%u maxwin=%u scale=%i\n", sender->td_end, sender->td_maxend, sender->td_maxwin, @@ -821,7 +816,7 @@ static int tcp_packet(struct nf_conn *ct, th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph); BUG_ON(th == NULL); - write_lock_bh(&tcp_lock); + spin_lock_bh(&ct->proto.tcp.lock); old_state = ct->proto.tcp.state; dir = CTINFO2DIR(ctinfo); index = get_conntrack_index(th); @@ -851,7 +846,7 @@ static int tcp_packet(struct nf_conn *ct, && ct->proto.tcp.last_index == TCP_RST_SET)) { /* Attempt to reopen a closed/aborted connection. * Delete this connection and look up again. */ - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); /* Only repeat if we can actually remove the timer. * Destruction may already be in progress in process @@ -887,7 +882,7 @@ static int tcp_packet(struct nf_conn *ct, * that the client cannot but retransmit its SYN and * thus initiate a clean new session. */ - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: killing out of sync session "); @@ -900,7 +895,7 @@ static int tcp_packet(struct nf_conn *ct, ct->proto.tcp.last_end = segment_seq_plus_len(ntohl(th->seq), skb->len, dataoff, th); - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: invalid packet ignored "); @@ -909,7 +904,7 @@ static int tcp_packet(struct nf_conn *ct, /* Invalid packet */ pr_debug("nf_ct_tcp: Invalid dir=%i index=%u ostate=%u\n", dir, get_conntrack_index(th), old_state); - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: invalid state "); @@ -940,7 +935,7 @@ static int tcp_packet(struct nf_conn *ct, if (!tcp_in_window(ct, &ct->proto.tcp, dir, index, skb, dataoff, th, pf)) { - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); return -NF_ACCEPT; } in_window: @@ -969,7 +964,7 @@ static int tcp_packet(struct nf_conn *ct, timeout = nf_ct_tcp_timeout_unacknowledged; else timeout = tcp_timeouts[new_state]; - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); nf_conntrack_event_cache(IPCT_PROTOINFO_VOLATILE, ct); if (new_state != old_state) @@ -1022,6 +1017,7 @@ static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb, pr_debug("nf_ct_tcp: invalid new deleting.\n"); return false; } + spin_lock_init(&ct->proto.tcp.lock); if (new_state == TCP_CONNTRACK_SYN_SENT) { /* SYN packet */ @@ -1092,7 +1088,7 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla, struct nlattr *nest_parms; struct nf_ct_tcp_flags tmp = 
{}; - read_lock_bh(&tcp_lock); + spin_lock_bh(&ct->proto.tcp.lock); nest_parms = nla_nest_start(skb, CTA_PROTOINFO_TCP | NLA_F_NESTED); if (!nest_parms) goto nla_put_failure; @@ -1112,14 +1108,14 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla, tmp.flags = ct->proto.tcp.seen[1].flags; NLA_PUT(skb, CTA_PROTOINFO_TCP_FLAGS_REPLY, sizeof(struct nf_ct_tcp_flags), &tmp); - read_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); nla_nest_end(skb, nest_parms); return 0; nla_put_failure: - read_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); return -1; } @@ -1150,7 +1146,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct) nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]) >= TCP_CONNTRACK_MAX) return -EINVAL; - write_lock_bh(&tcp_lock); + spin_lock_bh(&ct->proto.tcp.lock); if (tb[CTA_PROTOINFO_TCP_STATE]) ct->proto.tcp.state = nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]); @@ -1177,7 +1173,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct) ct->proto.tcp.seen[1].td_scale = nla_get_u8(tb[CTA_PROTOINFO_TCP_WSCALE_REPLY]); } - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); return 0; } -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling"
  2009-01-27 9:10 ` Eric Dumazet
@ 2009-01-27 9:15   ` Patrick McHardy
  2009-01-27 11:29     ` Eric Dumazet
  2009-01-27 16:23     ` Eric Dumazet
  0 siblings, 2 replies; 42+ messages in thread
From: Patrick McHardy @ 2009-01-27 9:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, Linux Network Development list, Netfilter Developers,
      Stephen Hemminger

Eric Dumazet wrote:
> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking
>
> TCP connection tracking suffers of huge contention on a global rwlock,
> used to protect tcp conntracking state.
> As each tcp conntrack state have no relations between each others, we
> can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp"
>
> tcp_print_conntrack() dont need to lock anything to read
> ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well.

That's an interesting test-case, but one lock per conntrack just for
TCP tracking seems like overkill.  We're trying to keep the conntrack
structures as small as possible, so I'd prefer an array of spinlocks
or something like that.

^ permalink raw reply	[flat|nested] 42+ messages in thread
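To make the array-of-spinlocks suggestion concrete before Eric's revised
patch below: instead of one lock per object (or one global lock, as sketched
earlier in the thread), each object is mapped onto one lock from a fixed pool
by hashing its address, so the per-object structure needs no extra lock
storage.  A minimal userspace sketch, with pthread mutexes standing in for
kernel spinlocks and all names and sizes invented for the illustration:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define LOCK_POOL_SZ 512                 /* fixed pool shared by all objects */

static pthread_mutex_t lock_pool[LOCK_POOL_SZ];

/* Hypothetical per-connection state; the point is that it carries no lock. */
struct conn_state {
	int tcp_state;
};

/* Map an object's address to one lock in the pool.  The xor-shift folds
 * higher address bits into the low bits so that objects allocated close
 * together are spread over different locks. */
static pthread_mutex_t *conn_lock(const struct conn_state *ct)
{
	uintptr_t h = (uintptr_t)ct;

	h ^= h >> 16;
	h ^= h >> 8;
	return &lock_pool[h % LOCK_POOL_SZ];
}

static void conn_set_state(struct conn_state *ct, int new_state)
{
	pthread_mutex_t *l = conn_lock(ct);

	pthread_mutex_lock(l);           /* contends only with objects hashing
					  * to the same slot, not globally */
	ct->tcp_state = new_state;
	pthread_mutex_unlock(l);
}

int main(void)
{
	struct conn_state a, b;

	for (int i = 0; i < LOCK_POOL_SZ; i++)
		pthread_mutex_init(&lock_pool[i], NULL);

	conn_set_state(&a, 1);
	conn_set_state(&b, 2);
	printf("a and b use %s lock\n",
	       conn_lock(&a) == conn_lock(&b) ? "the same" : "different");
	return 0;
}
```

The trade-off is the one discussed above: no growth of the per-connection
structure, at the cost of occasional false contention when two unrelated
connections happen to hash to the same pool slot.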
* Re: 32 core net-next stack/netfilter "scaling"
  2009-01-27 9:15 ` Patrick McHardy
@ 2009-01-27 11:29   ` Eric Dumazet
  2009-01-27 11:37     ` Patrick McHardy
  0 siblings, 1 reply; 42+ messages in thread
From: Eric Dumazet @ 2009-01-27 11:29 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Rick Jones, Linux Network Development list, Netfilter Developers,
      Stephen Hemminger

Patrick McHardy a écrit :
> Eric Dumazet wrote:
>> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking
>>
>> TCP connection tracking suffers of huge contention on a global rwlock,
>> used to protect tcp conntracking state.
>> As each tcp conntrack state have no relations between each others, we
>> can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp"
>>
>> tcp_print_conntrack() dont need to lock anything to read
>> ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well.
>
> That's an interesting test-case, but one lock per conntrack just for
> TCP tracking seems like overkill.  We're trying to keep the conntrack
> structures as small as possible, so I'd prefer an array of spinlocks
> or something like that.

Yes, this is wise.  Current sizeof(struct nf_conn) is 220 (0xdc) on 32 bits,
probably rounded to 0xE0 by SLAB/SLUB.  I will provide a new patch using an
array of, say, 512 spinlocks.  (512 spinlocks use 2048 bytes with
non-debugging spinlocks; that spreads over 32 x 64-byte cache lines)

However, I wonder if, for a very large number of cpus, we should at least ask
conntrack to use a hardware-aligned "struct nf_conn" to avoid false sharing.

We might also use a generic SLAB_HWCACHE_ALIGN_IFMANYCPUS flag if the same
tactic could help other kmem_cache_create() users.

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 90ce9dd..82332ce 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1167,8 +1167,10 @@ static int nf_conntrack_init_init_net(void)
 		 nf_conntrack_max);
 
 	nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
-						sizeof(struct nf_conn),
-						0, 0, NULL);
+						sizeof(struct nf_conn), 0,
+						num_possible_cpus() >= 32 ?
+							SLAB_HWCACHE_ALIGN : 0,
+						NULL);
 	if (!nf_conntrack_cachep) {
 		printk(KERN_ERR "Unable to create nf_conn slab cache\n");
 		ret = -ENOMEM;
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling"
  2009-01-27 11:29 ` Eric Dumazet
@ 2009-01-27 11:37   ` Patrick McHardy
  0 siblings, 0 replies; 42+ messages in thread
From: Patrick McHardy @ 2009-01-27 11:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, Linux Network Development list, Netfilter Developers,
      Stephen Hemminger

Eric Dumazet wrote:
> Patrick McHardy a écrit :
>
>> That's an interesting test-case, but one lock per conntrack just for
>> TCP tracking seems like overkill.  We're trying to keep the conntrack
>> structures as small as possible, so I'd prefer an array of spinlocks
>> or something like that.
>
> Yes, this is wise.  Current sizeof(struct nf_conn) is 220 (0xdc) on 32 bits,
> probably rounded to 0xE0 by SLAB/SLUB.  I will provide a new patch using an
> array of, say, 512 spinlocks.  (512 spinlocks use 2048 bytes with
> non-debugging spinlocks; that spreads over 32 x 64-byte cache lines)

Sounds good, but it should be limited to NR_CPUS I guess.

> However, I wonder if, for a very large number of cpus, we should at least
> ask conntrack to use a hardware-aligned "struct nf_conn" to avoid false
> sharing.

I'm not sure that is really a problem in practice; you usually have quite a
few inactive conntrack entries, and false sharing would only happen when two
consecutive entries are active.

^ permalink raw reply	[flat|nested] 42+ messages in thread
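For readers unfamiliar with the false-sharing concern being debated here, the
effect itself is easy to reproduce in userspace; the sketch below is purely
illustrative (it has nothing to do with nf_conn itself) and times two threads
incrementing counters that either share a cache line or sit on separate ones.
Whether adjacent conntrack entries are hot at the same time in practice is
exactly the open question above.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

struct shared {                       /* both counters in one cache line */
	volatile unsigned long a;
	volatile unsigned long b;
};

struct padded {                       /* counters forced onto separate lines */
	volatile unsigned long a;
	char pad[128];                /* assumes cache lines of at most 128 B */
	volatile unsigned long b;
};

static void *bump(void *p)
{
	volatile unsigned long *c = p;

	for (unsigned long i = 0; i < ITERS; i++)
		(*c)++;
	return NULL;
}

static double run(volatile unsigned long *x, volatile unsigned long *y)
{
	pthread_t t1, t2;
	struct timespec s, e;

	clock_gettime(CLOCK_MONOTONIC, &s);
	pthread_create(&t1, NULL, bump, (void *)x);
	pthread_create(&t2, NULL, bump, (void *)y);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	clock_gettime(CLOCK_MONOTONIC, &e);
	return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void)
{
	static struct shared s;
	static struct padded p;

	printf("same cache line : %.2f s\n", run(&s.a, &s.b));
	printf("separate lines  : %.2f s\n", run(&p.a, &p.b));
	return 0;
}
```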
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 9:15 ` Patrick McHardy 2009-01-27 11:29 ` Eric Dumazet @ 2009-01-27 16:23 ` Eric Dumazet 2009-01-27 17:33 ` Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-27 16:23 UTC (permalink / raw) To: Patrick McHardy Cc: Rick Jones, Linux Network Development list, Netfilter Developers, Stephen Hemminger Patrick McHardy a écrit : > Eric Dumazet wrote: >> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking >> >> TCP connection tracking suffers of huge contention on a global rwlock, >> used to protect tcp conntracking state. >> As each tcp conntrack state have no relations between each others, we >> can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp" >> >> tcp_print_conntrack() dont need to lock anything to read >> ct->proto.tcp.state, >> so speedup /proc/net/ip_conntrack as well. > > Thats an interesting test-case, but one lock per conntrack just for > TCP tracking seems like overkill. We're trying to keep the conntrack > stuctures as small as possible, so I'd prefer an array of spinlocks > or something like that. Please find a new version of patch, using an array of spinlocks. I chose to define an array at nf_conntrack level, since tcp conntracking is one user of this array but we might find other users of this array. Thanks [PATCH] netfilter: Get rid of central rwlock in tcp conntracking TCP connection tracking suffers of huge contention on a global rwlock, used to protect tcp conntracking state. As each tcp conntrack state have no relations between each others, we can switch to fine grained lock. Using an array of spinlocks avoids enlarging size of connection tracking structures, yet giving reasonable fanout. tcp_print_conntrack() doesnt need to lock anything to read ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Reported-by: Rick Jones <rick.jones2@hp.com> --- include/net/netfilter/nf_conntrack.h | 32 +++++++++++++++++++++ net/netfilter/nf_conntrack_core.c | 10 ++++-- net/netfilter/nf_conntrack_proto_tcp.c | 34 ++++++++++------------- 3 files changed, 53 insertions(+), 23 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 2e0c536..288aff5 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -129,6 +129,38 @@ struct nf_conn struct rcu_head rcu; }; +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \ + defined(CONFIG_PROVE_LOCKING) + +/* + * We reserve 16 locks per cpu (typical cache line size is 64 bytes) + * maxed to 512 locks so that ct_hlock[] uses at most 2048 bytes of memory. 
+ * (on lockdep we have a quite big spinlock_t, so keep the size down there) + */ +#ifdef CONFIG_LOCKDEP +#define CT_HASH_LOCK_SZ 64 +#elif NR_CPUS >= 32 +#define CT_HASH_LOCK_SZ 512 +#else +#define CT_HASH_LOCK_SZ (roundup_pow_of_two(NR_CPUS) * 16) +#endif + +extern spinlock_t ct_hlock[CT_HASH_LOCK_SZ]; + +#else +#define CT_HASH_LOCK_SZ 0 +#endif +static inline spinlock_t *ct_lock_addr(const struct nf_conn *ct) +{ + if (CT_HASH_LOCK_SZ) { + unsigned long hash = (unsigned long)ct; + hash ^= hash >> 16; + hash ^= hash >> 8; + return &ct_hlock[hash % CT_HASH_LOCK_SZ]; + } + return NULL; +} + static inline struct nf_conn * nf_ct_tuplehash_to_ctrack(const struct nf_conntrack_tuple_hash *hash) { diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 90ce9dd..68822d8 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -61,9 +61,9 @@ struct nf_conn nf_conntrack_untracked __read_mostly; EXPORT_SYMBOL_GPL(nf_conntrack_untracked); static struct kmem_cache *nf_conntrack_cachep __read_mostly; - -static int nf_conntrack_hash_rnd_initted; -static unsigned int nf_conntrack_hash_rnd; +spinlock_t ct_hlock[CT_HASH_LOCK_SZ]; +static int nf_conntrack_hash_rnd_initted __read_mostly; +static unsigned int nf_conntrack_hash_rnd __read_mostly; static u_int32_t __hash_conntrack(const struct nf_conntrack_tuple *tuple, unsigned int size, unsigned int rnd) @@ -1141,7 +1141,7 @@ module_param_call(hashsize, nf_conntrack_set_hashsize, param_get_uint, static int nf_conntrack_init_init_net(void) { int max_factor = 8; - int ret; + int i, ret; /* Idea from tcp.c: use 1/16384 of memory. On i386: 32MB * machine has 512 buckets. >= 1GB machines have 16384 buckets. */ @@ -1174,6 +1174,8 @@ static int nf_conntrack_init_init_net(void) ret = -ENOMEM; goto err_cache; } + for (i = 0; i < CT_HASH_LOCK_SZ; i++) + spin_lock_init(&ct_hlock[i]); ret = nf_conntrack_proto_init(); if (ret < 0) diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c index a1edb9c..247e82f 100644 --- a/net/netfilter/nf_conntrack_proto_tcp.c +++ b/net/netfilter/nf_conntrack_proto_tcp.c @@ -26,9 +26,6 @@ #include <net/netfilter/nf_conntrack_ecache.h> #include <net/netfilter/nf_log.h> -/* Protects ct->proto.tcp */ -static DEFINE_RWLOCK(tcp_lock); - /* "Be conservative in what you do, be liberal in what you accept from others." If it's non-zero, we mark only out of window RST segments as INVALID. */ @@ -297,9 +294,7 @@ static int tcp_print_conntrack(struct seq_file *s, const struct nf_conn *ct) { enum tcp_conntrack state; - read_lock_bh(&tcp_lock); state = ct->proto.tcp.state; - read_unlock_bh(&tcp_lock); return seq_printf(s, "%s ", tcp_conntrack_names[state]); } @@ -705,14 +700,14 @@ void nf_conntrack_tcp_update(const struct sk_buff *skb, end = segment_seq_plus_len(ntohl(tcph->seq), skb->len, dataoff, tcph); - write_lock_bh(&tcp_lock); + spin_lock_bh(ct_lock_addr(ct)); /* * We have to worry for the ack in the reply packet only... 
*/ if (after(end, ct->proto.tcp.seen[dir].td_end)) ct->proto.tcp.seen[dir].td_end = end; ct->proto.tcp.last_end = end; - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); pr_debug("tcp_update: sender end=%u maxend=%u maxwin=%u scale=%i " "receiver end=%u maxend=%u maxwin=%u scale=%i\n", sender->td_end, sender->td_maxend, sender->td_maxwin, @@ -821,7 +816,7 @@ static int tcp_packet(struct nf_conn *ct, th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph); BUG_ON(th == NULL); - write_lock_bh(&tcp_lock); + spin_lock_bh(ct_lock_addr(ct)); old_state = ct->proto.tcp.state; dir = CTINFO2DIR(ctinfo); index = get_conntrack_index(th); @@ -851,7 +846,7 @@ static int tcp_packet(struct nf_conn *ct, && ct->proto.tcp.last_index == TCP_RST_SET)) { /* Attempt to reopen a closed/aborted connection. * Delete this connection and look up again. */ - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); /* Only repeat if we can actually remove the timer. * Destruction may already be in progress in process @@ -887,7 +882,7 @@ static int tcp_packet(struct nf_conn *ct, * that the client cannot but retransmit its SYN and * thus initiate a clean new session. */ - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: killing out of sync session "); @@ -900,7 +895,7 @@ static int tcp_packet(struct nf_conn *ct, ct->proto.tcp.last_end = segment_seq_plus_len(ntohl(th->seq), skb->len, dataoff, th); - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: invalid packet ignored "); @@ -909,7 +904,7 @@ static int tcp_packet(struct nf_conn *ct, /* Invalid packet */ pr_debug("nf_ct_tcp: Invalid dir=%i index=%u ostate=%u\n", dir, get_conntrack_index(th), old_state); - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: invalid state "); @@ -940,7 +935,7 @@ static int tcp_packet(struct nf_conn *ct, if (!tcp_in_window(ct, &ct->proto.tcp, dir, index, skb, dataoff, th, pf)) { - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); return -NF_ACCEPT; } in_window: @@ -969,7 +964,7 @@ static int tcp_packet(struct nf_conn *ct, timeout = nf_ct_tcp_timeout_unacknowledged; else timeout = tcp_timeouts[new_state]; - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); nf_conntrack_event_cache(IPCT_PROTOINFO_VOLATILE, ct); if (new_state != old_state) @@ -1022,6 +1017,7 @@ static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb, pr_debug("nf_ct_tcp: invalid new deleting.\n"); return false; } + spin_lock_init(ct_lock_addr(ct)); if (new_state == TCP_CONNTRACK_SYN_SENT) { /* SYN packet */ @@ -1092,7 +1088,7 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla, struct nlattr *nest_parms; struct nf_ct_tcp_flags tmp = {}; - read_lock_bh(&tcp_lock); + spin_lock_bh(ct_lock_addr(ct)); nest_parms = nla_nest_start(skb, CTA_PROTOINFO_TCP | NLA_F_NESTED); if (!nest_parms) goto nla_put_failure; @@ -1112,14 +1108,14 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla, tmp.flags = ct->proto.tcp.seen[1].flags; NLA_PUT(skb, CTA_PROTOINFO_TCP_FLAGS_REPLY, sizeof(struct nf_ct_tcp_flags), &tmp); - read_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); nla_nest_end(skb, nest_parms); return 0; nla_put_failure: - read_unlock_bh(&tcp_lock); + 
spin_unlock_bh(ct_lock_addr(ct)); return -1; } @@ -1150,7 +1146,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct) nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]) >= TCP_CONNTRACK_MAX) return -EINVAL; - write_lock_bh(&tcp_lock); + spin_lock_bh(ct_lock_addr(ct)); if (tb[CTA_PROTOINFO_TCP_STATE]) ct->proto.tcp.state = nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]); @@ -1177,7 +1173,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct) ct->proto.tcp.seen[1].td_scale = nla_get_u8(tb[CTA_PROTOINFO_TCP_WSCALE_REPLY]); } - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); return 0; } ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 16:23 ` Eric Dumazet @ 2009-01-27 17:33 ` Patrick McHardy 2009-01-27 18:02 ` Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Patrick McHardy @ 2009-01-27 17:33 UTC (permalink / raw) To: Eric Dumazet Cc: Rick Jones, Linux Network Development list, Netfilter Developers, Stephen Hemminger Eric Dumazet wrote: > [PATCH] netfilter: Get rid of central rwlock in tcp conntracking > > TCP connection tracking suffers of huge contention on a global rwlock, > used to protect tcp conntracking state. > > As each tcp conntrack state have no relations between each others, we > can switch to fine grained lock. Using an array of spinlocks avoids > enlarging size of connection tracking structures, yet giving reasonable > fanout. > > tcp_print_conntrack() doesnt need to lock anything to read > ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. > > nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly > This looks good to me. Rick, would you like to give it a try? I'll convert the remaining conntrack protocols when applying it. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 17:33 ` Patrick McHardy @ 2009-01-27 18:02 ` Rick Jones 2009-01-27 19:09 ` Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Rick Jones @ 2009-01-27 18:02 UTC (permalink / raw) To: Patrick McHardy Cc: Eric Dumazet, Linux Network Development list, Netfilter Developers, Stephen Hemminger Patrick McHardy wrote: > Eric Dumazet wrote: > >>[PATCH] netfilter: Get rid of central rwlock in tcp conntracking >> >>TCP connection tracking suffers of huge contention on a global rwlock, >>used to protect tcp conntracking state. >> >>As each tcp conntrack state have no relations between each others, we >>can switch to fine grained lock. Using an array of spinlocks avoids >>enlarging size of connection tracking structures, yet giving reasonable >>fanout. >> >>tcp_print_conntrack() doesnt need to lock anything to read >>ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. >> >>nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly >> > > > This looks good to me. Rick, would you like to give it a try? > > I'll convert the remaining conntrack protocols when applying it. I will give it a try and let folks know the results - unless told otherwise, I will ass-u-me I only need rerun the "full_iptables" test case. rick jones ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 18:02 ` Rick Jones @ 2009-01-27 19:09 ` Rick Jones 2009-01-27 19:24 ` Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Rick Jones @ 2009-01-27 19:09 UTC (permalink / raw) To: Rick Jones Cc: Patrick McHardy, Eric Dumazet, Linux Network Development list, Netfilter Developers, Stephen Hemminger Rick Jones wrote: > Patrick McHardy wrote: > >> Eric Dumazet wrote: >> >>> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking >>> >>> TCP connection tracking suffers of huge contention on a global rwlock, >>> used to protect tcp conntracking state. >>> >>> As each tcp conntrack state have no relations between each others, we >>> can switch to fine grained lock. Using an array of spinlocks avoids >>> enlarging size of connection tracking structures, yet giving reasonable >>> fanout. >>> >>> tcp_print_conntrack() doesnt need to lock anything to read >>> ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. >>> >>> nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared >>> read_mostly >>> >> >> >> >> This looks good to me. Rick, would you like to give it a try? >> >> I'll convert the remaining conntrack protocols when applying it. > > > I will give it a try and let folks know the results - unless told > otherwise, I will ass-u-me I only need rerun the "full_iptables" test case. The runemomniagg2.sh script is still running, but the initial cycles profile suggests that the main change is converting the write_lock time into spinlock contention time with 78.39% of the cycles spent in ia64_spinlock_contention. When the script completes I'll upload the profiles and the netperf results to the same base URL as in the basenote under "contrack01/" rick jones ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 19:09 ` Rick Jones @ 2009-01-27 19:24 ` Rick Jones 2009-01-27 22:17 ` Eric Dumazet 2009-01-28 13:55 ` Eric Dumazet 0 siblings, 2 replies; 42+ messages in thread From: Rick Jones @ 2009-01-27 19:24 UTC (permalink / raw) To: Netfilter Developers Cc: Patrick McHardy, Eric Dumazet, Linux Network Development list, Stephen Hemminger >> I will give it a try and let folks know the results - unless told >> otherwise, I will ass-u-me I only need rerun the "full_iptables" test >> case. > > > The runemomniagg2.sh script is still running, but the initial cycles > profile suggests that the main change is converting the write_lock time > into spinlock contention time with 78.39% of the cycles spent in > ia64_spinlock_contention. When the script completes I'll upload the > profiles and the netperf results to the same base URL as in the basenote > under "contrack01/" The script completed - although at some point I hit an fd limit - I think I have an fd leak in netperf somewhere :( . Anyhow, there are still some netperfs that end-up kicking the bucket during the run - I suspect starvation because where in the other configs (no iptables, and empty iptables) each netperf seems to consume about 50% of a CPU - stands to reason - 64 netperfs, 32 cores - in the "full" case I see many netperfs consuming 100% of a CPU. My gut is thinking that one or more netperf contexts gets stuck doing something on behalf of others. There is also ksoftirqd time for a few of those processes. Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which does represent an improvement. rick jones PS - just to be certain that running-out of fd's didn't skew the results I'm rerunning the script with ulimit -n 10240 and will see if that changes the results any. And I suppose I need to go fd leak hunting in netperf omni code :( ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 19:24 ` Rick Jones @ 2009-01-27 22:17 ` Eric Dumazet 2009-01-27 22:29 ` Rick Jones 2009-01-28 13:55 ` Eric Dumazet 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-27 22:17 UTC (permalink / raw) To: Rick Jones Cc: Netfilter Developers, Patrick McHardy, Linux Network Development list, Stephen Hemminger Rick Jones a écrit : >>> I will give it a try and let folks know the results - unless told >>> otherwise, I will ass-u-me I only need rerun the "full_iptables" >>> test case. >> >> >> The runemomniagg2.sh script is still running, but the initial cycles >> profile suggests that the main change is converting the write_lock >> time into spinlock contention time with 78.39% of the cycles spent in >> ia64_spinlock_contention. When the script completes I'll upload the >> profiles and the netperf results to the same base URL as in the >> basenote under "contrack01/" > > The script completed - although at some point I hit an fd limit - I > think I have an fd leak in netperf somewhere :( . > > Anyhow, there are still some netperfs that end-up kicking the bucket > during the run - I suspect starvation because where in the other > configs (no iptables, and empty iptables) each netperf seems to > consume about 50% of a CPU - stands to reason - 64 netperfs, 32 cores > - in the "full" case I see many netperfs consuming 100% of a CPU. My > gut is thinking that one or more netperf contexts gets stuck doing > something on behalf of others. There is also ksoftirqd time for a few > of those processes. > > Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which > does represent an improvement. > > rick jones > > PS - just to be certain that running-out of fd's didn't skew the > results I'm rerunning the script with ulimit -n 10240 and will see if > that changes the results any. And I suppose I need to go fd leak > hunting in netperf omni code :( > -- > Thanks for the report If you have so much contention on spinlocks, maybe hash function is not good at all... hash = (unsigned long)ct; hash ^= hash >> 16; hash ^= hash >> 8; I ass-u-me you compiled your kernel with NR_CPUS=32 ? -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 22:17 ` Eric Dumazet @ 2009-01-27 22:29 ` Rick Jones 2009-01-27 22:34 ` Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Rick Jones @ 2009-01-27 22:29 UTC (permalink / raw) To: Eric Dumazet Cc: Netfilter Developers, Patrick McHardy, Linux Network Development list, Stephen Hemminger > Thanks for the report > > If you have so much contention on spinlocks, maybe hash function is not > good at all... > > hash = (unsigned long)ct; > hash ^= hash >> 16; > hash ^= hash >> 8; > > I ass-u-me you compiled your kernel with NR_CPUS=32 ? I believe so - CONFIG_NR_CPUS in .config is 64 anyway. If there is a more definitive place to check I'd be happy to look. rick ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 22:29 ` Rick Jones @ 2009-01-27 22:34 ` Eric Dumazet 2009-01-27 22:43 ` Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-27 22:34 UTC (permalink / raw) To: Rick Jones Cc: Netfilter Developers, Patrick McHardy, Linux Network Development list, Stephen Hemminger Rick Jones a écrit : > >> Thanks for the report >> >> If you have so much contention on spinlocks, maybe hash function is >> not good at all... >> >> hash = (unsigned long)ct; >> hash ^= hash >> 16; >> hash ^= hash >> 8; >> >> I ass-u-me you compiled your kernel with NR_CPUS=32 ? > > I believe so - CONFIG_NR_CPUS in .config is 64 anyway. If there is a > more definitive place to check I'd be happy to look. > > rick > > By any chance, are you using SLAB instead of SLUB ? While running your netperf session, issuing : grep conntrack /proc/slabinfo Would help to know hash dispersion on your machine. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 22:34 ` Eric Dumazet @ 2009-01-27 22:43 ` Rick Jones 0 siblings, 0 replies; 42+ messages in thread From: Rick Jones @ 2009-01-27 22:43 UTC (permalink / raw) To: Eric Dumazet Cc: Netfilter Developers, Patrick McHardy, Linux Network Development list, Stephen Hemminger > > By any chance, are you using SLAB instead of SLUB ? Seems that way: sut42:~/net-next-2.6# grep SLAB .config CONFIG_SLAB=y CONFIG_SLABINFO=y # CONFIG_DEBUG_SLAB is not set sut42:~/net-next-2.6# grep SLUB .config # CONFIG_SLUB is not set The config file should be up under the base URL: ftp://ftp.netperf.org/iptable_scaling The system started as a Debian Lenny with a 2.6.26 kernel, and I did a make oldconfig and mostly took defaults (save multiq) > While running your netperf session, issuing : > grep conntrack /proc/slabinfo > > Would help to know hash dispersion on your machine. sut42:~/net-next-2.6# grep conntrack /proc/slabinfo nf_conntrack_expect 0 0 240 66 1 : tunables 120 60 8 : slabdata 0 0 0 nf_conntrack 276 742 304 53 1 : tunables 54 27 8 : slabdata 14 14 0 about a minute later it is: grep conntrack /proc/slabinfo nf_conntrack_expect 0 0 240 66 1 : tunables 120 60 8 : slabdata 0 0 0 nf_conntrack 315 742 304 53 1 : tunables 54 27 8 : slabdata 14 14 0 after about another 40 to 60 seconds it was: grep conntrack /proc/slabinfo nf_conntrack_expect 0 0 240 66 1 : tunables 120 60 8 : slabdata 0 0 0 nf_conntrack 314 742 304 53 1 : tunables 54 27 8 : slabdata 14 14 7 ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 19:24 ` Rick Jones 2009-01-27 22:17 ` Eric Dumazet @ 2009-01-28 13:55 ` Eric Dumazet 2009-01-28 16:25 ` Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-28 13:55 UTC (permalink / raw) To: Rick Jones Cc: Netfilter Developers, Patrick McHardy, Linux Network Development list, Stephen Hemminger Rick Jones a écrit : >>> I will give it a try and let folks know the results - unless told >>> otherwise, I will ass-u-me I only need rerun the "full_iptables" test >>> case. >> >> >> The runemomniagg2.sh script is still running, but the initial cycles >> profile suggests that the main change is converting the write_lock >> time into spinlock contention time with 78.39% of the cycles spent in >> ia64_spinlock_contention. When the script completes I'll upload the >> profiles and the netperf results to the same base URL as in the >> basenote under "contrack01/" > > The script completed - although at some point I hit an fd limit - I > think I have an fd leak in netperf somewhere :( . > > Anyhow, there are still some netperfs that end-up kicking the bucket > during the run - I suspect starvation because where in the other configs > (no iptables, and empty iptables) each netperf seems to consume about > 50% of a CPU - stands to reason - 64 netperfs, 32 cores - in the "full" > case I see many netperfs consuming 100% of a CPU. My gut is thinking > that one or more netperf contexts gets stuck doing something on behalf > of others. There is also ksoftirqd time for a few of those processes. > > Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which > does represent an improvement. > Yes indeed you have a speedup, tcp conntracking is OK. You now hit the nf_conntrack_lock spinlock we have in generic conntrack code (net/netfilter/nf_conntrack_core.c) nf_ct_refresh_acct() for instance has to lock it. We really want some finer locking here. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-28 13:55 ` Eric Dumazet @ 2009-01-28 16:25 ` Patrick McHardy 2009-01-28 17:07 ` Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Patrick McHardy @ 2009-01-28 16:25 UTC (permalink / raw) To: Eric Dumazet Cc: Rick Jones, Netfilter Developers, Linux Network Development list, Stephen Hemminger Eric Dumazet wrote: > Rick Jones a écrit : >> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which >> does represent an improvement. >> > > Yes indeed you have a speedup, tcp conntracking is OK. > > You now hit the nf_conntrack_lock spinlock we have in generic conntrack code > (net/netfilter/nf_conntrack_core.c) > > nf_ct_refresh_acct() for instance has to lock it. > > We really want some finer locking here. That looks more complicated since it requires to take multiple locks occasionally (f.i. hash insertion, potentially helper-related and expectation-related stuff), and there is the unconfirmed_list, where fine-grained locking can't really be used without changing it to a hash. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-28 16:25 ` Patrick McHardy @ 2009-01-28 17:07 ` Eric Dumazet 2009-01-28 17:34 ` Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-28 17:07 UTC (permalink / raw) To: Patrick McHardy Cc: Rick Jones, Netfilter Developers, Linux Network Development list, Stephen Hemminger Patrick McHardy a écrit : > Eric Dumazet wrote: >> Rick Jones a écrit : >>> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which >>> does represent an improvement. >>> >> >> Yes indeed you have a speedup, tcp conntracking is OK. >> >> You now hit the nf_conntrack_lock spinlock we have in generic >> conntrack code (net/netfilter/nf_conntrack_core.c) >> >> nf_ct_refresh_acct() for instance has to lock it. >> >> We really want some finer locking here. > > That looks more complicated since it requires to take multiple locks > occasionally (f.i. hash insertion, potentially helper-related and > expectation-related stuff), and there is the unconfirmed_list, where > fine-grained locking can't really be used without changing it to > a hash. > Yes its more complicated, but look what we did in 2.6.29 for tcp/udp sockets, using RCU to have lockless lookups. Yes, we still take a lock when doing an insert or delete at socket bind/unbind time. We could keep a central nf_conntrack_lock to guard insertions/deletes from hash and unconfirmed_list. But *normal* packets that only need to change state of one particular connection could use RCU (without spinlock) to locate the conntrack, then lock the found conntrack to perform all state changes. ^ permalink raw reply [flat|nested] 42+ messages in thread
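To make the scheme Eric is describing concrete, here is a generic sketch of the pattern (made-up names, not the conntrack code, and it glosses over the unconfirmed list, helpers and expectations that Patrick points out): lookups run under RCU with no global lock, state changes are serialized on a per-object spinlock, and a central lock is kept only for the rare insert and remove paths.

#include <linux/rculist.h>
#include <linux/spinlock.h>

struct example_conn {
	struct hlist_node	node;
	spinlock_t		lock;	/* guards this connection's state only */
	unsigned long		key;
	bool			dead;	/* set by the remover, under ->lock */
	int			state;
};

static struct hlist_head example_hash[256];

/* find a connection without any global lock, then pin it with its own lock;
 * the object must be freed via RCU (or SLAB_DESTROY_BY_RCU plus a re-check)
 * for the window between lookup and spin_lock() to be safe */
static struct example_conn *example_lookup_and_lock(unsigned long key)
{
	struct example_conn *c;
	struct hlist_node *n;

	rcu_read_lock();
	/* 2.6.x-era four-argument form of the RCU hlist iterator */
	hlist_for_each_entry_rcu(c, n, &example_hash[key & 255], node) {
		if (c->key != key)
			continue;
		spin_lock(&c->lock);
		if (c->dead) {		/* lost a race with removal */
			spin_unlock(&c->lock);
			break;
		}
		rcu_read_unlock();
		return c;	/* caller updates ->state, then spin_unlock() */
	}
	rcu_read_unlock();
	return NULL;
}

Insertion and removal would still take the central lock; the remover sets ->dead and unlinks the node before handing the object to RCU for freeing.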
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-28 17:07 ` Eric Dumazet @ 2009-01-28 17:34 ` Eric Dumazet 2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet 2009-02-09 14:57 ` 32 core net-next stack/netfilter "scaling" Patrick McHardy 0 siblings, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2009-01-28 17:34 UTC (permalink / raw) To: Patrick McHardy Cc: Rick Jones, Netfilter Developers, Linux Network Development list, Stephen Hemminger Eric Dumazet a écrit : > Patrick McHardy a écrit : >> Eric Dumazet wrote: >>> Rick Jones a écrit : >>>> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which >>>> does represent an improvement. >>>> >>> Yes indeed you have a speedup, tcp conntracking is OK. >>> >>> You now hit the nf_conntrack_lock spinlock we have in generic >>> conntrack code (net/netfilter/nf_conntrack_core.c) >>> >>> nf_ct_refresh_acct() for instance has to lock it. >>> >>> We really want some finer locking here. >> That looks more complicated since it requires to take multiple locks >> occasionally (f.i. hash insertion, potentially helper-related and >> expectation-related stuff), and there is the unconfirmed_list, where >> fine-grained locking can't really be used without changing it to >> a hash. >> > > Yes its more complicated, but look what we did in 2.6.29 for tcp/udp > sockets, using RCU to have lockless lookups. > Yes, we still take a lock when doing an insert or delete at socket > bind/unbind time. > > We could keep a central nf_conntrack_lock to guard insertions/deletes > from hash and unconfirmed_list. > > But *normal* packets that only need to change state of one particular > connection could use RCU (without spinlock) to locate the conntrack, > then lock the found conntrack to perform all state changes. Well... RCU is already used by conntrack :) Maybe only __nf_ct_refresh_acct() needs not taking nf_conntrack_lock -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-28 17:34 ` Eric Dumazet @ 2009-01-29 15:31 ` Eric Dumazet 2009-01-30 15:47 ` Andi Kleen 2009-02-20 10:02 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet 2009-02-09 14:57 ` 32 core net-next stack/netfilter "scaling" Patrick McHardy 1 sibling, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2009-01-29 15:31 UTC (permalink / raw) To: Patrick McHardy Cc: David S. Miller, Netfilter Developers, Linux Network Development list While doing oprofile tests I noticed two loops are not properly unrolled by gcc Using a hand coded unrolled loop provides nice speedup : ipt_do_table credited of 2.52 % of cpu instead of 3.29 % in tbench. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/ipv4/netfilter/ip_tables.c | 32 +++++++++++++++++-------------- 1 files changed, 18 insertions(+), 14 deletions(-) diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index ef8b6ca..7bdb2cf 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -65,6 +65,22 @@ do { \ #define inline #endif +static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask) +{ + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + return ret; +} + /* We keep a set of rules for each CPU, so we can avoid write-locking them in the softirq when updating the counters and therefore @@ -83,7 +99,6 @@ ip_packet_match(const struct iphdr *ip, const struct ipt_ip *ipinfo, int isfrag) { - size_t i; unsigned long ret; #define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg))) @@ -103,13 +118,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - /* Look for ifname matches; this should unroll nicely. */ - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)indev)[i] - ^ ((const unsigned long *)ipinfo->iniface)[i]) - & ((const unsigned long *)ipinfo->iniface_mask)[i]; - } - + ret = ifname_compare(indev, ipinfo->iniface, ipinfo->iniface_mask); if (FWINV(ret != 0, IPT_INV_VIA_IN)) { dprintf("VIA in mismatch (%s vs %s).%s\n", indev, ipinfo->iniface, @@ -117,12 +126,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)outdev)[i] - ^ ((const unsigned long *)ipinfo->outiface)[i]) - & ((const unsigned long *)ipinfo->outiface_mask)[i]; - } - + ret = ifname_compare(outdev, ipinfo->outiface, ipinfo->outiface_mask); if (FWINV(ret != 0, IPT_INV_VIA_OUT)) { dprintf("VIA out mismatch (%s vs %s).%s\n", outdev, ipinfo->outiface, ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet @ 2009-01-30 15:47 ` Andi Kleen 2009-01-30 16:54 ` Eric Dumazet 2009-02-20 10:02 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet 1 sibling, 1 reply; 42+ messages in thread From: Andi Kleen @ 2009-01-30 15:47 UTC (permalink / raw) To: Eric Dumazet Cc: Patrick McHardy, David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet <dada1@cosmosbay.com> writes: > While doing oprofile tests I noticed two loops are not properly unrolled by gcc That's because nobody passed -funroll-loops. Did you try that for that file? Likely will need -O2 too > +static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask) > +{ > + const unsigned long *a = (const unsigned long *)_a; > + const unsigned long *b = (const unsigned long *)_b; > + const unsigned long *mask = (const unsigned long *)_mask; > + unsigned long ret; > + > + ret = (a[0] ^ b[0]) & mask[0]; > + ret |= (a[1] ^ b[1]) & mask[1]; > + if (IFNAMSIZ > 2 * sizeof(unsigned long)) > + ret |= (a[2] ^ b[2]) & mask[2]; > + if (IFNAMSIZ > 3 * sizeof(unsigned long)) > + ret |= (a[3] ^ b[3]) & mask[3]; That will silently break for IFNAMSIZ >= 4*sizeof(unsigned long) You should add a dummy loop for that or at least a BUILD_BUG_ON -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-30 15:47 ` Andi Kleen @ 2009-01-30 16:54 ` Eric Dumazet 2009-01-30 17:27 ` Andi Kleen 0 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-30 16:54 UTC (permalink / raw) To: Andi Kleen Cc: Patrick McHardy, David S. Miller, Netfilter Developers, Linux Network Development list Andi Kleen a écrit : > Eric Dumazet <dada1@cosmosbay.com> writes: > >> While doing oprofile tests I noticed two loops are not properly unrolled by gcc > > That's because nobody passed -funroll-loops. Did you try that for > that file? Likely will need -O2 too I dont want to unroll all loops, only those two :) I wish gcc (4.3.2 here) was litle bit smarter :( Without using -funroll-loops size ipv4/netfilter/ip_tables.o text data bss dec hex filename 6424 368 16 6808 1a98 net/ipv4/netfilter/ip_tables.o With -funroll-loops : size ipv4/netfilter/ip_tables.o text data bss dec hex filename 7144 368 16 7528 1d68 net/ipv4/netfilter/ip_tables.o With my patch and no -funroll-loops text data bss dec hex filename 6488 368 16 6872 1ad8 net/ipv4/netfilter/ip_tables.o > >> +static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask) >> +{ >> + const unsigned long *a = (const unsigned long *)_a; >> + const unsigned long *b = (const unsigned long *)_b; >> + const unsigned long *mask = (const unsigned long *)_mask; >> + unsigned long ret; >> + >> + ret = (a[0] ^ b[0]) & mask[0]; >> + ret |= (a[1] ^ b[1]) & mask[1]; >> + if (IFNAMSIZ > 2 * sizeof(unsigned long)) >> + ret |= (a[2] ^ b[2]) & mask[2]; >> + if (IFNAMSIZ > 3 * sizeof(unsigned long)) >> + ret |= (a[3] ^ b[3]) & mask[3]; > > That will silently break for IFNAMSIZ >= 4*sizeof(unsigned long) > You should add a dummy loop for that or at least a BUILD_BUG_ON It will also break the day we port linux to a 128 bits machine :) Thanks Andi (By the way, I still use the patch on arch/x86/oprofile/op_model_ppro.c to have a working oprofile on my dev machine...) 
[PATCH] netfilter: unfold two critical loops in ip_packet_match() While doing oprofile tests I noticed two loops are not properly unrolled by gcc Using hand coded unrolled loop provides nice speedup : ipt_do_table credited of 2.52 % of cpu instead of 3.29 % in tbench, for a small text size increase (62 bytes for both loops) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/ipv4/netfilter/ip_tables.c | 34 ++++++++++++++++++------------- 1 files changed, 20 insertions(+), 14 deletions(-) diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index ef8b6ca..9298d0a 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -65,6 +65,24 @@ do { \ #define inline #endif +static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask) +{ + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + if (IFNAMSIZ > sizeof(unsigned long)) + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long)); + return ret; +} + /* We keep a set of rules for each CPU, so we can avoid write-locking them in the softirq when updating the counters and therefore @@ -83,7 +101,6 @@ ip_packet_match(const struct iphdr *ip, const struct ipt_ip *ipinfo, int isfrag) { - size_t i; unsigned long ret; #define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg))) @@ -103,13 +120,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - /* Look for ifname matches; this should unroll nicely. */ - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)indev)[i] - ^ ((const unsigned long *)ipinfo->iniface)[i]) - & ((const unsigned long *)ipinfo->iniface_mask)[i]; - } - + ret = ifname_compare(indev, ipinfo->iniface, ipinfo->iniface_mask); if (FWINV(ret != 0, IPT_INV_VIA_IN)) { dprintf("VIA in mismatch (%s vs %s).%s\n", indev, ipinfo->iniface, @@ -117,12 +128,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)outdev)[i] - ^ ((const unsigned long *)ipinfo->outiface)[i]) - & ((const unsigned long *)ipinfo->outiface_mask)[i]; - } - + ret = ifname_compare(outdev, ipinfo->outiface, ipinfo->outiface_mask); if (FWINV(ret != 0, IPT_INV_VIA_OUT)) { dprintf("VIA out mismatch (%s vs %s).%s\n", outdev, ipinfo->outiface, -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-30 16:54 ` Eric Dumazet @ 2009-01-30 17:27 ` Andi Kleen 2009-01-30 17:27 ` Eric Dumazet 2009-02-09 13:41 ` Patrick McHardy 0 siblings, 2 replies; 42+ messages in thread From: Andi Kleen @ 2009-01-30 17:27 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, Patrick McHardy, David S. Miller, Netfilter Developers, Linux Network Development list On Fri, Jan 30, 2009 at 05:54:10PM +0100, Eric Dumazet wrote: > Andi Kleen a écrit : > > Eric Dumazet <dada1@cosmosbay.com> writes: > > > >> While doing oprofile tests I noticed two loops are not properly unrolled by gcc > > > > That's because nobody passed -funroll-loops. Did you try that for > > that file? Likely will need -O2 too > > I dont want to unroll all loops, only those two :) gcc 4.4 will have a way to do that per function, but earlier you would need to move it to a separate file and specify the option only for that. Doing so would be still a good idea compared to your patch because the code will be cleaner and might be more adaptable to future architectures (such manual tunings tend to outdate) > I wish gcc (4.3.2 here) was litle bit smarter :( It cannot do much without profile feedback because it has no clue which loops are hot and which are not. > (By the way, I still use the patch on arch/x86/oprofile/op_model_ppro.c > to have a working oprofile on my dev machine...) Yes I know, sorry for that. -Andi -- ak@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-30 17:27 ` Andi Kleen @ 2009-01-30 17:27 ` Eric Dumazet 2009-01-30 17:50 ` Andi Kleen 2009-02-09 13:41 ` Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-30 17:27 UTC (permalink / raw) To: Andi Kleen Cc: Patrick McHardy, David S. Miller, Netfilter Developers, Linux Network Development list Andi Kleen a écrit : > On Fri, Jan 30, 2009 at 05:54:10PM +0100, Eric Dumazet wrote: >> Andi Kleen a écrit : >>> Eric Dumazet <dada1@cosmosbay.com> writes: >>> >>>> While doing oprofile tests I noticed two loops are not properly unrolled by gcc >>> That's because nobody passed -funroll-loops. Did you try that for >>> that file? Likely will need -O2 too >> I dont want to unroll all loops, only those two :) > > gcc 4.4 will have a way to do that per function, but earlier > you would need to move it to a separate file and specify > the option only for that. > > Doing so would be still a good idea compared to your > patch because the code will be cleaner and might > be more adaptable to future architectures > (such manual tunings tend to outdate) So... you suggest me to split file, and use a "-funroll-loops", that might doing strange things on some arches / compilers... This is far too complicated and risky imho. Check compare_ether_addr_64bits() definition in include/linux/etherdevice.h for a truly unreadable code :) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-30 17:27 ` Eric Dumazet @ 2009-01-30 17:50 ` Andi Kleen 0 siblings, 0 replies; 42+ messages in thread From: Andi Kleen @ 2009-01-30 17:50 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, Patrick McHardy, David S. Miller, Netfilter Developers, Linux Network Development list > So... you suggest me to split file, and use a "-funroll-loops", > that might doing strange things on some arches / compilers... I don't think it will. -funroll-loops is pretty clearly defined. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 42+ messages in thread
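For reference, the per-function mechanism Andi mentioned earlier in this subthread is the optimize attribute that arrived with GCC 4.4; the sketch below shows what using it here could look like. This is an illustration, not a proposed patch: the function name is made up, and whether the attribute interacts cleanly with the kernel's normal compiler flags is a separate question. The per-file alternative would be a kbuild line such as CFLAGS_ip_tables.o += -funroll-loops in net/ipv4/netfilter/Makefile.

#ifndef IFNAMSIZ
#define IFNAMSIZ 16	/* matches the kernel's definition */
#endif

/* ask gcc (>= 4.4) to apply -funroll-loops to this one function only */
static unsigned long __attribute__((optimize("unroll-loops")))
ifname_compare_attr(const char *_a, const char *_b, const char *_mask)
{
	const unsigned long *a = (const unsigned long *)_a;
	const unsigned long *b = (const unsigned long *)_b;
	const unsigned long *mask = (const unsigned long *)_mask;
	unsigned long ret = 0;
	unsigned int i;

	/* the same loop the patches replace; gcc is now free to unroll it */
	for (i = 0; i < IFNAMSIZ / sizeof(unsigned long); i++)
		ret |= (a[i] ^ b[i]) & mask[i];
	return ret;
}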
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-30 17:27 ` Andi Kleen 2009-01-30 17:27 ` Eric Dumazet @ 2009-02-09 13:41 ` Patrick McHardy 2009-02-18 15:10 ` Eric Dumazet 1 sibling, 1 reply; 42+ messages in thread From: Patrick McHardy @ 2009-02-09 13:41 UTC (permalink / raw) To: Andi Kleen Cc: Eric Dumazet, David S. Miller, Netfilter Developers, Linux Network Development list Andi Kleen wrote: > On Fri, Jan 30, 2009 at 05:54:10PM +0100, Eric Dumazet wrote: >> Andi Kleen a écrit : >>> Eric Dumazet <dada1@cosmosbay.com> writes: >>> >>>> While doing oprofile tests I noticed two loops are not properly unrolled by gcc >>> That's because nobody passed -funroll-loops. Did you try that for >>> that file? Likely will need -O2 too >> I dont want to unroll all loops, only those two :) > > gcc 4.4 will have a way to do that per function, but earlier > you would need to move it to a separate file and specify > the option only for that. > > Doing so would be still a good idea compared to your > patch because the code will be cleaner and might > be more adaptable to future architectures > (such manual tunings tend to outdate) The interface name matching has shown up in profiles forever though and we've actually already tried to optimize it IIRC. Eric, I'm trying to keep all the *tables files synchronized, could you send me a patch updating the other ones as well please? -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-02-09 13:41 ` Patrick McHardy @ 2009-02-18 15:10 ` Eric Dumazet 2009-02-18 15:21 ` Patrick McHardy 0 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-02-18 15:10 UTC (permalink / raw) To: Patrick McHardy Cc: Andi Kleen, David S. Miller, Netfilter Developers, Linux Network Development list Patrick McHardy a écrit : > > The interface name matching has shown up in profiles forever > though and we've actually already tried to optimize it IIRC. > > Eric, I'm trying to keep all the *tables files synchronized, > could you send me a patch updating the other ones as well > please? While doing this, I found arp_tables is still using loop using byte operations. Also, I could not find how iniface_mask[], outiface_mask[], iniface[] and outiface[] were forced to long word alignment ... (in struct ipt_ip, struct ip6t_ip6, struct arpt_arp) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-02-18 15:10 ` Eric Dumazet @ 2009-02-18 15:21 ` Patrick McHardy 2009-02-18 16:33 ` Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Patrick McHardy @ 2009-02-18 15:21 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > Patrick McHardy a écrit : >> The interface name matching has shown up in profiles forever >> though and we've actually already tried to optimize it IIRC. >> >> Eric, I'm trying to keep all the *tables files synchronized, >> could you send me a patch updating the other ones as well >> please? > > While doing this, I found arp_tables is still using loop using > byte operations. > > Also, I could not find how iniface_mask[], outiface_mask[], iniface[] > and outiface[] were forced to long word alignment ... > > (in struct ipt_ip, struct ip6t_ip6, struct arpt_arp) In case of IPv4 and IPv6 they are already suitable aligned, it simply performing the comparison in unsigned long quantities. struct arpt_arp unfortunately doesn't properly align the interface names, so we need to continue to do byte-wise comparisons. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-02-18 15:21 ` Patrick McHardy @ 2009-02-18 16:33 ` Eric Dumazet 2009-02-18 16:52 ` Patrick McHardy 2009-02-18 17:36 ` [PATCH] netfilter: xt_physdev fixes Eric Dumazet 0 siblings, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2009-02-18 16:33 UTC (permalink / raw) To: Patrick McHardy Cc: Andi Kleen, David S. Miller, Netfilter Developers, Linux Network Development list Patrick McHardy a écrit : > Eric Dumazet wrote: >> Patrick McHardy a écrit : >>> The interface name matching has shown up in profiles forever >>> though and we've actually already tried to optimize it IIRC. >>> >>> Eric, I'm trying to keep all the *tables files synchronized, >>> could you send me a patch updating the other ones as well >>> please? >> >> While doing this, I found arp_tables is still using loop using >> byte operations. >> >> Also, I could not find how iniface_mask[], outiface_mask[], iniface[] >> and outiface[] were forced to long word alignment ... >> (in struct ipt_ip, struct ip6t_ip6, struct arpt_arp) > > In case of IPv4 and IPv6 they are already suitable aligned, it > simply performing the comparison in unsigned long quantities. > struct arpt_arp unfortunately doesn't properly align the interface > names, so we need to continue to do byte-wise comparisons. > > I see, but #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS can help here ;) ifname_compare() should be static in three files (ipv4_ip_tables, ipv6_ip_tables and arp_tables), since only arp_tables variant has the alignement problem. [PATCH] netfilter: unfold two critical loops in arp_packet_match() x86 and powerpc can perform long word accesses in an efficient maner. We can use this to unroll two loops in arp_packet_match(), to perform arithmetic on long words instead of bytes. This is a win on x86_64 for example. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/ipv4/netfilter/arp_tables.c | 44 +++++++++++++++++++++++------- 1 files changed, 34 insertions(+), 10 deletions(-) diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 7ea88b6..b5db463 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -73,6 +73,36 @@ static inline int arp_devaddr_compare(const struct arpt_devaddr_info *ap, return (ret != 0); } +/* + * Unfortunatly, _b and _mask are not aligned to an int (or long int) + * Some arches dont care, unrolling the loop is a win on them. + */ +static unsigned long ifname_compare(const char *_a, const char *_b, const char *_mask) +{ +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + if (IFNAMSIZ > sizeof(unsigned long)) + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long)); +#else + unsigned long ret = 0; + int i; + + for (i = 0; i < IFNAMSIZ; i++) + ret |= (_a[i] ^ _b[i]) & _mask[i]; +#endif + return ret; +} + /* Returns whether packet matches rule or not. 
*/ static inline int arp_packet_match(const struct arphdr *arphdr, struct net_device *dev, @@ -83,7 +113,7 @@ static inline int arp_packet_match(const struct arphdr *arphdr, const char *arpptr = (char *)(arphdr + 1); const char *src_devaddr, *tgt_devaddr; __be32 src_ipaddr, tgt_ipaddr; - int i, ret; + long ret; #define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg))) @@ -156,10 +186,7 @@ static inline int arp_packet_match(const struct arphdr *arphdr, } /* Look for ifname matches. */ - for (i = 0, ret = 0; i < IFNAMSIZ; i++) { - ret |= (indev[i] ^ arpinfo->iniface[i]) - & arpinfo->iniface_mask[i]; - } + ret = ifname_compare(indev, arpinfo->iniface, arpinfo->iniface_mask); if (FWINV(ret != 0, ARPT_INV_VIA_IN)) { dprintf("VIA in mismatch (%s vs %s).%s\n", @@ -168,10 +195,7 @@ static inline int arp_packet_match(const struct arphdr *arphdr, return 0; } - for (i = 0, ret = 0; i < IFNAMSIZ; i++) { - ret |= (outdev[i] ^ arpinfo->outiface[i]) - & arpinfo->outiface_mask[i]; - } + ret = ifname_compare(outdev, arpinfo->outiface, arpinfo->outiface_mask); if (FWINV(ret != 0, ARPT_INV_VIA_OUT)) { dprintf("VIA out mismatch (%s vs %s).%s\n", @@ -221,7 +245,7 @@ unsigned int arpt_do_table(struct sk_buff *skb, const struct net_device *out, struct xt_table *table) { - static const char nulldevname[IFNAMSIZ]; + static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long)))); unsigned int verdict = NF_DROP; const struct arphdr *arp; bool hotdrop = false; ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-02-18 16:33 ` Eric Dumazet @ 2009-02-18 16:52 ` Patrick McHardy 2009-02-18 17:36 ` [PATCH] netfilter: xt_physdev fixes Eric Dumazet 1 sibling, 0 replies; 42+ messages in thread From: Patrick McHardy @ 2009-02-18 16:52 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > Patrick McHardy a écrit : >> In case of IPv4 and IPv6 they are already suitable aligned, it >> simply performing the comparison in unsigned long quantities. >> struct arpt_arp unfortunately doesn't properly align the interface >> names, so we need to continue to do byte-wise comparisons. >> > I see, but #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS can help here ;) get_unaligned() would work as well I guess. But we don't seem to have a get_unaligned_long(). > ifname_compare() should be static in three files (ipv4_ip_tables, ipv6_ip_tables and arp_tables), > since only arp_tables variant has the alignement problem. > > [PATCH] netfilter: unfold two critical loops in arp_packet_match() > > x86 and powerpc can perform long word accesses in an efficient maner. > We can use this to unroll two loops in arp_packet_match(), to > perform arithmetic on long words instead of bytes. This is a win > on x86_64 for example. This looks good to me. Applied, thanks. ^ permalink raw reply [flat|nested] 42+ messages in thread
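As an aside, a get_unaligned_long()-style helper of the kind Patrick says is missing could be sketched on top of the existing get_unaligned() macro. Everything below is hypothetical (the thread does not propose it), and Eric's applied patch sidesteps the question with CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS instead.

#include <linux/if.h>
#include <asm/unaligned.h>

/* hypothetical helper, not present in the tree at the time of the thread */
static inline unsigned long get_unaligned_long_example(const void *p)
{
	return get_unaligned((const unsigned long *)p);
}

/* how arp_packet_match() could then use word compares even though
 * struct arpt_arp does not long-align the interface names */
static unsigned long ifname_compare_unaligned_example(const char *_a,
		const char *_b, const char *_mask)
{
	unsigned long ret = 0;
	unsigned int i;

	for (i = 0; i < IFNAMSIZ; i += sizeof(unsigned long))
		ret |= (get_unaligned_long_example(_a + i) ^
			get_unaligned_long_example(_b + i)) &
		       get_unaligned_long_example(_mask + i);
	return ret;
}

On architectures without efficient unaligned loads this would expand to byte-wise accesses anyway, so it is not obviously better than the byte loop it replaces.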
* [PATCH] netfilter: xt_physdev fixes 2009-02-18 16:33 ` Eric Dumazet 2009-02-18 16:52 ` Patrick McHardy @ 2009-02-18 17:36 ` Eric Dumazet 2009-02-18 18:14 ` Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-02-18 17:36 UTC (permalink / raw) To: Patrick McHardy Cc: David S. Miller, Netfilter Developers, Linux Network Development list Hi Patrick Please find following patch to prepare next one (loop unrolling) I am not sure xt_physdev_info is aligned, either on an int or a long, so please check my assertion before applying :) Thank you [PATCH] netfilter: xt_physdev fixes 1) physdev_mt() incorrectly assumes nulldevname[] is aligned on an int 2) It also uses word comparisons, while it could use long word ones. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c index 1bcdfc1..4b13ef7 100644 --- a/net/netfilter/xt_physdev.c +++ b/net/netfilter/xt_physdev.c @@ -24,9 +24,9 @@ static bool physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par) { int i; - static const char nulldevname[IFNAMSIZ]; + static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long)))); const struct xt_physdev_info *info = par->matchinfo; - bool ret; + unsigned long ret; const char *indev, *outdev; const struct nf_bridge_info *nf_bridge; @@ -68,10 +68,10 @@ physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par) if (!(info->bitmask & XT_PHYSDEV_OP_IN)) goto match_outdev; indev = nf_bridge->physindev ? nf_bridge->physindev->name : nulldevname; - for (i = 0, ret = false; i < IFNAMSIZ/sizeof(unsigned int); i++) { - ret |= (((const unsigned int *)indev)[i] - ^ ((const unsigned int *)info->physindev)[i]) - & ((const unsigned int *)info->in_mask)[i]; + for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { + ret |= (((const unsigned long *)indev)[i] + ^ ((const unsigned long *)info->physindev)[i]) + & ((const unsigned long *)info->in_mask)[i]; } if (!ret ^ !(info->invert & XT_PHYSDEV_OP_IN)) @@ -82,13 +82,12 @@ match_outdev: return true; outdev = nf_bridge->physoutdev ? nf_bridge->physoutdev->name : nulldevname; - for (i = 0, ret = false; i < IFNAMSIZ/sizeof(unsigned int); i++) { - ret |= (((const unsigned int *)outdev)[i] - ^ ((const unsigned int *)info->physoutdev)[i]) - & ((const unsigned int *)info->out_mask)[i]; + for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { + ret |= (((const unsigned long *)outdev)[i] + ^ ((const unsigned long *)info->physoutdev)[i]) + & ((const unsigned long *)info->out_mask)[i]; } - - return ret ^ !(info->invert & XT_PHYSDEV_OP_OUT); + return (!!ret ^ !(info->invert & XT_PHYSDEV_OP_OUT)); } static bool physdev_mt_check(const struct xt_mtchk_param *par) ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: xt_physdev fixes 2009-02-18 17:36 ` [PATCH] netfilter: xt_physdev fixes Eric Dumazet @ 2009-02-18 18:14 ` Patrick McHardy 2009-02-19 8:00 ` [PATCH] netfilter: unfold two loops in physdev_mt() Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Patrick McHardy @ 2009-02-18 18:14 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > Hi Patrick > > Please find following patch to prepare next one (loop unrolling) > > I am not sure xt_physdev_info is aligned, either on an int or a long, > so please check my assertion before applying :) Yes, the private structures can assume alignment suitable for pointers and unsigned longs. > [PATCH] netfilter: xt_physdev fixes > > 1) physdev_mt() incorrectly assumes nulldevname[] is aligned on an int > > 2) It also uses word comparisons, while it could use long word ones. Applied, thanks. ^ permalink raw reply [flat|nested] 42+ messages in thread
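The alignment guarantee Patrick relies on comes from the way the *tables frameworks round every per-rule data blob up to the alignment of the widest scalar. The macro below is a generic sketch of that idea, not a quote of the kernel's own definition.

/* round a matchinfo/targinfo size up so the next structure in the rule
 * blob starts on an 8-byte boundary again (64-bit build assumed) */
#define EXAMPLE_XT_ALIGN(s) \
	(((s) + __alignof__(unsigned long long) - 1) & \
	 ~(__alignof__(unsigned long long) - 1))

/* e.g. EXAMPLE_XT_ALIGN(20) == 24: a 20-byte payload is padded so that
 * whatever follows it in the rule blob is long-aligned again */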
* [PATCH] netfilter: unfold two loops in physdev_mt() 2009-02-18 18:14 ` Patrick McHardy @ 2009-02-19 8:00 ` Eric Dumazet 2009-02-19 8:14 ` [PATCH] netfilter: unfold two loops in ip6_packet_match() Eric Dumazet 2009-02-19 10:17 ` [PATCH] netfilter: unfold two loops in physdev_mt() Patrick McHardy 0 siblings, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2009-02-19 8:00 UTC (permalink / raw) To: Patrick McHardy Cc: David S. Miller, Netfilter Developers, Linux Network Development list xt_physdev netfilter module can use an ifname_compare() helper so that two loops are unfolded. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/netfilter/xt_physdev.c | 32 +++++++++++++++++++++----------- 1 files changed, 21 insertions(+), 11 deletions(-) diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c index 4b13ef7..44a234e 100644 --- a/net/netfilter/xt_physdev.c +++ b/net/netfilter/xt_physdev.c @@ -20,10 +20,27 @@ MODULE_DESCRIPTION("Xtables: Bridge physical device match"); MODULE_ALIAS("ipt_physdev"); MODULE_ALIAS("ip6t_physdev"); +static unsigned long ifname_compare(const char *_a, const char *_b, const char *_mask) +{ + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + if (IFNAMSIZ > sizeof(unsigned long)) + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long)); + return ret; +} + static bool physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par) { - int i; static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long)))); const struct xt_physdev_info *info = par->matchinfo; unsigned long ret; @@ -68,11 +85,7 @@ physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par) if (!(info->bitmask & XT_PHYSDEV_OP_IN)) goto match_outdev; indev = nf_bridge->physindev ? nf_bridge->physindev->name : nulldevname; - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)indev)[i] - ^ ((const unsigned long *)info->physindev)[i]) - & ((const unsigned long *)info->in_mask)[i]; - } + ret = ifname_compare(indev, info->physindev, info->in_mask); if (!ret ^ !(info->invert & XT_PHYSDEV_OP_IN)) return false; @@ -82,11 +95,8 @@ match_outdev: return true; outdev = nf_bridge->physoutdev ? nf_bridge->physoutdev->name : nulldevname; - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)outdev)[i] - ^ ((const unsigned long *)info->physoutdev)[i]) - & ((const unsigned long *)info->out_mask)[i]; - } + ret = ifname_compare(outdev, info->physoutdev, info->out_mask); + return (!!ret ^ !(info->invert & XT_PHYSDEV_OP_OUT)); } ^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH] netfilter: unfold two loops in ip6_packet_match() 2009-02-19 8:00 ` [PATCH] netfilter: unfold two loops in physdev_mt() Eric Dumazet @ 2009-02-19 8:14 ` Eric Dumazet 2009-02-19 10:19 ` Patrick McHardy 2009-02-19 10:17 ` [PATCH] netfilter: unfold two loops in physdev_mt() Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-02-19 8:14 UTC (permalink / raw) To: Patrick McHardy Cc: David S. Miller, Netfilter Developers, Linux Network Development list ip6_tables netfilter module can use an ifname_compare() helper so that two loops are unfolded. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/ipv6/netfilter/ip6_tables.c | 33 +++++++++++++++++++----------- 1 files changed, 21 insertions(+), 12 deletions(-) diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index a33485d..d64594b 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -89,6 +89,25 @@ ip6t_ext_hdr(u8 nexthdr) (nexthdr == IPPROTO_DSTOPTS) ); } +static unsigned long ifname_compare(const char *_a, const char *_b, + const unsigned char *_mask) +{ + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + if (IFNAMSIZ > sizeof(unsigned long)) + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long)); + return ret; +} + /* Returns whether matches rule or not. */ /* Performance critical - called for every packet */ static inline bool @@ -99,7 +118,6 @@ ip6_packet_match(const struct sk_buff *skb, unsigned int *protoff, int *fragoff, bool *hotdrop) { - size_t i; unsigned long ret; const struct ipv6hdr *ipv6 = ipv6_hdr(skb); @@ -120,12 +138,7 @@ ip6_packet_match(const struct sk_buff *skb, return false; } - /* Look for ifname matches; this should unroll nicely. */ - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)indev)[i] - ^ ((const unsigned long *)ip6info->iniface)[i]) - & ((const unsigned long *)ip6info->iniface_mask)[i]; - } + ret = ifname_compare(indev, ip6info->iniface, ip6info->iniface_mask); if (FWINV(ret != 0, IP6T_INV_VIA_IN)) { dprintf("VIA in mismatch (%s vs %s).%s\n", @@ -134,11 +147,7 @@ ip6_packet_match(const struct sk_buff *skb, return false; } - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)outdev)[i] - ^ ((const unsigned long *)ip6info->outiface)[i]) - & ((const unsigned long *)ip6info->outiface_mask)[i]; - } + ret = ifname_compare(outdev, ip6info->outiface, ip6info->outiface_mask); if (FWINV(ret != 0, IP6T_INV_VIA_OUT)) { dprintf("VIA out mismatch (%s vs %s).%s\n", ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two loops in ip6_packet_match() 2009-02-19 8:14 ` [PATCH] netfilter: unfold two loops in ip6_packet_match() Eric Dumazet @ 2009-02-19 10:19 ` Patrick McHardy 0 siblings, 0 replies; 42+ messages in thread From: Patrick McHardy @ 2009-02-19 10:19 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > ip6_tables netfilter module can use an ifname_compare() helper > so that two loops are unfolded. Also applied, thanks Eric. Are you going to send a new patch for ip_tables.c or should I just take the old one? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two loops in physdev_mt() 2009-02-19 8:00 ` [PATCH] netfilter: unfold two loops in physdev_mt() Eric Dumazet 2009-02-19 8:14 ` [PATCH] netfilter: unfold two loops in ip6_packet_match() Eric Dumazet @ 2009-02-19 10:17 ` Patrick McHardy 1 sibling, 0 replies; 42+ messages in thread From: Patrick McHardy @ 2009-02-19 10:17 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > xt_physdev netfilter module can use an ifname_compare() helper > so that two loops are unfolded. Applied, thanks. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet 2009-01-30 15:47 ` Andi Kleen @ 2009-02-20 10:02 ` Eric Dumazet 2009-02-20 10:04 ` Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-02-20 10:02 UTC (permalink / raw) To: Patrick McHardy Cc: David S. Miller, Netfilter Developers, Linux Network Development list While doing oprofile tests I noticed two loops are not properly unrolled by gcc Using a hand coded unrolled loop provides nice speedup : ipt_do_table credited of 2.52 % of cpu instead of 3.29 % in tbench. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index 08cde5b..e5294ae 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -74,6 +74,25 @@ do { \ Hence the start of any table is given by get_table() below. */ +static unsigned long ifname_compare(const char *_a, const char *_b, + const unsigned char *_mask) +{ + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + if (IFNAMSIZ > sizeof(unsigned long)) + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long)); + return ret; +} + /* Returns whether matches rule or not. */ /* Performance critical - called for every packet */ static inline bool @@ -83,7 +102,6 @@ ip_packet_match(const struct iphdr *ip, const struct ipt_ip *ipinfo, int isfrag) { - size_t i; unsigned long ret; #define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg))) @@ -103,12 +121,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - /* Look for ifname matches; this should unroll nicely. */ - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)indev)[i] - ^ ((const unsigned long *)ipinfo->iniface)[i]) - & ((const unsigned long *)ipinfo->iniface_mask)[i]; - } + ret = ifname_compare(indev, ipinfo->iniface, ipinfo->iniface_mask); if (FWINV(ret != 0, IPT_INV_VIA_IN)) { dprintf("VIA in mismatch (%s vs %s).%s\n", @@ -117,11 +130,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)outdev)[i] - ^ ((const unsigned long *)ipinfo->outiface)[i]) - & ((const unsigned long *)ipinfo->outiface_mask)[i]; - } + ret = ifname_compare(outdev, ipinfo->outiface, ipinfo->outiface_mask); if (FWINV(ret != 0, IPT_INV_VIA_OUT)) { dprintf("VIA out mismatch (%s vs %s).%s\n", ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-02-20 10:02 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet @ 2009-02-20 10:04 ` Patrick McHardy 0 siblings, 0 replies; 42+ messages in thread From: Patrick McHardy @ 2009-02-20 10:04 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > While doing oprofile tests I noticed two loops are not properly unrolled by gcc > > Using a hand coded unrolled loop provides nice speedup : ipt_do_table > credited of 2.52 % of cpu instead of 3.29 % in tbench. Applied, thanks Eric. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-28 17:34 ` Eric Dumazet 2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet @ 2009-02-09 14:57 ` Patrick McHardy 1 sibling, 0 replies; 42+ messages in thread From: Patrick McHardy @ 2009-02-09 14:57 UTC (permalink / raw) To: Eric Dumazet Cc: Rick Jones, Netfilter Developers, Linux Network Development list, Stephen Hemminger Eric Dumazet wrote: > Eric Dumazet a écrit : >> Patrick McHardy a écrit : >>> That looks more complicated since it requires to take multiple locks >>> occasionally (f.i. hash insertion, potentially helper-related and >>> expectation-related stuff), and there is the unconfirmed_list, where >>> fine-grained locking can't really be used without changing it to >>> a hash. >>> >> Yes its more complicated, but look what we did in 2.6.29 for tcp/udp >> sockets, using RCU to have lockless lookups. >> Yes, we still take a lock when doing an insert or delete at socket >> bind/unbind time. >> >> We could keep a central nf_conntrack_lock to guard insertions/deletes >> from hash and unconfirmed_list. >> >> But *normal* packets that only need to change state of one particular >> connection could use RCU (without spinlock) to locate the conntrack, >> then lock the found conntrack to perform all state changes. > > Well... RCU is already used by conntrack :) > > Maybe only __nf_ct_refresh_acct() needs not taking nf_conntrack_lock The lock is currently used to avoid timer update races. Martin has two old patches to remove it, but they require changes to the timer code to support a mod_timer variant that doesn't enable an inactive timer: http://people.netfilter.org/gandalf/performance/patches/mod_timer_noact http://people.netfilter.org/gandalf/performance/patches/__nf_ct_refresh_acct-locking -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
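To make the race Patrick describes concrete, here is a sketch with a made-up helper: the behaviour given to example_mod_timer_noact() is only inferred from his one-line description of the referenced mod_timer_noact patch (touch the timeout only if the timer is still pending), and the open-coded version below is deliberately simplistic, which is exactly why the real thing needs support inside the timer code.

#include <linux/timer.h>
#include <linux/jiffies.h>
#include <net/netfilter/nf_conntrack.h>

/* hypothetical: update the expiry only if the timer has not already been
 * deleted by the teardown path; a real version must do this atomically
 * under the timer base lock rather than in two separate steps */
static bool example_mod_timer_noact(struct timer_list *timer,
				    unsigned long expires)
{
	if (!timer_pending(timer))
		return false;
	mod_timer(timer, expires);
	return true;
}

/* what a refresh path could then look like without nf_conntrack_lock:
 * a conntrack whose timer is already gone is simply left alone */
static void example_ct_refresh(struct nf_conn *ct, unsigned long extra_jiffies)
{
	example_mod_timer_noact(&ct->timeout, jiffies + extra_jiffies);
}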
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-26 23:10 ` Eric Dumazet 2009-01-26 23:14 ` Stephen Hemminger 2009-01-26 23:19 ` Rick Jones @ 2009-02-10 18:44 ` Stephen Hemminger 2 siblings, 0 replies; 42+ messages in thread From: Stephen Hemminger @ 2009-02-10 18:44 UTC (permalink / raw) To: Eric Dumazet Cc: Rick Jones, Linux Network Development list, Netfilter Developers, Patrick McHardy On Tue, 27 Jan 2009 00:10:44 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > Rick Jones a écrit : > > Folks - > > > > Under: > > > > ftp://ftp.netperf.org/iptable_scaling > > > > can be found netperf results and Caliper profiles for three scenarios on > > a 32-core, 1.6 GHz 'Montecito' rx8640 system. An rx8640 is what HP call > > a "cell based" system in that it is comprised of "cell boards" on which > > reside CPU and memory resources. In this case there are four cell > > boards, each with 4, dual-core Montecito processors and 1/4 of the > > overall RAM. The system was configured with a mix of cell-local and > > global interleaved memory, where the global interleave is on a cacheline > > (128 byte) boundary (IIRC). Total RAM in the system is 256 GB. The > > cells are joined via cross-bar connections. (numactl --hardware output > > is available under the URL above) > > > > There was an "I/O expander" connected to the system. This meant there > > were as many distinct PCI-X domains as there were cells, and every cell > > had a "local" set of PCI-X slots. > > > > Into those slots I placed four HP AD385A PCI-X 10Gbit Ethernet NICs - > > aka Neterion XFrame IIs. These were then connected to an HP ProCurve > > 5806 switch, which was in turn connected to three, 4P/16C, 2.3 GHz HP > > DL585 G5s, each of which had a pair of HP AD386A PCIe 10Gbit Ethernet > > NICs (Aka Chelsio T3C-based). They were running RHEL 5.2 I think. Each > > NIC was in either a PCI-X 2.0 266 MHz slot (rx8640) or a PCIe 1.mumble > > x8 slot (DL585 G5) > > > > The kernel is from DaveM's net-next tree ca last week, multiq enabled. > > The s2io driver is Neterion's out-of-tree version 2.0.36.15914 to get > > multiq support. It was loaded into the kernel via: > > > > insmod ./s2io.ko tx_steering_type=3 tx_fifo_num=8 > > > > There were then 8 tx queues and 8 rx queues per interface in the > > rx8640. The "setaffinity.txt" script was used to set the IRQ affinities > > to cores "closest" to the physical NIC. In all three tests all 32 cores > > went to 100% utilization. At least for all incense and porpoises. (there > > was some occasional idle reported by top on the full_iptables run) > > > > A set of 64, concurrent "burst mode" netperf omni RR tests (tcp) with a > > burst mode of 17 were run (ie 17 "transactions" outstanding on a > > connection at one time,) with TCP_NODELAY set and the results gathered, > > along with a set of Caliper profiles. The script used to launch these > > can be found in "runemomniagg2.sh.txt under the URL above. > > > > I picked an "RR" test to maximize the trips up and down the stack while > > minimizing the bandwidth consumed. > > > > I picked a burst size of 16 because that was sufficient to saturate a > > single core on the rx8640. > > > > I picked 64 concurrent netperfs because I wanted to make sure I had > > enough concurrent connections to get spread across all the cores/queues > > by the algorithms in place. 
> > > > I picked the combination of 64 and 16 rather than say 1024 and 0 (one > > tran at a time) because I didn't want to run a context switching > > benchmark :) > > > > The rx8640 was picked because it was available and I was confident it > > was not going to have any hardware scaling issues getting in the way. I > > wanted to see SW issues, not HW issues. I am ass-u-me-ing the rx8640 is > > a reasonable analog for any "decent or better scaling" 32 core hardware > > and that while there are ia64-specific routines present in the profiles, > > they are there for platform-independent reasons. > > > > The no_iptables/ data was run after a fresh boot, with no iptables > > commands run and so no iptables related modules loaded into the kernel. > > > > The empty_iptables/ data was run after an "iptables --list" command > > which loaded one or two modules into the kernel. > > > > The full_iptables/ data was run after an "iptables-restore" command > > pointed at full_iptables/iptables.txt which was created from what RH > > creates by default when one enables firewall via their installer, with a > > port range added by me to allow pretty much anything netperf would ask. > > As such, while it does excercise netfilter functionality, I cannot make > > any claims as to its "real world" applicability. (while the firewall > > settings came from an RH setup, FWIW, the base bits running on the > > rx8640 are Debian Lenny, with the net-next kernel on top) > > > > The "cycles" profile is able to grab flat profile hits while interrupts > > are disabled so it can see stuff happening while interrupts are > > disabled. The "scgprof" profile is an attempt to get some call graphs - > > it does not have visibility into code running with interrupts disabled. > > The "cache" profile is a profile that looks to get some cache miss > > information. > > > > So, having said all that, details can be found under the previously > > mentioned URL. Some quick highlights: > > > > no_iptables - ~22000 transactions/s/netperf. Top of the cycles profile > > looks like: > > > > Function Summary > > ----------------------------------------------------------------------- > > % Total > > IP Cumulat IP > > Samples % of Samples > > (ETB) Total (ETB) Function File > > ----------------------------------------------------------------------- > > 5.70 5.70 37772 s2io.ko::tx_intr_handler > > 5.14 10.84 34012 vmlinux::__ia64_readq > > 4.88 15.72 32285 s2io.ko::s2io_msix_ring_handle > > 4.63 20.34 30625 s2io.ko::rx_intr_handler > > 4.60 24.94 30429 s2io.ko::s2io_xmit > > 3.85 28.79 25488 s2io.ko::s2io_poll_msix > > 2.87 31.65 18987 vmlinux::dev_queue_xmit > > 2.51 34.16 16620 vmlinux::tcp_sendmsg > > 2.51 36.67 16588 vmlinux::tcp_ack > > 2.15 38.82 14221 vmlinux::__inet_lookup_established > > 2.10 40.92 13937 vmlinux::ia64_spinlock_contention > > > > empty_iptables - ~12000 transactions/s/netperf. 
> > full_iptables - some test instances didn't complete, I think they got
> > starved.  Of those which did complete, their performance ranged all the
> > way from 330 to 3100 transactions/s/netperf.  Top of the cycles profile
> > looks like:
> >
> > Function Summary
> > ------------------------------------------------------------------------
> > % Total IP      Cumulat      IP
> > Samples (ETB)   % of Total   Samples (ETB)   Function
> > ------------------------------------------------------------------------
> >   64.71          64.71        582171         vmlinux::_write_lock_bh
> >   18.43          83.14        165822         vmlinux::ia64_spinlock_contention
> >    2.86          85.99         25709         nf_conntrack.ko::init_module
> >    2.36          88.35         21194         nf_conntrack.ko::tcp_packet
> >    1.78          90.13         16009         vmlinux::_spin_lock_bh
> >    1.20          91.33         10810         nf_conntrack.ko::nf_conntrack_in
> >    1.20          92.52         10755         vmlinux::nf_iterate
> >    1.09          93.62          9833         vmlinux::default_idle
> >    0.26          93.88          2331         vmlinux::__ia64_readq
> >    0.25          94.12          2213         vmlinux::__interrupt
> >    0.24          94.37          2203         s2io.ko::tx_intr_handler
> >
> > Suggestions as to things to look at/with and/or patches to try are
> > welcome.  I should have the HW available to me for at least a little
> > while, but not indefinitely.
> >
> > rick jones
>
> Hi Rick, nice hardware you have :)
>
> Stephen had a patch to nuke read_lock() from iptables, using RCU and
> seqlocks.  I hit this contention point even with low-cost hardware and a
> quite standard application.
>
> I pinged him a few days ago to try to finish the job with him, but it
> seems Stephen is busy at the moment.
>
> Then conntrack (tcp sessions) is awful, since it uses a single rwlock_t
> tcp_lock that must be write_locked() for basically every handled tcp
> frame...
>
> How long is "not indefinitely"?

Does anyone have a fix for this bottleneck?

^ permalink raw reply	[flat|nested] 42+ messages in thread
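The read_lock() removal Eric refers to follows the usual kernel pattern for
read-mostly data: the per-packet hot path takes only an RCU read-side
critical section and dereferences the current ruleset pointer, while the
rare ruleset-replacement path publishes a new table and waits for readers
to drain; rule counters then need per-CPU storage or a seqlock to be
snapshotted consistently, which is the harder half.  The same reasoning
applies to the conntrack tcp_lock he mentions: any lock that must be
written for every frame becomes exactly the serialization point the
full_iptables profile shows (64.71% of samples in _write_lock_bh).  Below
is a minimal sketch of the RCU half of that pattern; the names
(struct pkt_table, cur_table, table_lock, pkt_table_lookup,
pkt_table_replace) are hypothetical, it is not Stephen's actual patch, and
it leaves the counter handling as a comment.

/* Sketch only: generic rwlock -> RCU conversion for a per-packet lookup
 * structure.  Names are hypothetical, not the real iptables ones. */
#include <linux/rcupdate.h>
#include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct pkt_table {
	unsigned int nrules;
	/* ... rules and per-CPU rule counters ... */
};

static struct pkt_table *cur_table;	/* current ruleset, published via RCU */
static DEFINE_SPINLOCK(table_lock);	/* serializes writers only */

/* Hot path, one call per packet.  rcu_read_lock() does not write any
 * shared cache line, so 32 cores no longer serialize on a single lock
 * word the way they do with read_lock_bh(). */
static unsigned int pkt_table_lookup(const struct sk_buff *skb)
{
	struct pkt_table *t;
	unsigned int verdict = 0;

	rcu_read_lock();
	t = rcu_dereference(cur_table);
	if (t) {
		/* ... walk t->rules against skb, bump per-CPU counters ... */
	}
	rcu_read_unlock();
	return verdict;
}

/* Cold path: swapping in a new ruleset (e.g. iptables-restore). */
static void pkt_table_replace(struct pkt_table *newt)
{
	struct pkt_table *old;

	spin_lock_bh(&table_lock);
	old = cur_table;
	rcu_assign_pointer(cur_table, newt);
	spin_unlock_bh(&table_lock);

	synchronize_rcu();	/* wait for in-flight lookups to finish */
	kfree(old);		/* old counters would be folded in here */
}

The trade-off is that replacing a table now costs a synchronize_rcu() plus
a counter-collection pass, which is acceptable for a path taken once per
iptables-restore but not once per packet.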
Thread overview: 42+ messages (newest: 2009-02-20 10:04 UTC)

2009-01-26 22:15 32 core net-next stack/netfilter "scaling" Rick Jones
2009-01-26 23:10 ` Eric Dumazet
2009-01-26 23:14 ` Stephen Hemminger
2009-01-26 23:19 ` Rick Jones
2009-01-27  9:10 ` Eric Dumazet
2009-01-27  9:15 ` Patrick McHardy
2009-01-27 11:29 ` Eric Dumazet
2009-01-27 11:37 ` Patrick McHardy
2009-01-27 16:23 ` Eric Dumazet
2009-01-27 17:33 ` Patrick McHardy
2009-01-27 18:02 ` Rick Jones
2009-01-27 19:09 ` Rick Jones
2009-01-27 19:24 ` Rick Jones
2009-01-27 22:17 ` Eric Dumazet
2009-01-27 22:29 ` Rick Jones
2009-01-27 22:34 ` Eric Dumazet
2009-01-27 22:43 ` Rick Jones
2009-01-28 13:55 ` Eric Dumazet
2009-01-28 16:25 ` Patrick McHardy
2009-01-28 17:07 ` Eric Dumazet
2009-01-28 17:34 ` Eric Dumazet
2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet
2009-01-30 15:47 ` Andi Kleen
2009-01-30 16:54 ` Eric Dumazet
2009-01-30 17:27 ` Andi Kleen
2009-01-30 17:27 ` Eric Dumazet
2009-01-30 17:50 ` Andi Kleen
2009-02-09 13:41 ` Patrick McHardy
2009-02-18 15:10 ` Eric Dumazet
2009-02-18 15:21 ` Patrick McHardy
2009-02-18 16:33 ` Eric Dumazet
2009-02-18 16:52 ` Patrick McHardy
2009-02-18 17:36 ` [PATCH] netfilter: xt_physdev fixes Eric Dumazet
2009-02-18 18:14 ` Patrick McHardy
2009-02-19  8:00 ` [PATCH] netfilter: unfold two loops in physdev_mt() Eric Dumazet
2009-02-19  8:14 ` [PATCH] netfilter: unfold two loops in ip6_packet_match() Eric Dumazet
2009-02-19 10:19 ` Patrick McHardy
2009-02-19 10:17 ` [PATCH] netfilter: unfold two loops in physdev_mt() Patrick McHardy
2009-02-20 10:02 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet
2009-02-20 10:04 ` Patrick McHardy
2009-02-09 14:57 ` 32 core net-next stack/netfilter "scaling" Patrick McHardy
2009-02-10 18:44 ` Stephen Hemminger