* 32 core net-next stack/netfilter "scaling"
@ 2009-01-26 22:15 Rick Jones
  2009-01-26 23:10 ` Eric Dumazet
  0 siblings, 1 reply; 42+ messages in thread
From: Rick Jones @ 2009-01-26 22:15 UTC (permalink / raw)
  To: Linux Network Development list, Netfilter Developers

Folks -

Under:

ftp://ftp.netperf.org/iptable_scaling

can be found netperf results and Caliper profiles for three scenarios on a
32-core, 1.6 GHz 'Montecito' rx8640 system.  An rx8640 is what HP calls a
"cell based" system, in that it is comprised of "cell boards" on which reside
CPU and memory resources.  In this case there are four cell boards, each with
four dual-core Montecito processors and 1/4 of the overall RAM.  The system
was configured with a mix of cell-local and globally interleaved memory,
where the global interleave is on a cacheline (128 byte) boundary (IIRC).
Total RAM in the system is 256 GB.  The cells are joined via cross-bar
connections.  (numactl --hardware output is available under the URL above)

There was an "I/O expander" connected to the system.  This meant there were
as many distinct PCI-X domains as there were cells, and every cell had a
"local" set of PCI-X slots.

Into those slots I placed four HP AD385A PCI-X 10Gbit Ethernet NICs - aka
Neterion XFrame IIs.  These were then connected to an HP ProCurve 5806
switch, which was in turn connected to three 4P/16C, 2.3 GHz HP DL585 G5s,
each of which had a pair of HP AD386A PCIe 10Gbit Ethernet NICs (aka Chelsio
T3C-based).  They were running RHEL 5.2 I think.  Each NIC was in either a
PCI-X 2.0 266 MHz slot (rx8640) or a PCIe 1.mumble x8 slot (DL585 G5).

The kernel is from DaveM's net-next tree ca. last week, multiq enabled.  The
s2io driver is Neterion's out-of-tree version 2.0.36.15914, to get multiq
support.  It was loaded into the kernel via:

insmod ./s2io.ko tx_steering_type=3 tx_fifo_num=8

There were then 8 tx queues and 8 rx queues per interface in the rx8640.  The
"setaffinity.txt" script was used to set the IRQ affinities to cores
"closest" to the physical NIC.  In all three tests all 32 cores went to 100%
utilization.  At least for all incense and porpoises.  (there was some
occasional idle reported by top on the full_iptables run)

A set of 64 concurrent "burst mode" netperf omni RR tests (tcp) with a burst
mode of 17 were run (ie 17 "transactions" outstanding on a connection at one
time), with TCP_NODELAY set, and the results gathered along with a set of
Caliper profiles.  The script used to launch these can be found in
"runemomniagg2.sh.txt" under the URL above.

I picked an "RR" test to maximize the trips up and down the stack while
minimizing the bandwidth consumed.

I picked a burst size of 16 because that was sufficient to saturate a single
core on the rx8640.

I picked 64 concurrent netperfs because I wanted to make sure I had enough
concurrent connections to get spread across all the cores/queues by the
algorithms in place.

I picked the combination of 64 and 16 rather than, say, 1024 and 0 (one tran
at a time) because I didn't want to run a context switching benchmark :)

The rx8640 was picked because it was available and I was confident it was not
going to have any hardware scaling issues getting in the way.  I wanted to
see SW issues, not HW issues.  I am ass-u-me-ing the rx8640 is a reasonable
analog for any "decent or better scaling" 32 core hardware, and that while
there are ia64-specific routines present in the profiles, they are there for
platform-independent reasons.
The no_iptables/ data was run after a fresh boot, with no iptables commands
run and so no iptables related modules loaded into the kernel.

The empty_iptables/ data was run after an "iptables --list" command, which
loaded one or two modules into the kernel.

The full_iptables/ data was run after an "iptables-restore" command pointed
at full_iptables/iptables.txt, which was created from what RH creates by
default when one enables the firewall via their installer, with a port range
added by me to allow pretty much anything netperf would ask.  As such, while
it does exercise netfilter functionality, I cannot make any claims as to its
"real world" applicability.  (while the firewall settings came from an RH
setup, FWIW, the base bits running on the rx8640 are Debian Lenny, with the
net-next kernel on top)

The "cycles" profile is able to grab flat profile hits while interrupts are
disabled, so it can see stuff happening while interrupts are disabled.  The
"scgprof" profile is an attempt to get some call graphs - it does not have
visibility into code running with interrupts disabled.  The "cache" profile
is a profile that looks to get some cache miss information.

So, having said all that, details can be found under the previously mentioned
URL.  Some quick highlights:

no_iptables - ~22000 transactions/s/netperf.  Top of the cycles profile looks
like:

Function Summary
-----------------------------------------------------------------------
 % Total
    IP     Cumulat      IP
 Samples    % of     Samples
  (ETB)     Total     (ETB)    Function                                File
-----------------------------------------------------------------------
   5.70      5.70      37772   s2io.ko::tx_intr_handler
   5.14     10.84      34012   vmlinux::__ia64_readq
   4.88     15.72      32285   s2io.ko::s2io_msix_ring_handle
   4.63     20.34      30625   s2io.ko::rx_intr_handler
   4.60     24.94      30429   s2io.ko::s2io_xmit
   3.85     28.79      25488   s2io.ko::s2io_poll_msix
   2.87     31.65      18987   vmlinux::dev_queue_xmit
   2.51     34.16      16620   vmlinux::tcp_sendmsg
   2.51     36.67      16588   vmlinux::tcp_ack
   2.15     38.82      14221   vmlinux::__inet_lookup_established
   2.10     40.92      13937   vmlinux::ia64_spinlock_contention

empty_iptables - ~12000 transactions/s/netperf.  Top of the cycles profile
looks like:

Function Summary
-----------------------------------------------------------------------
 % Total
    IP     Cumulat      IP
 Samples    % of     Samples
  (ETB)     Total     (ETB)    Function                                File
-----------------------------------------------------------------------
  26.38     26.38     137458   vmlinux::_read_lock_bh
  10.63     37.01      55388   vmlinux::local_bh_enable_ip
   3.42     40.43      17812   s2io.ko::tx_intr_handler
   3.01     43.44      15691   ip_tables.ko::ipt_do_table
   2.90     46.34      15100   vmlinux::__ia64_readq
   2.72     49.06      14179   s2io.ko::rx_intr_handler
   2.55     51.61      13288   s2io.ko::s2io_xmit
   1.98     53.59      10329   s2io.ko::s2io_msix_ring_handle
   1.75     55.34       9104   vmlinux::dev_queue_xmit
   1.64     56.98       8546   s2io.ko::s2io_poll_msix
   1.52     58.50       7943   vmlinux::sock_wfree
   1.40     59.91       7302   vmlinux::tcp_ack

full_iptables - some test instances didn't complete, I think they got
starved.  Of those which did complete, their performance ranged all the way
from 330 to 3100 transactions/s/netperf.
Top of the cycles profile looks like:

Function Summary
-----------------------------------------------------------------------
 % Total
    IP     Cumulat      IP
 Samples    % of     Samples
  (ETB)     Total     (ETB)    Function                                File
-----------------------------------------------------------------------
  64.71     64.71     582171   vmlinux::_write_lock_bh
  18.43     83.14     165822   vmlinux::ia64_spinlock_contention
   2.86     85.99      25709   nf_conntrack.ko::init_module
   2.36     88.35      21194   nf_conntrack.ko::tcp_packet
   1.78     90.13      16009   vmlinux::_spin_lock_bh
   1.20     91.33      10810   nf_conntrack.ko::nf_conntrack_in
   1.20     92.52      10755   vmlinux::nf_iterate
   1.09     93.62       9833   vmlinux::default_idle
   0.26     93.88       2331   vmlinux::__ia64_readq
   0.25     94.12       2213   vmlinux::__interrupt
   0.24     94.37       2203   s2io.ko::tx_intr_handler

Suggestions as to things to look at/with and/or patches to try are welcome.
I should have the HW available to me for at least a little while, but not
indefinitely.

rick jones

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling"
  2009-01-26 22:15 32 core net-next stack/netfilter "scaling" Rick Jones
@ 2009-01-26 23:10 ` Eric Dumazet
  2009-01-26 23:14   ` Stephen Hemminger
                     ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Eric Dumazet @ 2009-01-26 23:10 UTC (permalink / raw)
  To: Rick Jones
  Cc: Linux Network Development list, Netfilter Developers,
      Stephen Hemminger, Patrick McHardy

Rick Jones a écrit :
> [...]

Hi Rick, nice hardware you have :)

Stephen had a patch to nuke read_lock() from iptables, using RCU and
seqlocks.  I hit this contention point even with low-cost hardware and a
quite standard application.

I pinged him a few days ago to try to finish the job with him, but it seems
Stephen is busy at the moment.

Then conntrack (tcp sessions) is awful, since it uses a single rwlock_t
tcp_lock that must be write_locked() for basically every handled tcp frame...

How long is "not indefinitely" ?

^ permalink raw reply	[flat|nested] 42+ messages in thread
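As a picture of the pattern Eric is objecting to (a userspace caricature, not
the kernel code itself): per-connection state is private to each connection,
yet every update goes through one global writer lock, so packet processing on
all cores serializes on it.  The struct and function names below are invented
for the illustration; the later patches in this thread replace exactly this
arrangement.

```c
#include <pthread.h>
#include <stdio.h>

/* One process-wide lock guarding the tracking state of every connection,
 * mirroring the single rwlock_t tcp_lock described above. */
static pthread_rwlock_t tcp_lock = PTHREAD_RWLOCK_INITIALIZER;

struct conn_state {
	int tcp_state;          /* hypothetical per-connection field */
};

/* Called for basically every handled TCP frame: although each frame only
 * touches its own connection, every thread funnels through the same writer
 * lock, so the updates serialize globally. */
static void track_packet(struct conn_state *ct, int new_state)
{
	pthread_rwlock_wrlock(&tcp_lock);
	ct->tcp_state = new_state;
	pthread_rwlock_unlock(&tcp_lock);
}

int main(void)
{
	struct conn_state a = { 0 }, b = { 0 };

	track_packet(&a, 1);    /* unrelated connections... */
	track_packet(&b, 2);    /* ...still contend on the one lock */
	printf("states: %d %d\n", a.tcp_state, b.tcp_state);
	return 0;
}
```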
* Re: 32 core net-next stack/netfilter "scaling"
  2009-01-26 23:10 ` Eric Dumazet
@ 2009-01-26 23:14   ` Stephen Hemminger
  2009-01-26 23:19   ` Rick Jones
  2009-02-10 18:44   ` Stephen Hemminger
  2 siblings, 0 replies; 42+ messages in thread
From: Stephen Hemminger @ 2009-01-26 23:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, Linux Network Development list, Netfilter Developers,
      Patrick McHardy

On Tue, 27 Jan 2009 00:10:44 +0100
Eric Dumazet <dada1@cosmosbay.com> wrote:

> Rick Jones a écrit :
> > [...]
>
> Hi Rick, nice hardware you have :)
>
> Stephen had a patch to nuke read_lock() from iptables, using RCU and
> seqlocks.  I hit this contention point even with low-cost hardware and a
> quite standard application.
>
> I pinged him a few days ago to try to finish the job with him, but it seems
> Stephen is busy at the moment.
>
> Then conntrack (tcp sessions) is awful, since it uses a single rwlock_t
> tcp_lock that must be write_locked() for basically every handled tcp frame...
>
> How long is "not indefinitely" ?

Hey, I just got back from Linux Conf Au, haven't had time to catch up yet.
It is on my list, after dealing with the other work related stuff.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-26 23:10 ` Eric Dumazet 2009-01-26 23:14 ` Stephen Hemminger @ 2009-01-26 23:19 ` Rick Jones 2009-01-27 9:10 ` Eric Dumazet 2009-02-10 18:44 ` Stephen Hemminger 2 siblings, 1 reply; 42+ messages in thread From: Rick Jones @ 2009-01-26 23:19 UTC (permalink / raw) To: Eric Dumazet Cc: Linux Network Development list, Netfilter Developers, Stephen Hemminger, Patrick McHardy > > Hi Rick, nice hardware you have :) Every once in a while the cobbler's children get some shoes to wear :) > > Stephen had a patch to nuke read_lock() from iptables, using RCU and seqlocks. > I hit this contention point even with low cost hardware, and quite standard application. > > I pinged him few days ago to try to finish the job with him, but it seems Stephen > is busy at the moment. > > Then conntrack (tcp sessions) is awfull, since it uses a single rwlock_t tcp_lock > that must be write_locked() for basically every handled tcp frame... > > How long is "not indefinitely" ? The system I am using is being borrowed under an open ended loan. However, my use of it can be thought of as being the "null process" in VMS - once anyone else wants it I have to get off of it. That said, I would guess that the chances of someone else trying to get that system are pretty small for the next four+ weeks. I had a similar system (PCIe I/O rather than PCI-X) for quite a few weeks before it got pulled-out from under. rick ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-26 23:19 ` Rick Jones @ 2009-01-27 9:10 ` Eric Dumazet 2009-01-27 9:15 ` Patrick McHardy 0 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-27 9:10 UTC (permalink / raw) To: Rick Jones Cc: Linux Network Development list, Netfilter Developers, Stephen Hemminger, Patrick McHardy Rick Jones a écrit : >> >> Hi Rick, nice hardware you have :) > > Every once in a while the cobbler's children get some shoes to wear :) > >> >> Stephen had a patch to nuke read_lock() from iptables, using RCU and >> seqlocks. >> I hit this contention point even with low cost hardware, and quite >> standard application. >> >> I pinged him few days ago to try to finish the job with him, but it >> seems Stephen >> is busy at the moment. >> >> Then conntrack (tcp sessions) is awfull, since it uses a single >> rwlock_t tcp_lock >> that must be write_locked() for basically every handled tcp frame... >> >> How long is "not indefinitely" ? > > The system I am using is being borrowed under an open ended loan. > However, my use of it can be thought of as being the "null process" in > VMS - once anyone else wants it I have to get off of it. > > That said, I would guess that the chances of someone else trying to get > that system are pretty small for the next four+ weeks. I had a similar > system (PCIe I/O rather than PCI-X) for quite a few weeks before it got > pulled-out from under. > As Stephen will probably send the iptable part, I can do the tcp conntrack improvment. Let me know how it helps your last test (connection tracking enabled) Thanks [PATCH] netfilter: Get rid of central rwlock in tcp conntracking TCP connection tracking suffers of huge contention on a global rwlock, used to protect tcp conntracking state. As each tcp conntrack state have no relations between each others, we can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp" tcp_print_conntrack() dont need to lock anything to read ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Reported-by: Rick Jones <rick.jones2@hp.com> --- diff --git a/include/linux/netfilter/nf_conntrack_tcp.h b/include/linux/netfilter/nf_conntrack_tcp.h index a049df4..3dc72c6 100644 --- a/include/linux/netfilter/nf_conntrack_tcp.h +++ b/include/linux/netfilter/nf_conntrack_tcp.h @@ -50,6 +50,7 @@ struct ip_ct_tcp_state { struct ip_ct_tcp { + spinlock_t lock; struct ip_ct_tcp_state seen[2]; /* connection parameters per direction */ u_int8_t state; /* state of the connection (enum tcp_conntrack) */ /* For detecting stale connections */ diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c index a1edb9c..1bc8fc1 100644 --- a/net/netfilter/nf_conntrack_proto_tcp.c +++ b/net/netfilter/nf_conntrack_proto_tcp.c @@ -26,9 +26,6 @@ #include <net/netfilter/nf_conntrack_ecache.h> #include <net/netfilter/nf_log.h> -/* Protects ct->proto.tcp */ -static DEFINE_RWLOCK(tcp_lock); - /* "Be conservative in what you do, be liberal in what you accept from others." If it's non-zero, we mark only out of window RST segments as INVALID. 
*/ @@ -297,9 +294,7 @@ static int tcp_print_conntrack(struct seq_file *s, const struct nf_conn *ct) { enum tcp_conntrack state; - read_lock_bh(&tcp_lock); state = ct->proto.tcp.state; - read_unlock_bh(&tcp_lock); return seq_printf(s, "%s ", tcp_conntrack_names[state]); } @@ -705,14 +700,14 @@ void nf_conntrack_tcp_update(const struct sk_buff *skb, end = segment_seq_plus_len(ntohl(tcph->seq), skb->len, dataoff, tcph); - write_lock_bh(&tcp_lock); + spin_lock_bh(&ct->proto.lock); /* * We have to worry for the ack in the reply packet only... */ if (after(end, ct->proto.tcp.seen[dir].td_end)) ct->proto.tcp.seen[dir].td_end = end; ct->proto.tcp.last_end = end; - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.lock); pr_debug("tcp_update: sender end=%u maxend=%u maxwin=%u scale=%i " "receiver end=%u maxend=%u maxwin=%u scale=%i\n", sender->td_end, sender->td_maxend, sender->td_maxwin, @@ -821,7 +816,7 @@ static int tcp_packet(struct nf_conn *ct, th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph); BUG_ON(th == NULL); - write_lock_bh(&tcp_lock); + spin_lock_bh(&ct->proto.tcp.lock); old_state = ct->proto.tcp.state; dir = CTINFO2DIR(ctinfo); index = get_conntrack_index(th); @@ -851,7 +846,7 @@ static int tcp_packet(struct nf_conn *ct, && ct->proto.tcp.last_index == TCP_RST_SET)) { /* Attempt to reopen a closed/aborted connection. * Delete this connection and look up again. */ - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); /* Only repeat if we can actually remove the timer. * Destruction may already be in progress in process @@ -887,7 +882,7 @@ static int tcp_packet(struct nf_conn *ct, * that the client cannot but retransmit its SYN and * thus initiate a clean new session. */ - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: killing out of sync session "); @@ -900,7 +895,7 @@ static int tcp_packet(struct nf_conn *ct, ct->proto.tcp.last_end = segment_seq_plus_len(ntohl(th->seq), skb->len, dataoff, th); - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: invalid packet ignored "); @@ -909,7 +904,7 @@ static int tcp_packet(struct nf_conn *ct, /* Invalid packet */ pr_debug("nf_ct_tcp: Invalid dir=%i index=%u ostate=%u\n", dir, get_conntrack_index(th), old_state); - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: invalid state "); @@ -940,7 +935,7 @@ static int tcp_packet(struct nf_conn *ct, if (!tcp_in_window(ct, &ct->proto.tcp, dir, index, skb, dataoff, th, pf)) { - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); return -NF_ACCEPT; } in_window: @@ -969,7 +964,7 @@ static int tcp_packet(struct nf_conn *ct, timeout = nf_ct_tcp_timeout_unacknowledged; else timeout = tcp_timeouts[new_state]; - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); nf_conntrack_event_cache(IPCT_PROTOINFO_VOLATILE, ct); if (new_state != old_state) @@ -1022,6 +1017,7 @@ static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb, pr_debug("nf_ct_tcp: invalid new deleting.\n"); return false; } + spin_lock_init(&ct->proto.tcp.lock); if (new_state == TCP_CONNTRACK_SYN_SENT) { /* SYN packet */ @@ -1092,7 +1088,7 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla, struct nlattr *nest_parms; struct nf_ct_tcp_flags tmp = 
{}; - read_lock_bh(&tcp_lock); + spin_lock_bh(&ct->proto.tcp.lock); nest_parms = nla_nest_start(skb, CTA_PROTOINFO_TCP | NLA_F_NESTED); if (!nest_parms) goto nla_put_failure; @@ -1112,14 +1108,14 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla, tmp.flags = ct->proto.tcp.seen[1].flags; NLA_PUT(skb, CTA_PROTOINFO_TCP_FLAGS_REPLY, sizeof(struct nf_ct_tcp_flags), &tmp); - read_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); nla_nest_end(skb, nest_parms); return 0; nla_put_failure: - read_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); return -1; } @@ -1150,7 +1146,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct) nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]) >= TCP_CONNTRACK_MAX) return -EINVAL; - write_lock_bh(&tcp_lock); + spin_lock_bh(&ct->proto.tcp.lock); if (tb[CTA_PROTOINFO_TCP_STATE]) ct->proto.tcp.state = nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]); @@ -1177,7 +1173,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct) ct->proto.tcp.seen[1].td_scale = nla_get_u8(tb[CTA_PROTOINFO_TCP_WSCALE_REPLY]); } - write_unlock_bh(&tcp_lock); + spin_unlock_bh(&ct->proto.tcp.lock); return 0; } -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling"
  2009-01-27 9:10 ` Eric Dumazet
@ 2009-01-27 9:15   ` Patrick McHardy
  2009-01-27 11:29     ` Eric Dumazet
  2009-01-27 16:23     ` Eric Dumazet
  0 siblings, 2 replies; 42+ messages in thread
From: Patrick McHardy @ 2009-01-27 9:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, Linux Network Development list, Netfilter Developers,
      Stephen Hemminger

Eric Dumazet wrote:
> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking
>
> TCP connection tracking suffers of huge contention on a global rwlock,
> used to protect tcp conntracking state.
> As each tcp conntrack state have no relations between each others, we
> can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp"
>
> tcp_print_conntrack() dont need to lock anything to read
> ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well.

That's an interesting test-case, but one lock per conntrack just for
TCP tracking seems like overkill.  We're trying to keep the conntrack
structures as small as possible, so I'd prefer an array of spinlocks
or something like that.

^ permalink raw reply	[flat|nested] 42+ messages in thread
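To make the array-of-spinlocks suggestion concrete before Eric's revised
patch below: instead of one lock per object (or one global lock, as sketched
earlier in the thread), each object is mapped onto one lock from a fixed pool
by hashing its address, so the per-object structure needs no extra lock
storage.  A minimal userspace sketch, with pthread mutexes standing in for
kernel spinlocks and all names and sizes invented for the illustration:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define LOCK_POOL_SZ 512                 /* fixed pool shared by all objects */

static pthread_mutex_t lock_pool[LOCK_POOL_SZ];

/* Hypothetical per-connection state; the point is that it carries no lock. */
struct conn_state {
	int tcp_state;
};

/* Map an object's address to one lock in the pool.  The xor-shift folds
 * higher address bits into the low bits so that objects allocated close
 * together are spread over different locks. */
static pthread_mutex_t *conn_lock(const struct conn_state *ct)
{
	uintptr_t h = (uintptr_t)ct;

	h ^= h >> 16;
	h ^= h >> 8;
	return &lock_pool[h % LOCK_POOL_SZ];
}

static void conn_set_state(struct conn_state *ct, int new_state)
{
	pthread_mutex_t *l = conn_lock(ct);

	pthread_mutex_lock(l);           /* contends only with objects hashing
					  * to the same slot, not globally */
	ct->tcp_state = new_state;
	pthread_mutex_unlock(l);
}

int main(void)
{
	struct conn_state a, b;

	for (int i = 0; i < LOCK_POOL_SZ; i++)
		pthread_mutex_init(&lock_pool[i], NULL);

	conn_set_state(&a, 1);
	conn_set_state(&b, 2);
	printf("a and b use %s lock\n",
	       conn_lock(&a) == conn_lock(&b) ? "the same" : "different");
	return 0;
}
```

The trade-off is the one discussed above: no growth of the per-connection
structure, at the cost of occasional false contention when two unrelated
connections happen to hash to the same pool slot.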
* Re: 32 core net-next stack/netfilter "scaling"
  2009-01-27 9:15 ` Patrick McHardy
@ 2009-01-27 11:29   ` Eric Dumazet
  2009-01-27 11:37     ` Patrick McHardy
  0 siblings, 1 reply; 42+ messages in thread
From: Eric Dumazet @ 2009-01-27 11:29 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Rick Jones, Linux Network Development list, Netfilter Developers,
      Stephen Hemminger

Patrick McHardy a écrit :
> Eric Dumazet wrote:
>> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking
>>
>> TCP connection tracking suffers of huge contention on a global rwlock,
>> used to protect tcp conntracking state.
>> As each tcp conntrack state have no relations between each others, we
>> can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp"
>>
>> tcp_print_conntrack() dont need to lock anything to read
>> ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well.
>
> That's an interesting test-case, but one lock per conntrack just for
> TCP tracking seems like overkill.  We're trying to keep the conntrack
> structures as small as possible, so I'd prefer an array of spinlocks
> or something like that.

Yes, this is wise.  Current sizeof(struct nf_conn) is 220 (0xdc) on 32 bits,
probably rounded to 0xE0 by SLAB/SLUB.  I will provide a new patch using an
array of, say, 512 spinlocks.  (512 spinlocks use 2048 bytes with
non-debugging spinlocks; that spreads over 32 x 64-byte cache lines)

However, I wonder if, for a very large number of cpus, we should at least ask
conntrack to use a hardware-aligned "struct nf_conn" to avoid false sharing.

We might also use a generic SLAB_HWCACHE_ALIGN_IFMANYCPUS flag if the same
tactic could help other kmem_cache_create() users.

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 90ce9dd..82332ce 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1167,8 +1167,10 @@ static int nf_conntrack_init_init_net(void)
 		 nf_conntrack_max);
 
 	nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
-						sizeof(struct nf_conn),
-						0, 0, NULL);
+						sizeof(struct nf_conn), 0,
+						num_possible_cpus() >= 32 ?
+							SLAB_HWCACHE_ALIGN : 0,
+						NULL);
 	if (!nf_conntrack_cachep) {
 		printk(KERN_ERR "Unable to create nf_conn slab cache\n");
 		ret = -ENOMEM;
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling"
  2009-01-27 11:29 ` Eric Dumazet
@ 2009-01-27 11:37   ` Patrick McHardy
  0 siblings, 0 replies; 42+ messages in thread
From: Patrick McHardy @ 2009-01-27 11:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, Linux Network Development list, Netfilter Developers,
      Stephen Hemminger

Eric Dumazet wrote:
> Patrick McHardy a écrit :
>
>> That's an interesting test-case, but one lock per conntrack just for
>> TCP tracking seems like overkill.  We're trying to keep the conntrack
>> structures as small as possible, so I'd prefer an array of spinlocks
>> or something like that.
>
> Yes, this is wise.  Current sizeof(struct nf_conn) is 220 (0xdc) on 32 bits,
> probably rounded to 0xE0 by SLAB/SLUB.  I will provide a new patch using an
> array of, say, 512 spinlocks.  (512 spinlocks use 2048 bytes with
> non-debugging spinlocks; that spreads over 32 x 64-byte cache lines)

Sounds good, but it should be limited to NR_CPUS I guess.

> However, I wonder if, for a very large number of cpus, we should at least
> ask conntrack to use a hardware-aligned "struct nf_conn" to avoid false
> sharing.

I'm not sure that is really a problem in practice; you usually have quite a
few inactive conntrack entries, and false sharing would only happen when two
consecutive entries are active.

^ permalink raw reply	[flat|nested] 42+ messages in thread
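For readers unfamiliar with the false-sharing concern being debated here, the
effect itself is easy to reproduce in userspace; the sketch below is purely
illustrative (it has nothing to do with nf_conn itself) and times two threads
incrementing counters that either share a cache line or sit on separate ones.
Whether adjacent conntrack entries are hot at the same time in practice is
exactly the open question above.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

struct shared {                       /* both counters in one cache line */
	volatile unsigned long a;
	volatile unsigned long b;
};

struct padded {                       /* counters forced onto separate lines */
	volatile unsigned long a;
	char pad[128];                /* assumes cache lines of at most 128 B */
	volatile unsigned long b;
};

static void *bump(void *p)
{
	volatile unsigned long *c = p;

	for (unsigned long i = 0; i < ITERS; i++)
		(*c)++;
	return NULL;
}

static double run(volatile unsigned long *x, volatile unsigned long *y)
{
	pthread_t t1, t2;
	struct timespec s, e;

	clock_gettime(CLOCK_MONOTONIC, &s);
	pthread_create(&t1, NULL, bump, (void *)x);
	pthread_create(&t2, NULL, bump, (void *)y);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	clock_gettime(CLOCK_MONOTONIC, &e);
	return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void)
{
	static struct shared s;
	static struct padded p;

	printf("same cache line : %.2f s\n", run(&s.a, &s.b));
	printf("separate lines  : %.2f s\n", run(&p.a, &p.b));
	return 0;
}
```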
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 9:15 ` Patrick McHardy 2009-01-27 11:29 ` Eric Dumazet @ 2009-01-27 16:23 ` Eric Dumazet 2009-01-27 17:33 ` Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-27 16:23 UTC (permalink / raw) To: Patrick McHardy Cc: Rick Jones, Linux Network Development list, Netfilter Developers, Stephen Hemminger Patrick McHardy a écrit : > Eric Dumazet wrote: >> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking >> >> TCP connection tracking suffers of huge contention on a global rwlock, >> used to protect tcp conntracking state. >> As each tcp conntrack state have no relations between each others, we >> can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp" >> >> tcp_print_conntrack() dont need to lock anything to read >> ct->proto.tcp.state, >> so speedup /proc/net/ip_conntrack as well. > > Thats an interesting test-case, but one lock per conntrack just for > TCP tracking seems like overkill. We're trying to keep the conntrack > stuctures as small as possible, so I'd prefer an array of spinlocks > or something like that. Please find a new version of patch, using an array of spinlocks. I chose to define an array at nf_conntrack level, since tcp conntracking is one user of this array but we might find other users of this array. Thanks [PATCH] netfilter: Get rid of central rwlock in tcp conntracking TCP connection tracking suffers of huge contention on a global rwlock, used to protect tcp conntracking state. As each tcp conntrack state have no relations between each others, we can switch to fine grained lock. Using an array of spinlocks avoids enlarging size of connection tracking structures, yet giving reasonable fanout. tcp_print_conntrack() doesnt need to lock anything to read ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Reported-by: Rick Jones <rick.jones2@hp.com> --- include/net/netfilter/nf_conntrack.h | 32 +++++++++++++++++++++ net/netfilter/nf_conntrack_core.c | 10 ++++-- net/netfilter/nf_conntrack_proto_tcp.c | 34 ++++++++++------------- 3 files changed, 53 insertions(+), 23 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 2e0c536..288aff5 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -129,6 +129,38 @@ struct nf_conn struct rcu_head rcu; }; +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \ + defined(CONFIG_PROVE_LOCKING) + +/* + * We reserve 16 locks per cpu (typical cache line size is 64 bytes) + * maxed to 512 locks so that ct_hlock[] uses at most 2048 bytes of memory. 
+ * (on lockdep we have a quite big spinlock_t, so keep the size down there) + */ +#ifdef CONFIG_LOCKDEP +#define CT_HASH_LOCK_SZ 64 +#elif NR_CPUS >= 32 +#define CT_HASH_LOCK_SZ 512 +#else +#define CT_HASH_LOCK_SZ (roundup_pow_of_two(NR_CPUS) * 16) +#endif + +extern spinlock_t ct_hlock[CT_HASH_LOCK_SZ]; + +#else +#define CT_HASH_LOCK_SZ 0 +#endif +static inline spinlock_t *ct_lock_addr(const struct nf_conn *ct) +{ + if (CT_HASH_LOCK_SZ) { + unsigned long hash = (unsigned long)ct; + hash ^= hash >> 16; + hash ^= hash >> 8; + return &ct_hlock[hash % CT_HASH_LOCK_SZ]; + } + return NULL; +} + static inline struct nf_conn * nf_ct_tuplehash_to_ctrack(const struct nf_conntrack_tuple_hash *hash) { diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 90ce9dd..68822d8 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -61,9 +61,9 @@ struct nf_conn nf_conntrack_untracked __read_mostly; EXPORT_SYMBOL_GPL(nf_conntrack_untracked); static struct kmem_cache *nf_conntrack_cachep __read_mostly; - -static int nf_conntrack_hash_rnd_initted; -static unsigned int nf_conntrack_hash_rnd; +spinlock_t ct_hlock[CT_HASH_LOCK_SZ]; +static int nf_conntrack_hash_rnd_initted __read_mostly; +static unsigned int nf_conntrack_hash_rnd __read_mostly; static u_int32_t __hash_conntrack(const struct nf_conntrack_tuple *tuple, unsigned int size, unsigned int rnd) @@ -1141,7 +1141,7 @@ module_param_call(hashsize, nf_conntrack_set_hashsize, param_get_uint, static int nf_conntrack_init_init_net(void) { int max_factor = 8; - int ret; + int i, ret; /* Idea from tcp.c: use 1/16384 of memory. On i386: 32MB * machine has 512 buckets. >= 1GB machines have 16384 buckets. */ @@ -1174,6 +1174,8 @@ static int nf_conntrack_init_init_net(void) ret = -ENOMEM; goto err_cache; } + for (i = 0; i < CT_HASH_LOCK_SZ; i++) + spin_lock_init(&ct_hlock[i]); ret = nf_conntrack_proto_init(); if (ret < 0) diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c index a1edb9c..247e82f 100644 --- a/net/netfilter/nf_conntrack_proto_tcp.c +++ b/net/netfilter/nf_conntrack_proto_tcp.c @@ -26,9 +26,6 @@ #include <net/netfilter/nf_conntrack_ecache.h> #include <net/netfilter/nf_log.h> -/* Protects ct->proto.tcp */ -static DEFINE_RWLOCK(tcp_lock); - /* "Be conservative in what you do, be liberal in what you accept from others." If it's non-zero, we mark only out of window RST segments as INVALID. */ @@ -297,9 +294,7 @@ static int tcp_print_conntrack(struct seq_file *s, const struct nf_conn *ct) { enum tcp_conntrack state; - read_lock_bh(&tcp_lock); state = ct->proto.tcp.state; - read_unlock_bh(&tcp_lock); return seq_printf(s, "%s ", tcp_conntrack_names[state]); } @@ -705,14 +700,14 @@ void nf_conntrack_tcp_update(const struct sk_buff *skb, end = segment_seq_plus_len(ntohl(tcph->seq), skb->len, dataoff, tcph); - write_lock_bh(&tcp_lock); + spin_lock_bh(ct_lock_addr(ct)); /* * We have to worry for the ack in the reply packet only... 
*/ if (after(end, ct->proto.tcp.seen[dir].td_end)) ct->proto.tcp.seen[dir].td_end = end; ct->proto.tcp.last_end = end; - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); pr_debug("tcp_update: sender end=%u maxend=%u maxwin=%u scale=%i " "receiver end=%u maxend=%u maxwin=%u scale=%i\n", sender->td_end, sender->td_maxend, sender->td_maxwin, @@ -821,7 +816,7 @@ static int tcp_packet(struct nf_conn *ct, th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph); BUG_ON(th == NULL); - write_lock_bh(&tcp_lock); + spin_lock_bh(ct_lock_addr(ct)); old_state = ct->proto.tcp.state; dir = CTINFO2DIR(ctinfo); index = get_conntrack_index(th); @@ -851,7 +846,7 @@ static int tcp_packet(struct nf_conn *ct, && ct->proto.tcp.last_index == TCP_RST_SET)) { /* Attempt to reopen a closed/aborted connection. * Delete this connection and look up again. */ - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); /* Only repeat if we can actually remove the timer. * Destruction may already be in progress in process @@ -887,7 +882,7 @@ static int tcp_packet(struct nf_conn *ct, * that the client cannot but retransmit its SYN and * thus initiate a clean new session. */ - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: killing out of sync session "); @@ -900,7 +895,7 @@ static int tcp_packet(struct nf_conn *ct, ct->proto.tcp.last_end = segment_seq_plus_len(ntohl(th->seq), skb->len, dataoff, th); - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: invalid packet ignored "); @@ -909,7 +904,7 @@ static int tcp_packet(struct nf_conn *ct, /* Invalid packet */ pr_debug("nf_ct_tcp: Invalid dir=%i index=%u ostate=%u\n", dir, get_conntrack_index(th), old_state); - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); if (LOG_INVALID(net, IPPROTO_TCP)) nf_log_packet(pf, 0, skb, NULL, NULL, NULL, "nf_ct_tcp: invalid state "); @@ -940,7 +935,7 @@ static int tcp_packet(struct nf_conn *ct, if (!tcp_in_window(ct, &ct->proto.tcp, dir, index, skb, dataoff, th, pf)) { - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); return -NF_ACCEPT; } in_window: @@ -969,7 +964,7 @@ static int tcp_packet(struct nf_conn *ct, timeout = nf_ct_tcp_timeout_unacknowledged; else timeout = tcp_timeouts[new_state]; - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); nf_conntrack_event_cache(IPCT_PROTOINFO_VOLATILE, ct); if (new_state != old_state) @@ -1022,6 +1017,7 @@ static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb, pr_debug("nf_ct_tcp: invalid new deleting.\n"); return false; } + spin_lock_init(ct_lock_addr(ct)); if (new_state == TCP_CONNTRACK_SYN_SENT) { /* SYN packet */ @@ -1092,7 +1088,7 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla, struct nlattr *nest_parms; struct nf_ct_tcp_flags tmp = {}; - read_lock_bh(&tcp_lock); + spin_lock_bh(ct_lock_addr(ct)); nest_parms = nla_nest_start(skb, CTA_PROTOINFO_TCP | NLA_F_NESTED); if (!nest_parms) goto nla_put_failure; @@ -1112,14 +1108,14 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla, tmp.flags = ct->proto.tcp.seen[1].flags; NLA_PUT(skb, CTA_PROTOINFO_TCP_FLAGS_REPLY, sizeof(struct nf_ct_tcp_flags), &tmp); - read_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); nla_nest_end(skb, nest_parms); return 0; nla_put_failure: - read_unlock_bh(&tcp_lock); + 
spin_unlock_bh(ct_lock_addr(ct)); return -1; } @@ -1150,7 +1146,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct) nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]) >= TCP_CONNTRACK_MAX) return -EINVAL; - write_lock_bh(&tcp_lock); + spin_lock_bh(ct_lock_addr(ct)); if (tb[CTA_PROTOINFO_TCP_STATE]) ct->proto.tcp.state = nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]); @@ -1177,7 +1173,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct) ct->proto.tcp.seen[1].td_scale = nla_get_u8(tb[CTA_PROTOINFO_TCP_WSCALE_REPLY]); } - write_unlock_bh(&tcp_lock); + spin_unlock_bh(ct_lock_addr(ct)); return 0; } ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 16:23 ` Eric Dumazet @ 2009-01-27 17:33 ` Patrick McHardy 2009-01-27 18:02 ` Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Patrick McHardy @ 2009-01-27 17:33 UTC (permalink / raw) To: Eric Dumazet Cc: Rick Jones, Linux Network Development list, Netfilter Developers, Stephen Hemminger Eric Dumazet wrote: > [PATCH] netfilter: Get rid of central rwlock in tcp conntracking > > TCP connection tracking suffers of huge contention on a global rwlock, > used to protect tcp conntracking state. > > As each tcp conntrack state have no relations between each others, we > can switch to fine grained lock. Using an array of spinlocks avoids > enlarging size of connection tracking structures, yet giving reasonable > fanout. > > tcp_print_conntrack() doesnt need to lock anything to read > ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. > > nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly > This looks good to me. Rick, would you like to give it a try? I'll convert the remaining conntrack protocols when applying it. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 17:33 ` Patrick McHardy @ 2009-01-27 18:02 ` Rick Jones 2009-01-27 19:09 ` Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Rick Jones @ 2009-01-27 18:02 UTC (permalink / raw) To: Patrick McHardy Cc: Eric Dumazet, Linux Network Development list, Netfilter Developers, Stephen Hemminger Patrick McHardy wrote: > Eric Dumazet wrote: > >>[PATCH] netfilter: Get rid of central rwlock in tcp conntracking >> >>TCP connection tracking suffers of huge contention on a global rwlock, >>used to protect tcp conntracking state. >> >>As each tcp conntrack state have no relations between each others, we >>can switch to fine grained lock. Using an array of spinlocks avoids >>enlarging size of connection tracking structures, yet giving reasonable >>fanout. >> >>tcp_print_conntrack() doesnt need to lock anything to read >>ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. >> >>nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly >> > > > This looks good to me. Rick, would you like to give it a try? > > I'll convert the remaining conntrack protocols when applying it. I will give it a try and let folks know the results - unless told otherwise, I will ass-u-me I only need rerun the "full_iptables" test case. rick jones ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 18:02 ` Rick Jones @ 2009-01-27 19:09 ` Rick Jones 2009-01-27 19:24 ` Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Rick Jones @ 2009-01-27 19:09 UTC (permalink / raw) To: Rick Jones Cc: Patrick McHardy, Eric Dumazet, Linux Network Development list, Netfilter Developers, Stephen Hemminger Rick Jones wrote: > Patrick McHardy wrote: > >> Eric Dumazet wrote: >> >>> [PATCH] netfilter: Get rid of central rwlock in tcp conntracking >>> >>> TCP connection tracking suffers of huge contention on a global rwlock, >>> used to protect tcp conntracking state. >>> >>> As each tcp conntrack state have no relations between each others, we >>> can switch to fine grained lock. Using an array of spinlocks avoids >>> enlarging size of connection tracking structures, yet giving reasonable >>> fanout. >>> >>> tcp_print_conntrack() doesnt need to lock anything to read >>> ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. >>> >>> nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared >>> read_mostly >>> >> >> >> >> This looks good to me. Rick, would you like to give it a try? >> >> I'll convert the remaining conntrack protocols when applying it. > > > I will give it a try and let folks know the results - unless told > otherwise, I will ass-u-me I only need rerun the "full_iptables" test case. The runemomniagg2.sh script is still running, but the initial cycles profile suggests that the main change is converting the write_lock time into spinlock contention time with 78.39% of the cycles spent in ia64_spinlock_contention. When the script completes I'll upload the profiles and the netperf results to the same base URL as in the basenote under "contrack01/" rick jones ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 19:09 ` Rick Jones @ 2009-01-27 19:24 ` Rick Jones 2009-01-27 22:17 ` Eric Dumazet 2009-01-28 13:55 ` Eric Dumazet 0 siblings, 2 replies; 42+ messages in thread From: Rick Jones @ 2009-01-27 19:24 UTC (permalink / raw) To: Netfilter Developers Cc: Patrick McHardy, Eric Dumazet, Linux Network Development list, Stephen Hemminger >> I will give it a try and let folks know the results - unless told >> otherwise, I will ass-u-me I only need rerun the "full_iptables" test >> case. > > > The runemomniagg2.sh script is still running, but the initial cycles > profile suggests that the main change is converting the write_lock time > into spinlock contention time with 78.39% of the cycles spent in > ia64_spinlock_contention. When the script completes I'll upload the > profiles and the netperf results to the same base URL as in the basenote > under "contrack01/" The script completed - although at some point I hit an fd limit - I think I have an fd leak in netperf somewhere :( . Anyhow, there are still some netperfs that end-up kicking the bucket during the run - I suspect starvation because where in the other configs (no iptables, and empty iptables) each netperf seems to consume about 50% of a CPU - stands to reason - 64 netperfs, 32 cores - in the "full" case I see many netperfs consuming 100% of a CPU. My gut is thinking that one or more netperf contexts gets stuck doing something on behalf of others. There is also ksoftirqd time for a few of those processes. Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which does represent an improvement. rick jones PS - just to be certain that running-out of fd's didn't skew the results I'm rerunning the script with ulimit -n 10240 and will see if that changes the results any. And I suppose I need to go fd leak hunting in netperf omni code :( ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 19:24 ` Rick Jones @ 2009-01-27 22:17 ` Eric Dumazet 2009-01-27 22:29 ` Rick Jones 2009-01-28 13:55 ` Eric Dumazet 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-27 22:17 UTC (permalink / raw) To: Rick Jones Cc: Netfilter Developers, Patrick McHardy, Linux Network Development list, Stephen Hemminger Rick Jones a écrit : >>> I will give it a try and let folks know the results - unless told >>> otherwise, I will ass-u-me I only need rerun the "full_iptables" >>> test case. >> >> >> The runemomniagg2.sh script is still running, but the initial cycles >> profile suggests that the main change is converting the write_lock >> time into spinlock contention time with 78.39% of the cycles spent in >> ia64_spinlock_contention. When the script completes I'll upload the >> profiles and the netperf results to the same base URL as in the >> basenote under "contrack01/" > > The script completed - although at some point I hit an fd limit - I > think I have an fd leak in netperf somewhere :( . > > Anyhow, there are still some netperfs that end-up kicking the bucket > during the run - I suspect starvation because where in the other > configs (no iptables, and empty iptables) each netperf seems to > consume about 50% of a CPU - stands to reason - 64 netperfs, 32 cores > - in the "full" case I see many netperfs consuming 100% of a CPU. My > gut is thinking that one or more netperf contexts gets stuck doing > something on behalf of others. There is also ksoftirqd time for a few > of those processes. > > Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which > does represent an improvement. > > rick jones > > PS - just to be certain that running-out of fd's didn't skew the > results I'm rerunning the script with ulimit -n 10240 and will see if > that changes the results any. And I suppose I need to go fd leak > hunting in netperf omni code :( > -- > Thanks for the report If you have so much contention on spinlocks, maybe hash function is not good at all... hash = (unsigned long)ct; hash ^= hash >> 16; hash ^= hash >> 8; I ass-u-me you compiled your kernel with NR_CPUS=32 ? -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 22:17 ` Eric Dumazet @ 2009-01-27 22:29 ` Rick Jones 2009-01-27 22:34 ` Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Rick Jones @ 2009-01-27 22:29 UTC (permalink / raw) To: Eric Dumazet Cc: Netfilter Developers, Patrick McHardy, Linux Network Development list, Stephen Hemminger > Thanks for the report > > If you have so much contention on spinlocks, maybe hash function is not > good at all... > > hash = (unsigned long)ct; > hash ^= hash >> 16; > hash ^= hash >> 8; > > I ass-u-me you compiled your kernel with NR_CPUS=32 ? I believe so - CONFIG_NR_CPUS in .config is 64 anyway. If there is a more definitive place to check I'd be happy to look. rick ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 22:29 ` Rick Jones @ 2009-01-27 22:34 ` Eric Dumazet 2009-01-27 22:43 ` Rick Jones 0 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-27 22:34 UTC (permalink / raw) To: Rick Jones Cc: Netfilter Developers, Patrick McHardy, Linux Network Development list, Stephen Hemminger Rick Jones a écrit : > >> Thanks for the report >> >> If you have so much contention on spinlocks, maybe hash function is >> not good at all... >> >> hash = (unsigned long)ct; >> hash ^= hash >> 16; >> hash ^= hash >> 8; >> >> I ass-u-me you compiled your kernel with NR_CPUS=32 ? > > I believe so - CONFIG_NR_CPUS in .config is 64 anyway. If there is a > more definitive place to check I'd be happy to look. > > rick > > By any chance, are you using SLAB instead of SLUB ? While running your netperf session, issuing : grep conntrack /proc/slabinfo Would help to know hash dispersion on your machine. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 22:34 ` Eric Dumazet @ 2009-01-27 22:43 ` Rick Jones 0 siblings, 0 replies; 42+ messages in thread From: Rick Jones @ 2009-01-27 22:43 UTC (permalink / raw) To: Eric Dumazet Cc: Netfilter Developers, Patrick McHardy, Linux Network Development list, Stephen Hemminger > > By any chance, are you using SLAB instead of SLUB ? Seems that way: sut42:~/net-next-2.6# grep SLAB .config CONFIG_SLAB=y CONFIG_SLABINFO=y # CONFIG_DEBUG_SLAB is not set sut42:~/net-next-2.6# grep SLUB .config # CONFIG_SLUB is not set The config file should be up under the base URL: ftp://ftp.netperf.org/iptable_scaling The system started as a Debian Lenny with a 2.6.26 kernel, and I did a make oldconfig and mostly took defaults (save multiq) > While running your netperf session, issuing : > grep conntrack /proc/slabinfo > > Would help to know hash dispersion on your machine. sut42:~/net-next-2.6# grep conntrack /proc/slabinfo nf_conntrack_expect 0 0 240 66 1 : tunables 120 60 8 : slabdata 0 0 0 nf_conntrack 276 742 304 53 1 : tunables 54 27 8 : slabdata 14 14 0 about a minute later it is: grep conntrack /proc/slabinfo nf_conntrack_expect 0 0 240 66 1 : tunables 120 60 8 : slabdata 0 0 0 nf_conntrack 315 742 304 53 1 : tunables 54 27 8 : slabdata 14 14 0 after about another 40 to 60 seconds it was: grep conntrack /proc/slabinfo nf_conntrack_expect 0 0 240 66 1 : tunables 120 60 8 : slabdata 0 0 0 nf_conntrack 314 742 304 53 1 : tunables 54 27 8 : slabdata 14 14 7 ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-27 19:24 ` Rick Jones 2009-01-27 22:17 ` Eric Dumazet @ 2009-01-28 13:55 ` Eric Dumazet 2009-01-28 16:25 ` Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-28 13:55 UTC (permalink / raw) To: Rick Jones Cc: Netfilter Developers, Patrick McHardy, Linux Network Development list, Stephen Hemminger Rick Jones a écrit : >>> I will give it a try and let folks know the results - unless told >>> otherwise, I will ass-u-me I only need rerun the "full_iptables" test >>> case. >> >> >> The runemomniagg2.sh script is still running, but the initial cycles >> profile suggests that the main change is converting the write_lock >> time into spinlock contention time with 78.39% of the cycles spent in >> ia64_spinlock_contention. When the script completes I'll upload the >> profiles and the netperf results to the same base URL as in the >> basenote under "contrack01/" > > The script completed - although at some point I hit an fd limit - I > think I have an fd leak in netperf somewhere :( . > > Anyhow, there are still some netperfs that end-up kicking the bucket > during the run - I suspect starvation because where in the other configs > (no iptables, and empty iptables) each netperf seems to consume about > 50% of a CPU - stands to reason - 64 netperfs, 32 cores - in the "full" > case I see many netperfs consuming 100% of a CPU. My gut is thinking > that one or more netperf contexts gets stuck doing something on behalf > of others. There is also ksoftirqd time for a few of those processes. > > Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which > does represent an improvement. > Yes indeed you have a speedup, tcp conntracking is OK. You now hit the nf_conntrack_lock spinlock we have in generic conntrack code (net/netfilter/nf_conntrack_core.c) nf_ct_refresh_acct() for instance has to lock it. We really want some finer locking here. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-28 13:55 ` Eric Dumazet @ 2009-01-28 16:25 ` Patrick McHardy 2009-01-28 17:07 ` Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Patrick McHardy @ 2009-01-28 16:25 UTC (permalink / raw) To: Eric Dumazet Cc: Rick Jones, Netfilter Developers, Linux Network Development list, Stephen Hemminger Eric Dumazet wrote: > Rick Jones a écrit : >> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which >> does represent an improvement. >> > > Yes indeed you have a speedup, tcp conntracking is OK. > > You now hit the nf_conntrack_lock spinlock we have in generic conntrack code > (net/netfilter/nf_conntrack_core.c) > > nf_ct_refresh_acct() for instance has to lock it. > > We really want some finer locking here. That looks more complicated since it requires to take multiple locks occasionally (f.i. hash insertion, potentially helper-related and expectation-related stuff), and there is the unconfirmed_list, where fine-grained locking can't really be used without changing it to a hash. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-28 16:25 ` Patrick McHardy @ 2009-01-28 17:07 ` Eric Dumazet 2009-01-28 17:34 ` Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-28 17:07 UTC (permalink / raw) To: Patrick McHardy Cc: Rick Jones, Netfilter Developers, Linux Network Development list, Stephen Hemminger Patrick McHardy a écrit : > Eric Dumazet wrote: >> Rick Jones a écrit : >>> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which >>> does represent an improvement. >>> >> >> Yes indeed you have a speedup, tcp conntracking is OK. >> >> You now hit the nf_conntrack_lock spinlock we have in generic >> conntrack code (net/netfilter/nf_conntrack_core.c) >> >> nf_ct_refresh_acct() for instance has to lock it. >> >> We really want some finer locking here. > > That looks more complicated since it requires to take multiple locks > occasionally (f.i. hash insertion, potentially helper-related and > expectation-related stuff), and there is the unconfirmed_list, where > fine-grained locking can't really be used without changing it to > a hash. > Yes its more complicated, but look what we did in 2.6.29 for tcp/udp sockets, using RCU to have lockless lookups. Yes, we still take a lock when doing an insert or delete at socket bind/unbind time. We could keep a central nf_conntrack_lock to guard insertions/deletes from hash and unconfirmed_list. But *normal* packets that only need to change state of one particular connection could use RCU (without spinlock) to locate the conntrack, then lock the found conntrack to perform all state changes. ^ permalink raw reply [flat|nested] 42+ messages in thread
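To make the scheme Eric is describing concrete, here is a generic sketch of the pattern (made-up names, not the conntrack code, and it glosses over the unconfirmed list, helpers and expectations that Patrick points out): lookups run under RCU with no global lock, state changes are serialized on a per-object spinlock, and a central lock is kept only for the rare insert and remove paths.

#include <linux/rculist.h>
#include <linux/spinlock.h>

struct example_conn {
	struct hlist_node	node;
	spinlock_t		lock;	/* guards this connection's state only */
	unsigned long		key;
	bool			dead;	/* set by the remover, under ->lock */
	int			state;
};

static struct hlist_head example_hash[256];

/* find a connection without any global lock, then pin it with its own lock;
 * the object must be freed via RCU (or SLAB_DESTROY_BY_RCU plus a re-check)
 * for the window between lookup and spin_lock() to be safe */
static struct example_conn *example_lookup_and_lock(unsigned long key)
{
	struct example_conn *c;
	struct hlist_node *n;

	rcu_read_lock();
	/* 2.6.x-era four-argument form of the RCU hlist iterator */
	hlist_for_each_entry_rcu(c, n, &example_hash[key & 255], node) {
		if (c->key != key)
			continue;
		spin_lock(&c->lock);
		if (c->dead) {		/* lost a race with removal */
			spin_unlock(&c->lock);
			break;
		}
		rcu_read_unlock();
		return c;	/* caller updates ->state, then spin_unlock() */
	}
	rcu_read_unlock();
	return NULL;
}

Insertion and removal would still take the central lock; the remover sets ->dead and unlinks the node before handing the object to RCU for freeing.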
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-28 17:07 ` Eric Dumazet @ 2009-01-28 17:34 ` Eric Dumazet 2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet 2009-02-09 14:57 ` 32 core net-next stack/netfilter "scaling" Patrick McHardy 0 siblings, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2009-01-28 17:34 UTC (permalink / raw) To: Patrick McHardy Cc: Rick Jones, Netfilter Developers, Linux Network Development list, Stephen Hemminger Eric Dumazet a écrit : > Patrick McHardy a écrit : >> Eric Dumazet wrote: >>> Rick Jones a écrit : >>>> Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which >>>> does represent an improvement. >>>> >>> Yes indeed you have a speedup, tcp conntracking is OK. >>> >>> You now hit the nf_conntrack_lock spinlock we have in generic >>> conntrack code (net/netfilter/nf_conntrack_core.c) >>> >>> nf_ct_refresh_acct() for instance has to lock it. >>> >>> We really want some finer locking here. >> That looks more complicated since it requires to take multiple locks >> occasionally (f.i. hash insertion, potentially helper-related and >> expectation-related stuff), and there is the unconfirmed_list, where >> fine-grained locking can't really be used without changing it to >> a hash. >> > > Yes its more complicated, but look what we did in 2.6.29 for tcp/udp > sockets, using RCU to have lockless lookups. > Yes, we still take a lock when doing an insert or delete at socket > bind/unbind time. > > We could keep a central nf_conntrack_lock to guard insertions/deletes > from hash and unconfirmed_list. > > But *normal* packets that only need to change state of one particular > connection could use RCU (without spinlock) to locate the conntrack, > then lock the found conntrack to perform all state changes. Well... RCU is already used by conntrack :) Maybe only __nf_ct_refresh_acct() needs not taking nf_conntrack_lock -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-28 17:34 ` Eric Dumazet @ 2009-01-29 15:31 ` Eric Dumazet 2009-01-30 15:47 ` Andi Kleen 2009-02-20 10:02 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet 2009-02-09 14:57 ` 32 core net-next stack/netfilter "scaling" Patrick McHardy 1 sibling, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2009-01-29 15:31 UTC (permalink / raw) To: Patrick McHardy Cc: David S. Miller, Netfilter Developers, Linux Network Development list While doing oprofile tests I noticed two loops are not properly unrolled by gcc Using a hand coded unrolled loop provides nice speedup : ipt_do_table credited of 2.52 % of cpu instead of 3.29 % in tbench. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/ipv4/netfilter/ip_tables.c | 32 +++++++++++++++++-------------- 1 files changed, 18 insertions(+), 14 deletions(-) diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index ef8b6ca..7bdb2cf 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -65,6 +65,22 @@ do { \ #define inline #endif +static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask) +{ + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + return ret; +} + /* We keep a set of rules for each CPU, so we can avoid write-locking them in the softirq when updating the counters and therefore @@ -83,7 +99,6 @@ ip_packet_match(const struct iphdr *ip, const struct ipt_ip *ipinfo, int isfrag) { - size_t i; unsigned long ret; #define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg))) @@ -103,13 +118,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - /* Look for ifname matches; this should unroll nicely. */ - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)indev)[i] - ^ ((const unsigned long *)ipinfo->iniface)[i]) - & ((const unsigned long *)ipinfo->iniface_mask)[i]; - } - + ret = ifname_compare(indev, ipinfo->iniface, ipinfo->iniface_mask); if (FWINV(ret != 0, IPT_INV_VIA_IN)) { dprintf("VIA in mismatch (%s vs %s).%s\n", indev, ipinfo->iniface, @@ -117,12 +126,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)outdev)[i] - ^ ((const unsigned long *)ipinfo->outiface)[i]) - & ((const unsigned long *)ipinfo->outiface_mask)[i]; - } - + ret = ifname_compare(outdev, ipinfo->outiface, ipinfo->outiface_mask); if (FWINV(ret != 0, IPT_INV_VIA_OUT)) { dprintf("VIA out mismatch (%s vs %s).%s\n", outdev, ipinfo->outiface, ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet @ 2009-01-30 15:47 ` Andi Kleen 2009-01-30 16:54 ` Eric Dumazet 2009-02-20 10:02 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet 1 sibling, 1 reply; 42+ messages in thread From: Andi Kleen @ 2009-01-30 15:47 UTC (permalink / raw) To: Eric Dumazet Cc: Patrick McHardy, David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet <dada1@cosmosbay.com> writes: > While doing oprofile tests I noticed two loops are not properly unrolled by gcc That's because nobody passed -funroll-loops. Did you try that for that file? Likely will need -O2 too > +static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask) > +{ > + const unsigned long *a = (const unsigned long *)_a; > + const unsigned long *b = (const unsigned long *)_b; > + const unsigned long *mask = (const unsigned long *)_mask; > + unsigned long ret; > + > + ret = (a[0] ^ b[0]) & mask[0]; > + ret |= (a[1] ^ b[1]) & mask[1]; > + if (IFNAMSIZ > 2 * sizeof(unsigned long)) > + ret |= (a[2] ^ b[2]) & mask[2]; > + if (IFNAMSIZ > 3 * sizeof(unsigned long)) > + ret |= (a[3] ^ b[3]) & mask[3]; That will silently break for IFNAMSIZ >= 4*sizeof(unsigned long) You should add a dummy loop for that or at least a BUILD_BUG_ON -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-30 15:47 ` Andi Kleen @ 2009-01-30 16:54 ` Eric Dumazet 2009-01-30 17:27 ` Andi Kleen 0 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-30 16:54 UTC (permalink / raw) To: Andi Kleen Cc: Patrick McHardy, David S. Miller, Netfilter Developers, Linux Network Development list Andi Kleen a écrit : > Eric Dumazet <dada1@cosmosbay.com> writes: > >> While doing oprofile tests I noticed two loops are not properly unrolled by gcc > > That's because nobody passed -funroll-loops. Did you try that for > that file? Likely will need -O2 too I dont want to unroll all loops, only those two :) I wish gcc (4.3.2 here) was litle bit smarter :( Without using -funroll-loops size ipv4/netfilter/ip_tables.o text data bss dec hex filename 6424 368 16 6808 1a98 net/ipv4/netfilter/ip_tables.o With -funroll-loops : size ipv4/netfilter/ip_tables.o text data bss dec hex filename 7144 368 16 7528 1d68 net/ipv4/netfilter/ip_tables.o With my patch and no -funroll-loops text data bss dec hex filename 6488 368 16 6872 1ad8 net/ipv4/netfilter/ip_tables.o > >> +static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask) >> +{ >> + const unsigned long *a = (const unsigned long *)_a; >> + const unsigned long *b = (const unsigned long *)_b; >> + const unsigned long *mask = (const unsigned long *)_mask; >> + unsigned long ret; >> + >> + ret = (a[0] ^ b[0]) & mask[0]; >> + ret |= (a[1] ^ b[1]) & mask[1]; >> + if (IFNAMSIZ > 2 * sizeof(unsigned long)) >> + ret |= (a[2] ^ b[2]) & mask[2]; >> + if (IFNAMSIZ > 3 * sizeof(unsigned long)) >> + ret |= (a[3] ^ b[3]) & mask[3]; > > That will silently break for IFNAMSIZ >= 4*sizeof(unsigned long) > You should add a dummy loop for that or at least a BUILD_BUG_ON It will also break the day we port linux to a 128 bits machine :) Thanks Andi (By the way, I still use the patch on arch/x86/oprofile/op_model_ppro.c to have a working oprofile on my dev machine...) 
[PATCH] netfilter: unfold two critical loops in ip_packet_match() While doing oprofile tests I noticed two loops are not properly unrolled by gcc Using hand coded unrolled loop provides nice speedup : ipt_do_table credited of 2.52 % of cpu instead of 3.29 % in tbench, for a small text size increase (62 bytes for both loops) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/ipv4/netfilter/ip_tables.c | 34 ++++++++++++++++++------------- 1 files changed, 20 insertions(+), 14 deletions(-) diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index ef8b6ca..9298d0a 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -65,6 +65,24 @@ do { \ #define inline #endif +static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask) +{ + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + if (IFNAMSIZ > sizeof(unsigned long)) + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long)); + return ret; +} + /* We keep a set of rules for each CPU, so we can avoid write-locking them in the softirq when updating the counters and therefore @@ -83,7 +101,6 @@ ip_packet_match(const struct iphdr *ip, const struct ipt_ip *ipinfo, int isfrag) { - size_t i; unsigned long ret; #define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg))) @@ -103,13 +120,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - /* Look for ifname matches; this should unroll nicely. */ - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)indev)[i] - ^ ((const unsigned long *)ipinfo->iniface)[i]) - & ((const unsigned long *)ipinfo->iniface_mask)[i]; - } - + ret = ifname_compare(indev, ipinfo->iniface, ipinfo->iniface_mask); if (FWINV(ret != 0, IPT_INV_VIA_IN)) { dprintf("VIA in mismatch (%s vs %s).%s\n", indev, ipinfo->iniface, @@ -117,12 +128,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)outdev)[i] - ^ ((const unsigned long *)ipinfo->outiface)[i]) - & ((const unsigned long *)ipinfo->outiface_mask)[i]; - } - + ret = ifname_compare(outdev, ipinfo->outiface, ipinfo->outiface_mask); if (FWINV(ret != 0, IPT_INV_VIA_OUT)) { dprintf("VIA out mismatch (%s vs %s).%s\n", outdev, ipinfo->outiface, -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-30 16:54 ` Eric Dumazet @ 2009-01-30 17:27 ` Andi Kleen 2009-01-30 17:27 ` Eric Dumazet 2009-02-09 13:41 ` Patrick McHardy 0 siblings, 2 replies; 42+ messages in thread From: Andi Kleen @ 2009-01-30 17:27 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, Patrick McHardy, David S. Miller, Netfilter Developers, Linux Network Development list On Fri, Jan 30, 2009 at 05:54:10PM +0100, Eric Dumazet wrote: > Andi Kleen a écrit : > > Eric Dumazet <dada1@cosmosbay.com> writes: > > > >> While doing oprofile tests I noticed two loops are not properly unrolled by gcc > > > > That's because nobody passed -funroll-loops. Did you try that for > > that file? Likely will need -O2 too > > I dont want to unroll all loops, only those two :) gcc 4.4 will have a way to do that per function, but earlier you would need to move it to a separate file and specify the option only for that. Doing so would be still a good idea compared to your patch because the code will be cleaner and might be more adaptable to future architectures (such manual tunings tend to outdate) > I wish gcc (4.3.2 here) was litle bit smarter :( It cannot do much without profile feedback because it has no clue which loops are hot and which are not. > (By the way, I still use the patch on arch/x86/oprofile/op_model_ppro.c > to have a working oprofile on my dev machine...) Yes I know, sorry for that. -Andi -- ak@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-30 17:27 ` Andi Kleen @ 2009-01-30 17:27 ` Eric Dumazet 2009-01-30 17:50 ` Andi Kleen 2009-02-09 13:41 ` Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-01-30 17:27 UTC (permalink / raw) To: Andi Kleen Cc: Patrick McHardy, David S. Miller, Netfilter Developers, Linux Network Development list Andi Kleen a écrit : > On Fri, Jan 30, 2009 at 05:54:10PM +0100, Eric Dumazet wrote: >> Andi Kleen a écrit : >>> Eric Dumazet <dada1@cosmosbay.com> writes: >>> >>>> While doing oprofile tests I noticed two loops are not properly unrolled by gcc >>> That's because nobody passed -funroll-loops. Did you try that for >>> that file? Likely will need -O2 too >> I dont want to unroll all loops, only those two :) > > gcc 4.4 will have a way to do that per function, but earlier > you would need to move it to a separate file and specify > the option only for that. > > Doing so would be still a good idea compared to your > patch because the code will be cleaner and might > be more adaptable to future architectures > (such manual tunings tend to outdate) So... you suggest me to split file, and use a "-funroll-loops", that might doing strange things on some arches / compilers... This is far too complicated and risky imho. Check compare_ether_addr_64bits() definition in include/linux/etherdevice.h for a truly unreadable code :) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-30 17:27 ` Eric Dumazet @ 2009-01-30 17:50 ` Andi Kleen 0 siblings, 0 replies; 42+ messages in thread From: Andi Kleen @ 2009-01-30 17:50 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, Patrick McHardy, David S. Miller, Netfilter Developers, Linux Network Development list > So... you suggest me to split file, and use a "-funroll-loops", > that might doing strange things on some arches / compilers... I don't think it will. -funroll-loops is pretty clearly defined. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 42+ messages in thread
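For reference, the per-function mechanism Andi mentioned earlier in this subthread is the optimize attribute that arrived with GCC 4.4; the sketch below shows what using it here could look like. This is an illustration, not a proposed patch: the function name is made up, and whether the attribute interacts cleanly with the kernel's normal compiler flags is a separate question. The per-file alternative would be a kbuild line such as CFLAGS_ip_tables.o += -funroll-loops in net/ipv4/netfilter/Makefile.

#ifndef IFNAMSIZ
#define IFNAMSIZ 16	/* matches the kernel's definition */
#endif

/* ask gcc (>= 4.4) to apply -funroll-loops to this one function only */
static unsigned long __attribute__((optimize("unroll-loops")))
ifname_compare_attr(const char *_a, const char *_b, const char *_mask)
{
	const unsigned long *a = (const unsigned long *)_a;
	const unsigned long *b = (const unsigned long *)_b;
	const unsigned long *mask = (const unsigned long *)_mask;
	unsigned long ret = 0;
	unsigned int i;

	/* the same loop the patches replace; gcc is now free to unroll it */
	for (i = 0; i < IFNAMSIZ / sizeof(unsigned long); i++)
		ret |= (a[i] ^ b[i]) & mask[i];
	return ret;
}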
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-30 17:27 ` Andi Kleen 2009-01-30 17:27 ` Eric Dumazet @ 2009-02-09 13:41 ` Patrick McHardy 2009-02-18 15:10 ` Eric Dumazet 1 sibling, 1 reply; 42+ messages in thread From: Patrick McHardy @ 2009-02-09 13:41 UTC (permalink / raw) To: Andi Kleen Cc: Eric Dumazet, David S. Miller, Netfilter Developers, Linux Network Development list Andi Kleen wrote: > On Fri, Jan 30, 2009 at 05:54:10PM +0100, Eric Dumazet wrote: >> Andi Kleen a écrit : >>> Eric Dumazet <dada1@cosmosbay.com> writes: >>> >>>> While doing oprofile tests I noticed two loops are not properly unrolled by gcc >>> That's because nobody passed -funroll-loops. Did you try that for >>> that file? Likely will need -O2 too >> I dont want to unroll all loops, only those two :) > > gcc 4.4 will have a way to do that per function, but earlier > you would need to move it to a separate file and specify > the option only for that. > > Doing so would be still a good idea compared to your > patch because the code will be cleaner and might > be more adaptable to future architectures > (such manual tunings tend to outdate) The interface name matching has shown up in profiles forever though and we've actually already tried to optimize it IIRC. Eric, I'm trying to keep all the *tables files synchronized, could you send me a patch updating the other ones as well please? -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-02-09 13:41 ` Patrick McHardy @ 2009-02-18 15:10 ` Eric Dumazet 2009-02-18 15:21 ` Patrick McHardy 0 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-02-18 15:10 UTC (permalink / raw) To: Patrick McHardy Cc: Andi Kleen, David S. Miller, Netfilter Developers, Linux Network Development list Patrick McHardy a écrit : > > The interface name matching has shown up in profiles forever > though and we've actually already tried to optimize it IIRC. > > Eric, I'm trying to keep all the *tables files synchronized, > could you send me a patch updating the other ones as well > please? While doing this, I found arp_tables is still using loop using byte operations. Also, I could not find how iniface_mask[], outiface_mask[], iniface[] and outiface[] were forced to long word alignment ... (in struct ipt_ip, struct ip6t_ip6, struct arpt_arp) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-02-18 15:10 ` Eric Dumazet @ 2009-02-18 15:21 ` Patrick McHardy 2009-02-18 16:33 ` Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Patrick McHardy @ 2009-02-18 15:21 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > Patrick McHardy a écrit : >> The interface name matching has shown up in profiles forever >> though and we've actually already tried to optimize it IIRC. >> >> Eric, I'm trying to keep all the *tables files synchronized, >> could you send me a patch updating the other ones as well >> please? > > While doing this, I found arp_tables is still using loop using > byte operations. > > Also, I could not find how iniface_mask[], outiface_mask[], iniface[] > and outiface[] were forced to long word alignment ... > > (in struct ipt_ip, struct ip6t_ip6, struct arpt_arp) In case of IPv4 and IPv6 they are already suitable aligned, it simply performing the comparison in unsigned long quantities. struct arpt_arp unfortunately doesn't properly align the interface names, so we need to continue to do byte-wise comparisons. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-02-18 15:21 ` Patrick McHardy @ 2009-02-18 16:33 ` Eric Dumazet 2009-02-18 16:52 ` Patrick McHardy 2009-02-18 17:36 ` [PATCH] netfilter: xt_physdev fixes Eric Dumazet 0 siblings, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2009-02-18 16:33 UTC (permalink / raw) To: Patrick McHardy Cc: Andi Kleen, David S. Miller, Netfilter Developers, Linux Network Development list Patrick McHardy a écrit : > Eric Dumazet wrote: >> Patrick McHardy a écrit : >>> The interface name matching has shown up in profiles forever >>> though and we've actually already tried to optimize it IIRC. >>> >>> Eric, I'm trying to keep all the *tables files synchronized, >>> could you send me a patch updating the other ones as well >>> please? >> >> While doing this, I found arp_tables is still using loop using >> byte operations. >> >> Also, I could not find how iniface_mask[], outiface_mask[], iniface[] >> and outiface[] were forced to long word alignment ... >> (in struct ipt_ip, struct ip6t_ip6, struct arpt_arp) > > In case of IPv4 and IPv6 they are already suitable aligned, it > simply performing the comparison in unsigned long quantities. > struct arpt_arp unfortunately doesn't properly align the interface > names, so we need to continue to do byte-wise comparisons. > > I see, but #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS can help here ;) ifname_compare() should be static in three files (ipv4_ip_tables, ipv6_ip_tables and arp_tables), since only arp_tables variant has the alignement problem. [PATCH] netfilter: unfold two critical loops in arp_packet_match() x86 and powerpc can perform long word accesses in an efficient maner. We can use this to unroll two loops in arp_packet_match(), to perform arithmetic on long words instead of bytes. This is a win on x86_64 for example. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/ipv4/netfilter/arp_tables.c | 44 +++++++++++++++++++++++------- 1 files changed, 34 insertions(+), 10 deletions(-) diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 7ea88b6..b5db463 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -73,6 +73,36 @@ static inline int arp_devaddr_compare(const struct arpt_devaddr_info *ap, return (ret != 0); } +/* + * Unfortunatly, _b and _mask are not aligned to an int (or long int) + * Some arches dont care, unrolling the loop is a win on them. + */ +static unsigned long ifname_compare(const char *_a, const char *_b, const char *_mask) +{ +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + if (IFNAMSIZ > sizeof(unsigned long)) + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long)); +#else + unsigned long ret = 0; + int i; + + for (i = 0; i < IFNAMSIZ; i++) + ret |= (_a[i] ^ _b[i]) & _mask[i]; +#endif + return ret; +} + /* Returns whether packet matches rule or not. 
*/ static inline int arp_packet_match(const struct arphdr *arphdr, struct net_device *dev, @@ -83,7 +113,7 @@ static inline int arp_packet_match(const struct arphdr *arphdr, const char *arpptr = (char *)(arphdr + 1); const char *src_devaddr, *tgt_devaddr; __be32 src_ipaddr, tgt_ipaddr; - int i, ret; + long ret; #define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg))) @@ -156,10 +186,7 @@ static inline int arp_packet_match(const struct arphdr *arphdr, } /* Look for ifname matches. */ - for (i = 0, ret = 0; i < IFNAMSIZ; i++) { - ret |= (indev[i] ^ arpinfo->iniface[i]) - & arpinfo->iniface_mask[i]; - } + ret = ifname_compare(indev, arpinfo->iniface, arpinfo->iniface_mask); if (FWINV(ret != 0, ARPT_INV_VIA_IN)) { dprintf("VIA in mismatch (%s vs %s).%s\n", @@ -168,10 +195,7 @@ static inline int arp_packet_match(const struct arphdr *arphdr, return 0; } - for (i = 0, ret = 0; i < IFNAMSIZ; i++) { - ret |= (outdev[i] ^ arpinfo->outiface[i]) - & arpinfo->outiface_mask[i]; - } + ret = ifname_compare(outdev, arpinfo->outiface, arpinfo->outiface_mask); if (FWINV(ret != 0, ARPT_INV_VIA_OUT)) { dprintf("VIA out mismatch (%s vs %s).%s\n", @@ -221,7 +245,7 @@ unsigned int arpt_do_table(struct sk_buff *skb, const struct net_device *out, struct xt_table *table) { - static const char nulldevname[IFNAMSIZ]; + static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long)))); unsigned int verdict = NF_DROP; const struct arphdr *arp; bool hotdrop = false; ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-02-18 16:33 ` Eric Dumazet @ 2009-02-18 16:52 ` Patrick McHardy 2009-02-18 17:36 ` [PATCH] netfilter: xt_physdev fixes Eric Dumazet 1 sibling, 0 replies; 42+ messages in thread From: Patrick McHardy @ 2009-02-18 16:52 UTC (permalink / raw) To: Eric Dumazet Cc: Andi Kleen, David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > Patrick McHardy a écrit : >> In case of IPv4 and IPv6 they are already suitable aligned, it >> simply performing the comparison in unsigned long quantities. >> struct arpt_arp unfortunately doesn't properly align the interface >> names, so we need to continue to do byte-wise comparisons. >> > I see, but #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS can help here ;) get_unaligned() would work as well I guess. But we don't seem to have a get_unaligned_long(). > ifname_compare() should be static in three files (ipv4_ip_tables, ipv6_ip_tables and arp_tables), > since only arp_tables variant has the alignement problem. > > [PATCH] netfilter: unfold two critical loops in arp_packet_match() > > x86 and powerpc can perform long word accesses in an efficient maner. > We can use this to unroll two loops in arp_packet_match(), to > perform arithmetic on long words instead of bytes. This is a win > on x86_64 for example. This looks good to me. Applied, thanks. ^ permalink raw reply [flat|nested] 42+ messages in thread
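As an aside, a get_unaligned_long()-style helper of the kind Patrick says is missing could be sketched on top of the existing get_unaligned() macro. Everything below is hypothetical (the thread does not propose it), and Eric's applied patch sidesteps the question with CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS instead.

#include <linux/if.h>
#include <asm/unaligned.h>

/* hypothetical helper, not present in the tree at the time of the thread */
static inline unsigned long get_unaligned_long_example(const void *p)
{
	return get_unaligned((const unsigned long *)p);
}

/* how arp_packet_match() could then use word compares even though
 * struct arpt_arp does not long-align the interface names */
static unsigned long ifname_compare_unaligned_example(const char *_a,
		const char *_b, const char *_mask)
{
	unsigned long ret = 0;
	unsigned int i;

	for (i = 0; i < IFNAMSIZ; i += sizeof(unsigned long))
		ret |= (get_unaligned_long_example(_a + i) ^
			get_unaligned_long_example(_b + i)) &
		       get_unaligned_long_example(_mask + i);
	return ret;
}

On architectures without efficient unaligned loads this would expand to byte-wise accesses anyway, so it is not obviously better than the byte loop it replaces.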
* [PATCH] netfilter: xt_physdev fixes 2009-02-18 16:33 ` Eric Dumazet 2009-02-18 16:52 ` Patrick McHardy @ 2009-02-18 17:36 ` Eric Dumazet 2009-02-18 18:14 ` Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-02-18 17:36 UTC (permalink / raw) To: Patrick McHardy Cc: David S. Miller, Netfilter Developers, Linux Network Development list Hi Patrick Please find following patch to prepare next one (loop unrolling) I am not sure xt_physdev_info is aligned, either on an int or a long, so please check my assertion before applying :) Thank you [PATCH] netfilter: xt_physdev fixes 1) physdev_mt() incorrectly assumes nulldevname[] is aligned on an int 2) It also uses word comparisons, while it could use long word ones. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c index 1bcdfc1..4b13ef7 100644 --- a/net/netfilter/xt_physdev.c +++ b/net/netfilter/xt_physdev.c @@ -24,9 +24,9 @@ static bool physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par) { int i; - static const char nulldevname[IFNAMSIZ]; + static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long)))); const struct xt_physdev_info *info = par->matchinfo; - bool ret; + unsigned long ret; const char *indev, *outdev; const struct nf_bridge_info *nf_bridge; @@ -68,10 +68,10 @@ physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par) if (!(info->bitmask & XT_PHYSDEV_OP_IN)) goto match_outdev; indev = nf_bridge->physindev ? nf_bridge->physindev->name : nulldevname; - for (i = 0, ret = false; i < IFNAMSIZ/sizeof(unsigned int); i++) { - ret |= (((const unsigned int *)indev)[i] - ^ ((const unsigned int *)info->physindev)[i]) - & ((const unsigned int *)info->in_mask)[i]; + for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { + ret |= (((const unsigned long *)indev)[i] + ^ ((const unsigned long *)info->physindev)[i]) + & ((const unsigned long *)info->in_mask)[i]; } if (!ret ^ !(info->invert & XT_PHYSDEV_OP_IN)) @@ -82,13 +82,12 @@ match_outdev: return true; outdev = nf_bridge->physoutdev ? nf_bridge->physoutdev->name : nulldevname; - for (i = 0, ret = false; i < IFNAMSIZ/sizeof(unsigned int); i++) { - ret |= (((const unsigned int *)outdev)[i] - ^ ((const unsigned int *)info->physoutdev)[i]) - & ((const unsigned int *)info->out_mask)[i]; + for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { + ret |= (((const unsigned long *)outdev)[i] + ^ ((const unsigned long *)info->physoutdev)[i]) + & ((const unsigned long *)info->out_mask)[i]; } - - return ret ^ !(info->invert & XT_PHYSDEV_OP_OUT); + return (!!ret ^ !(info->invert & XT_PHYSDEV_OP_OUT)); } static bool physdev_mt_check(const struct xt_mtchk_param *par) ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: xt_physdev fixes 2009-02-18 17:36 ` [PATCH] netfilter: xt_physdev fixes Eric Dumazet @ 2009-02-18 18:14 ` Patrick McHardy 2009-02-19 8:00 ` [PATCH] netfilter: unfold two loops in physdev_mt() Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Patrick McHardy @ 2009-02-18 18:14 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > Hi Patrick > > Please find following patch to prepare next one (loop unrolling) > > I am not sure xt_physdev_info is aligned, either on an int or a long, > so please check my assertion before applying :) Yes, the private structures can assume alignment suitable for pointers and unsigned longs. > [PATCH] netfilter: xt_physdev fixes > > 1) physdev_mt() incorrectly assumes nulldevname[] is aligned on an int > > 2) It also uses word comparisons, while it could use long word ones. Applied, thanks. ^ permalink raw reply [flat|nested] 42+ messages in thread
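The alignment guarantee Patrick relies on comes from the way the *tables frameworks round every per-rule data blob up to the alignment of the widest scalar. The macro below is a generic sketch of that idea, not a quote of the kernel's own definition.

/* round a matchinfo/targinfo size up so the next structure in the rule
 * blob starts on an 8-byte boundary again (64-bit build assumed) */
#define EXAMPLE_XT_ALIGN(s) \
	(((s) + __alignof__(unsigned long long) - 1) & \
	 ~(__alignof__(unsigned long long) - 1))

/* e.g. EXAMPLE_XT_ALIGN(20) == 24: a 20-byte payload is padded so that
 * whatever follows it in the rule blob is long-aligned again */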
* [PATCH] netfilter: unfold two loops in physdev_mt() 2009-02-18 18:14 ` Patrick McHardy @ 2009-02-19 8:00 ` Eric Dumazet 2009-02-19 8:14 ` [PATCH] netfilter: unfold two loops in ip6_packet_match() Eric Dumazet 2009-02-19 10:17 ` [PATCH] netfilter: unfold two loops in physdev_mt() Patrick McHardy 0 siblings, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2009-02-19 8:00 UTC (permalink / raw) To: Patrick McHardy Cc: David S. Miller, Netfilter Developers, Linux Network Development list xt_physdev netfilter module can use an ifname_compare() helper so that two loops are unfolded. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/netfilter/xt_physdev.c | 32 +++++++++++++++++++++----------- 1 files changed, 21 insertions(+), 11 deletions(-) diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c index 4b13ef7..44a234e 100644 --- a/net/netfilter/xt_physdev.c +++ b/net/netfilter/xt_physdev.c @@ -20,10 +20,27 @@ MODULE_DESCRIPTION("Xtables: Bridge physical device match"); MODULE_ALIAS("ipt_physdev"); MODULE_ALIAS("ip6t_physdev"); +static unsigned long ifname_compare(const char *_a, const char *_b, const char *_mask) +{ + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + if (IFNAMSIZ > sizeof(unsigned long)) + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long)); + return ret; +} + static bool physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par) { - int i; static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long)))); const struct xt_physdev_info *info = par->matchinfo; unsigned long ret; @@ -68,11 +85,7 @@ physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par) if (!(info->bitmask & XT_PHYSDEV_OP_IN)) goto match_outdev; indev = nf_bridge->physindev ? nf_bridge->physindev->name : nulldevname; - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)indev)[i] - ^ ((const unsigned long *)info->physindev)[i]) - & ((const unsigned long *)info->in_mask)[i]; - } + ret = ifname_compare(indev, info->physindev, info->in_mask); if (!ret ^ !(info->invert & XT_PHYSDEV_OP_IN)) return false; @@ -82,11 +95,8 @@ match_outdev: return true; outdev = nf_bridge->physoutdev ? nf_bridge->physoutdev->name : nulldevname; - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)outdev)[i] - ^ ((const unsigned long *)info->physoutdev)[i]) - & ((const unsigned long *)info->out_mask)[i]; - } + ret = ifname_compare(outdev, info->physoutdev, info->out_mask); + return (!!ret ^ !(info->invert & XT_PHYSDEV_OP_OUT)); } ^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH] netfilter: unfold two loops in ip6_packet_match() 2009-02-19 8:00 ` [PATCH] netfilter: unfold two loops in physdev_mt() Eric Dumazet @ 2009-02-19 8:14 ` Eric Dumazet 2009-02-19 10:19 ` Patrick McHardy 2009-02-19 10:17 ` [PATCH] netfilter: unfold two loops in physdev_mt() Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-02-19 8:14 UTC (permalink / raw) To: Patrick McHardy Cc: David S. Miller, Netfilter Developers, Linux Network Development list ip6_tables netfilter module can use an ifname_compare() helper so that two loops are unfolded. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/ipv6/netfilter/ip6_tables.c | 33 +++++++++++++++++++----------- 1 files changed, 21 insertions(+), 12 deletions(-) diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index a33485d..d64594b 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -89,6 +89,25 @@ ip6t_ext_hdr(u8 nexthdr) (nexthdr == IPPROTO_DSTOPTS) ); } +static unsigned long ifname_compare(const char *_a, const char *_b, + const unsigned char *_mask) +{ + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + if (IFNAMSIZ > sizeof(unsigned long)) + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long)); + return ret; +} + /* Returns whether matches rule or not. */ /* Performance critical - called for every packet */ static inline bool @@ -99,7 +118,6 @@ ip6_packet_match(const struct sk_buff *skb, unsigned int *protoff, int *fragoff, bool *hotdrop) { - size_t i; unsigned long ret; const struct ipv6hdr *ipv6 = ipv6_hdr(skb); @@ -120,12 +138,7 @@ ip6_packet_match(const struct sk_buff *skb, return false; } - /* Look for ifname matches; this should unroll nicely. */ - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)indev)[i] - ^ ((const unsigned long *)ip6info->iniface)[i]) - & ((const unsigned long *)ip6info->iniface_mask)[i]; - } + ret = ifname_compare(indev, ip6info->iniface, ip6info->iniface_mask); if (FWINV(ret != 0, IP6T_INV_VIA_IN)) { dprintf("VIA in mismatch (%s vs %s).%s\n", @@ -134,11 +147,7 @@ ip6_packet_match(const struct sk_buff *skb, return false; } - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)outdev)[i] - ^ ((const unsigned long *)ip6info->outiface)[i]) - & ((const unsigned long *)ip6info->outiface_mask)[i]; - } + ret = ifname_compare(outdev, ip6info->outiface, ip6info->outiface_mask); if (FWINV(ret != 0, IP6T_INV_VIA_OUT)) { dprintf("VIA out mismatch (%s vs %s).%s\n", ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two loops in ip6_packet_match() 2009-02-19 8:14 ` [PATCH] netfilter: unfold two loops in ip6_packet_match() Eric Dumazet @ 2009-02-19 10:19 ` Patrick McHardy 0 siblings, 0 replies; 42+ messages in thread From: Patrick McHardy @ 2009-02-19 10:19 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > ip6_tables netfilter module can use an ifname_compare() helper > so that two loops are unfolded. Also applied, thanks Eric. Are you going to send a new patch for ip_tables.c or should I just take the old one? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two loops in physdev_mt() 2009-02-19 8:00 ` [PATCH] netfilter: unfold two loops in physdev_mt() Eric Dumazet 2009-02-19 8:14 ` [PATCH] netfilter: unfold two loops in ip6_packet_match() Eric Dumazet @ 2009-02-19 10:17 ` Patrick McHardy 1 sibling, 0 replies; 42+ messages in thread From: Patrick McHardy @ 2009-02-19 10:17 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > xt_physdev netfilter module can use an ifname_compare() helper > so that two loops are unfolded. Applied, thanks. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet 2009-01-30 15:47 ` Andi Kleen @ 2009-02-20 10:02 ` Eric Dumazet 2009-02-20 10:04 ` Patrick McHardy 1 sibling, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2009-02-20 10:02 UTC (permalink / raw) To: Patrick McHardy Cc: David S. Miller, Netfilter Developers, Linux Network Development list While doing oprofile tests I noticed two loops are not properly unrolled by gcc Using a hand coded unrolled loop provides nice speedup : ipt_do_table credited of 2.52 % of cpu instead of 3.29 % in tbench. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index 08cde5b..e5294ae 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -74,6 +74,25 @@ do { \ Hence the start of any table is given by get_table() below. */ +static unsigned long ifname_compare(const char *_a, const char *_b, + const unsigned char *_mask) +{ + const unsigned long *a = (const unsigned long *)_a; + const unsigned long *b = (const unsigned long *)_b; + const unsigned long *mask = (const unsigned long *)_mask; + unsigned long ret; + + ret = (a[0] ^ b[0]) & mask[0]; + if (IFNAMSIZ > sizeof(unsigned long)) + ret |= (a[1] ^ b[1]) & mask[1]; + if (IFNAMSIZ > 2 * sizeof(unsigned long)) + ret |= (a[2] ^ b[2]) & mask[2]; + if (IFNAMSIZ > 3 * sizeof(unsigned long)) + ret |= (a[3] ^ b[3]) & mask[3]; + BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long)); + return ret; +} + /* Returns whether matches rule or not. */ /* Performance critical - called for every packet */ static inline bool @@ -83,7 +102,6 @@ ip_packet_match(const struct iphdr *ip, const struct ipt_ip *ipinfo, int isfrag) { - size_t i; unsigned long ret; #define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg))) @@ -103,12 +121,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - /* Look for ifname matches; this should unroll nicely. */ - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)indev)[i] - ^ ((const unsigned long *)ipinfo->iniface)[i]) - & ((const unsigned long *)ipinfo->iniface_mask)[i]; - } + ret = ifname_compare(indev, ipinfo->iniface, ipinfo->iniface_mask); if (FWINV(ret != 0, IPT_INV_VIA_IN)) { dprintf("VIA in mismatch (%s vs %s).%s\n", @@ -117,11 +130,7 @@ ip_packet_match(const struct iphdr *ip, return false; } - for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) { - ret |= (((const unsigned long *)outdev)[i] - ^ ((const unsigned long *)ipinfo->outiface)[i]) - & ((const unsigned long *)ipinfo->outiface_mask)[i]; - } + ret = ifname_compare(outdev, ipinfo->outiface, ipinfo->outiface_mask); if (FWINV(ret != 0, IPT_INV_VIA_OUT)) { dprintf("VIA out mismatch (%s vs %s).%s\n", ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match() 2009-02-20 10:02 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet @ 2009-02-20 10:04 ` Patrick McHardy 0 siblings, 0 replies; 42+ messages in thread From: Patrick McHardy @ 2009-02-20 10:04 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, Netfilter Developers, Linux Network Development list Eric Dumazet wrote: > While doing oprofile tests I noticed two loops are not properly unrolled by gcc > > Using a hand coded unrolled loop provides nice speedup : ipt_do_table > credited of 2.52 % of cpu instead of 3.29 % in tbench. Applied, thanks Eric. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-28 17:34 ` Eric Dumazet 2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet @ 2009-02-09 14:57 ` Patrick McHardy 1 sibling, 0 replies; 42+ messages in thread From: Patrick McHardy @ 2009-02-09 14:57 UTC (permalink / raw) To: Eric Dumazet Cc: Rick Jones, Netfilter Developers, Linux Network Development list, Stephen Hemminger Eric Dumazet wrote: > Eric Dumazet a écrit : >> Patrick McHardy a écrit : >>> That looks more complicated since it requires to take multiple locks >>> occasionally (f.i. hash insertion, potentially helper-related and >>> expectation-related stuff), and there is the unconfirmed_list, where >>> fine-grained locking can't really be used without changing it to >>> a hash. >>> >> Yes its more complicated, but look what we did in 2.6.29 for tcp/udp >> sockets, using RCU to have lockless lookups. >> Yes, we still take a lock when doing an insert or delete at socket >> bind/unbind time. >> >> We could keep a central nf_conntrack_lock to guard insertions/deletes >> from hash and unconfirmed_list. >> >> But *normal* packets that only need to change state of one particular >> connection could use RCU (without spinlock) to locate the conntrack, >> then lock the found conntrack to perform all state changes. > > Well... RCU is already used by conntrack :) > > Maybe only __nf_ct_refresh_acct() needs not taking nf_conntrack_lock The lock is currently used to avoid timer update races. Martin has two old patches to remove it, but they require changes to the timer code to support a mod_timer variant that doesn't enable an inactive timer: http://people.netfilter.org/gandalf/performance/patches/mod_timer_noact http://people.netfilter.org/gandalf/performance/patches/__nf_ct_refresh_acct-locking -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
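To make the race Patrick describes concrete, here is a sketch with a made-up helper: the behaviour given to example_mod_timer_noact() is only inferred from his one-line description of the referenced mod_timer_noact patch (touch the timeout only if the timer is still pending), and the open-coded version below is deliberately simplistic, which is exactly why the real thing needs support inside the timer code.

#include <linux/timer.h>
#include <linux/jiffies.h>
#include <net/netfilter/nf_conntrack.h>

/* hypothetical: update the expiry only if the timer has not already been
 * deleted by the teardown path; a real version must do this atomically
 * under the timer base lock rather than in two separate steps */
static bool example_mod_timer_noact(struct timer_list *timer,
				    unsigned long expires)
{
	if (!timer_pending(timer))
		return false;
	mod_timer(timer, expires);
	return true;
}

/* what a refresh path could then look like without nf_conntrack_lock:
 * a conntrack whose timer is already gone is simply left alone */
static void example_ct_refresh(struct nf_conn *ct, unsigned long extra_jiffies)
{
	example_mod_timer_noact(&ct->timeout, jiffies + extra_jiffies);
}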
* Re: 32 core net-next stack/netfilter "scaling" 2009-01-26 23:10 ` Eric Dumazet 2009-01-26 23:14 ` Stephen Hemminger 2009-01-26 23:19 ` Rick Jones @ 2009-02-10 18:44 ` Stephen Hemminger 2 siblings, 0 replies; 42+ messages in thread From: Stephen Hemminger @ 2009-02-10 18:44 UTC (permalink / raw) To: Eric Dumazet Cc: Rick Jones, Linux Network Development list, Netfilter Developers, Patrick McHardy On Tue, 27 Jan 2009 00:10:44 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > Rick Jones a écrit : > > Folks - > > > > Under: > > > > ftp://ftp.netperf.org/iptable_scaling > > > > can be found netperf results and Caliper profiles for three scenarios on > > a 32-core, 1.6 GHz 'Montecito' rx8640 system. An rx8640 is what HP call > > a "cell based" system in that it is comprised of "cell boards" on which > > reside CPU and memory resources. In this case there are four cell > > boards, each with 4, dual-core Montecito processors and 1/4 of the > > overall RAM. The system was configured with a mix of cell-local and > > global interleaved memory, where the global interleave is on a cacheline > > (128 byte) boundary (IIRC). Total RAM in the system is 256 GB. The > > cells are joined via cross-bar connections. (numactl --hardware output > > is available under the URL above) > > > > There was an "I/O expander" connected to the system. This meant there > > were as many distinct PCI-X domains as there were cells, and every cell > > had a "local" set of PCI-X slots. > > > > Into those slots I placed four HP AD385A PCI-X 10Gbit Ethernet NICs - > > aka Neterion XFrame IIs. These were then connected to an HP ProCurve > > 5806 switch, which was in turn connected to three, 4P/16C, 2.3 GHz HP > > DL585 G5s, each of which had a pair of HP AD386A PCIe 10Gbit Ethernet > > NICs (Aka Chelsio T3C-based). They were running RHEL 5.2 I think. Each > > NIC was in either a PCI-X 2.0 266 MHz slot (rx8640) or a PCIe 1.mumble > > x8 slot (DL585 G5) > > > > The kernel is from DaveM's net-next tree ca last week, multiq enabled. > > The s2io driver is Neterion's out-of-tree version 2.0.36.15914 to get > > multiq support. It was loaded into the kernel via: > > > > insmod ./s2io.ko tx_steering_type=3 tx_fifo_num=8 > > > > There were then 8 tx queues and 8 rx queues per interface in the > > rx8640. The "setaffinity.txt" script was used to set the IRQ affinities > > to cores "closest" to the physical NIC. In all three tests all 32 cores > > went to 100% utilization. At least for all incense and porpoises. (there > > was some occasional idle reported by top on the full_iptables run) > > > > A set of 64, concurrent "burst mode" netperf omni RR tests (tcp) with a > > burst mode of 17 were run (ie 17 "transactions" outstanding on a > > connection at one time,) with TCP_NODELAY set and the results gathered, > > along with a set of Caliper profiles. The script used to launch these > > can be found in "runemomniagg2.sh.txt under the URL above. > > > > I picked an "RR" test to maximize the trips up and down the stack while > > minimizing the bandwidth consumed. > > > > I picked a burst size of 16 because that was sufficient to saturate a > > single core on the rx8640. > > > > I picked 64 concurrent netperfs because I wanted to make sure I had > > enough concurrent connections to get spread across all the cores/queues > > by the algorithms in place. 
> > > > I picked the combination of 64 and 16 rather than say 1024 and 0 (one > > tran at a time) because I didn't want to run a context switching > > benchmark :) > > > > The rx8640 was picked because it was available and I was confident it > > was not going to have any hardware scaling issues getting in the way. I > > wanted to see SW issues, not HW issues. I am ass-u-me-ing the rx8640 is > > a reasonable analog for any "decent or better scaling" 32 core hardware > > and that while there are ia64-specific routines present in the profiles, > > they are there for platform-independent reasons. > > > > The no_iptables/ data was run after a fresh boot, with no iptables > > commands run and so no iptables related modules loaded into the kernel. > > > > The empty_iptables/ data was run after an "iptables --list" command > > which loaded one or two modules into the kernel. > > > > The full_iptables/ data was run after an "iptables-restore" command > > pointed at full_iptables/iptables.txt which was created from what RH > > creates by default when one enables firewall via their installer, with a > > port range added by me to allow pretty much anything netperf would ask. > > As such, while it does excercise netfilter functionality, I cannot make > > any claims as to its "real world" applicability. (while the firewall > > settings came from an RH setup, FWIW, the base bits running on the > > rx8640 are Debian Lenny, with the net-next kernel on top) > > > > The "cycles" profile is able to grab flat profile hits while interrupts > > are disabled so it can see stuff happening while interrupts are > > disabled. The "scgprof" profile is an attempt to get some call graphs - > > it does not have visibility into code running with interrupts disabled. > > The "cache" profile is a profile that looks to get some cache miss > > information. > > > > So, having said all that, details can be found under the previously > > mentioned URL. Some quick highlights: > > > > no_iptables - ~22000 transactions/s/netperf. Top of the cycles profile > > looks like: > > > > Function Summary > > ----------------------------------------------------------------------- > > % Total > > IP Cumulat IP > > Samples % of Samples > > (ETB) Total (ETB) Function File > > ----------------------------------------------------------------------- > > 5.70 5.70 37772 s2io.ko::tx_intr_handler > > 5.14 10.84 34012 vmlinux::__ia64_readq > > 4.88 15.72 32285 s2io.ko::s2io_msix_ring_handle > > 4.63 20.34 30625 s2io.ko::rx_intr_handler > > 4.60 24.94 30429 s2io.ko::s2io_xmit > > 3.85 28.79 25488 s2io.ko::s2io_poll_msix > > 2.87 31.65 18987 vmlinux::dev_queue_xmit > > 2.51 34.16 16620 vmlinux::tcp_sendmsg > > 2.51 36.67 16588 vmlinux::tcp_ack > > 2.15 38.82 14221 vmlinux::__inet_lookup_established > > 2.10 40.92 13937 vmlinux::ia64_spinlock_contention > > > > empty_iptables - ~12000 transactions/s/netperf. 
> > full_iptables - some test instances didn't complete, I think they got
> > starved.  Of those which did complete, their performance ranged all the
> > way from 330 to 3100 transactions/s/netperf.  Top of the cycles profile
> > looks like:
> >
> > Function Summary
> > ------------------------------------------------------------------------
> > % Total IP      Cumulat      IP
> > Samples (ETB)   % of Total   Samples (ETB)   Function
> > ------------------------------------------------------------------------
> >   64.71          64.71        582171         vmlinux::_write_lock_bh
> >   18.43          83.14        165822         vmlinux::ia64_spinlock_contention
> >    2.86          85.99         25709         nf_conntrack.ko::init_module
> >    2.36          88.35         21194         nf_conntrack.ko::tcp_packet
> >    1.78          90.13         16009         vmlinux::_spin_lock_bh
> >    1.20          91.33         10810         nf_conntrack.ko::nf_conntrack_in
> >    1.20          92.52         10755         vmlinux::nf_iterate
> >    1.09          93.62          9833         vmlinux::default_idle
> >    0.26          93.88          2331         vmlinux::__ia64_readq
> >    0.25          94.12          2213         vmlinux::__interrupt
> >    0.24          94.37          2203         s2io.ko::tx_intr_handler
> >
> > Suggestions as to things to look at/with and/or patches to try are
> > welcome.  I should have the HW available to me for at least a little
> > while, but not indefinitely.
> >
> > rick jones
>
> Hi Rick, nice hardware you have :)
>
> Stephen had a patch to nuke read_lock() from iptables, using RCU and
> seqlocks.  I hit this contention point even with low-cost hardware and a
> quite standard application.
>
> I pinged him a few days ago to try to finish the job with him, but it
> seems Stephen is busy at the moment.
>
> Then conntrack (tcp sessions) is awful, since it uses a single rwlock_t
> tcp_lock that must be write_locked() for basically every handled tcp
> frame...
>
> How long is "not indefinitely"?

Does anyone have a fix for this bottleneck?

^ permalink raw reply	[flat|nested] 42+ messages in thread
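The read_lock() removal Eric refers to follows the usual kernel pattern for
read-mostly data: the per-packet hot path takes only an RCU read-side
critical section and dereferences the current ruleset pointer, while the
rare ruleset-replacement path publishes a new table and waits for readers
to drain; rule counters then need per-CPU storage or a seqlock to be
snapshotted consistently, which is the harder half.  The same reasoning
applies to the conntrack tcp_lock he mentions: any lock that must be
written for every frame becomes exactly the serialization point the
full_iptables profile shows (64.71% of samples in _write_lock_bh).  Below
is a minimal sketch of the RCU half of that pattern; the names
(struct pkt_table, cur_table, table_lock, pkt_table_lookup,
pkt_table_replace) are hypothetical, it is not Stephen's actual patch, and
it leaves the counter handling as a comment.

/* Sketch only: generic rwlock -> RCU conversion for a per-packet lookup
 * structure.  Names are hypothetical, not the real iptables ones. */
#include <linux/rcupdate.h>
#include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct pkt_table {
	unsigned int nrules;
	/* ... rules and per-CPU rule counters ... */
};

static struct pkt_table *cur_table;	/* current ruleset, published via RCU */
static DEFINE_SPINLOCK(table_lock);	/* serializes writers only */

/* Hot path, one call per packet.  rcu_read_lock() does not write any
 * shared cache line, so 32 cores no longer serialize on a single lock
 * word the way they do with read_lock_bh(). */
static unsigned int pkt_table_lookup(const struct sk_buff *skb)
{
	struct pkt_table *t;
	unsigned int verdict = 0;

	rcu_read_lock();
	t = rcu_dereference(cur_table);
	if (t) {
		/* ... walk t->rules against skb, bump per-CPU counters ... */
	}
	rcu_read_unlock();
	return verdict;
}

/* Cold path: swapping in a new ruleset (e.g. iptables-restore). */
static void pkt_table_replace(struct pkt_table *newt)
{
	struct pkt_table *old;

	spin_lock_bh(&table_lock);
	old = cur_table;
	rcu_assign_pointer(cur_table, newt);
	spin_unlock_bh(&table_lock);

	synchronize_rcu();	/* wait for in-flight lookups to finish */
	kfree(old);		/* old counters would be folded in here */
}

The trade-off is that replacing a table now costs a synchronize_rcu() plus
a counter-collection pass, which is acceptable for a path taken once per
iptables-restore but not once per packet.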
Thread overview: 42+ messages (newest: 2009-02-20 10:04 UTC)

2009-01-26 22:15 32 core net-next stack/netfilter "scaling" Rick Jones
2009-01-26 23:10 ` Eric Dumazet
2009-01-26 23:14 ` Stephen Hemminger
2009-01-26 23:19 ` Rick Jones
2009-01-27  9:10 ` Eric Dumazet
2009-01-27  9:15 ` Patrick McHardy
2009-01-27 11:29 ` Eric Dumazet
2009-01-27 11:37 ` Patrick McHardy
2009-01-27 16:23 ` Eric Dumazet
2009-01-27 17:33 ` Patrick McHardy
2009-01-27 18:02 ` Rick Jones
2009-01-27 19:09 ` Rick Jones
2009-01-27 19:24 ` Rick Jones
2009-01-27 22:17 ` Eric Dumazet
2009-01-27 22:29 ` Rick Jones
2009-01-27 22:34 ` Eric Dumazet
2009-01-27 22:43 ` Rick Jones
2009-01-28 13:55 ` Eric Dumazet
2009-01-28 16:25 ` Patrick McHardy
2009-01-28 17:07 ` Eric Dumazet
2009-01-28 17:34 ` Eric Dumazet
2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet
2009-01-30 15:47 ` Andi Kleen
2009-01-30 16:54 ` Eric Dumazet
2009-01-30 17:27 ` Andi Kleen
2009-01-30 17:27 ` Eric Dumazet
2009-01-30 17:50 ` Andi Kleen
2009-02-09 13:41 ` Patrick McHardy
2009-02-18 15:10 ` Eric Dumazet
2009-02-18 15:21 ` Patrick McHardy
2009-02-18 16:33 ` Eric Dumazet
2009-02-18 16:52 ` Patrick McHardy
2009-02-18 17:36 ` [PATCH] netfilter: xt_physdev fixes Eric Dumazet
2009-02-18 18:14 ` Patrick McHardy
2009-02-19  8:00 ` [PATCH] netfilter: unfold two loops in physdev_mt() Eric Dumazet
2009-02-19  8:14 ` [PATCH] netfilter: unfold two loops in ip6_packet_match() Eric Dumazet
2009-02-19 10:19 ` Patrick McHardy
2009-02-19 10:17 ` [PATCH] netfilter: unfold two loops in physdev_mt() Patrick McHardy
2009-02-20 10:02 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet
2009-02-20 10:04 ` Patrick McHardy
2009-02-09 14:57 ` 32 core net-next stack/netfilter "scaling" Patrick McHardy
2009-02-10 18:44 ` Stephen Hemminger