* ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-23 10:42 UTC
  To: netdev, avorontsov

Doing a "ping -f -l 3" on my host towards my board, on Linus' tree as of
Friday, results in lots of:

nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
__ratelimit: 11 callbacks suppressed
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.

This is with ucc_geth on an MPC832x.
This really looks strange to me. Ideas?

 Jocke
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Patrick McHardy @ 2009-03-23 12:15 UTC
  To: Joakim Tjernlund; +Cc: netdev, avorontsov

Joakim Tjernlund wrote:
> Doing a "ping -f -l 3" on my host towards my board, on Linus' tree as of
> Friday, results in lots of:
> nf_conntrack: table full, dropping packet.
> [...]
>
> This is with ucc_geth on an MPC832x.
> This really looks strange to me. Ideas?

What does /proc/net/netfilter/nf_conntrack show?
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-23 12:25 UTC
  To: Patrick McHardy; +Cc: avorontsov, netdev

Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:15:45:
> Joakim Tjernlund wrote:
> > Doing a "ping -f -l 3" on my host towards my board, on Linus' tree as of
> > Friday, results in lots of:
> > nf_conntrack: table full, dropping packet.
> > [...]
> >
> > This is with ucc_geth on an MPC832x.
> > This really looks strange to me. Ideas?
>
> What does /proc/net/netfilter/nf_conntrack show?

There is no /proc/net/netfilter/nf_conntrack. There is a
/proc/net/nf_conntrack though, and it is empty. If I telnet
to the board I see:

ipv4 2 tcp 6 431990 ESTABLISHED src=192.168.1.15 dst=192.168.1.17 sport=56445 dport=23 src=192.168.1.17 dst=192.168.1.15 sport=23 dport=56445 [ASSURED] use=1
ipv4 2 udp 17 7 src=192.168.1.228 dst=192.168.1.255 sport=138 dport=138 [UNREPLIED] src=192.168.1.255 dst=192.168.1.228 sport=138 dport=138 use=1
ipv4 2 udp 17 20 src=127.0.0.1 dst=127.0.0.1 sport=34261 dport=53 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=53 dport=34261 use=1
ipv4 2 udp 17 2 src=192.168.1.199 dst=192.168.1.255 sport=138 dport=138 [UNREPLIED] src=192.168.1.255 dst=192.168.1.199 sport=138 dport=138 use=1
ipv4 2 udp 17 20 src=127.0.0.1 dst=127.0.0.1 sport=40417 dport=53 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=53 dport=40417 use=1

 Jocke
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Patrick McHardy @ 2009-03-23 12:29 UTC
  To: Joakim Tjernlund; +Cc: avorontsov, netdev

Joakim Tjernlund wrote:
> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:15:45:
>> What does /proc/net/netfilter/nf_conntrack show?
>
> There is no /proc/net/netfilter/nf_conntrack. There is a
> /proc/net/nf_conntrack though, and it is empty. If I telnet
> to the board I see:

That means that something is leaking conntrack references, most likely
by leaking skbs. Since I haven't seen any other reports, my guess would
be the ucc_geth driver.
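[Note: the reference counting Patrick is describing works roughly as
follows; this is a paraphrased sketch of the 2.6.29-era helper, not a
verbatim copy. Every skb that conntrack has processed holds one reference
on its conntrack entry via skb->nfct, and the entry is only destroyed once
the last reference drops:

static inline void nf_conntrack_put(struct nf_conntrack *nfct)
{
	if (nfct && atomic_dec_and_test(&nfct->use))
		nf_conntrack_destroy(nfct);	/* last reference gone */
}

kfree_skb() drops that reference while releasing the skb's state, so an
skb that is leaked, i.e. never freed, pins its conntrack entry forever.
That is why leaked skbs would show up as a permanently full conntrack
table.]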
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-23 12:59 UTC
  To: Patrick McHardy; +Cc: avorontsov, netdev

Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33:
> Joakim Tjernlund wrote:
> > There is no /proc/net/netfilter/nf_conntrack. There is a
> > /proc/net/nf_conntrack though, and it is empty. If I telnet
> > to the board I see:
>
> That means that something is leaking conntrack references, most likely
> by leaking skbs. Since I haven't seen any other reports, my guess would
> be the ucc_geth driver.

Hmm, I cannot see what in the ucc_geth driver could possibly be "leaking".
One thing I do notice is that the board becomes almost unresponsive during
the ping flood. Perhaps it is building up a backlog of conntracks during
the ping flood?

 Jocke
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-23 13:09 UTC
  To: leoli; +Cc: avorontsov, Patrick McHardy, netdev

Joakim Tjernlund/Transmode wrote on 23/03/2009 13:59:15:
> Hmm, I cannot see what in the ucc_geth driver could possibly be "leaking".
> One thing I do notice is that the board becomes almost unresponsive during
> the ping flood. Perhaps it is building up a backlog of conntracks during
> the ping flood?

Hmm, ethtool -S eth0 shows:
...
rx-mismatch-drop-frames: 13
...

13 matches the number of lost frames after I stop the ping flood.
The MPC832x manual says this (if I found the correct counter):

  REBASE+2C MisMatchDrop 32
  Counts number of frames dropped due to MAC filtering process, (e.g.
  Address Mismatch, Type mismatch) and that would otherwise considered
  good frame that would be transferred to upper layers

Why would a ping flood result in an Address Mismatch or Type mismatch?
Leo?

 Jocke
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-23 17:42 UTC
  To: Patrick McHardy; +Cc: avorontsov, netdev

Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33:
> That means that something is leaking conntrack references, most likely
> by leaking skbs. Since I haven't seen any other reports, my guess would
> be the ucc_geth driver.

Mucking around with the ucc_geth driver, I found that if I:
 - move TX from IRQ to NAPI context,
 - double the NAPI weight, and
 - after booting up, wait a few minutes until the JFFS2 GC kernel thread
   has stopped scanning the FS,

then the "nf_conntrack: table full, dropping packet." msgs stop.
Does this seem right to you guys?

 Jocke
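[Note: "move TX from IRQ to NAPI context" means doing the TX-completion
cleanup in the driver's NAPI poll callback instead of in the hard
interrupt handler. A generic sketch of that pattern, with hypothetical
foo_* helpers standing in for the driver's own ring routines; this is not
the actual ucc_geth change:

/* hard IRQ: only mask device interrupts and schedule polling */
static irqreturn_t foo_interrupt(int irq, void *dev_id)
{
	struct foo_priv *priv = dev_id;

	foo_mask_irqs(priv);		/* hypothetical helper */
	napi_schedule(&priv->napi);
	return IRQ_HANDLED;
}

/* softirq: reclaim completed TX skbs, then process RX within budget */
static int foo_poll(struct napi_struct *napi, int budget)
{
	struct foo_priv *priv = container_of(napi, struct foo_priv, napi);
	int work_done;

	foo_clean_tx_ring(priv);	/* hypothetical: free sent skbs */
	work_done = foo_clean_rx_ring(priv, budget);
	if (work_done < budget) {
		napi_complete(napi);
		foo_unmask_irqs(priv);	/* hypothetical helper */
	}
	return work_done;
}

Doubling the weight simply lets one poll call process more packets before
yielding the CPU.]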
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Patrick McHardy @ 2009-03-23 17:49 UTC
  To: Joakim Tjernlund; +Cc: avorontsov, netdev

Joakim Tjernlund wrote:
> Mucking around with the ucc_geth driver, I found that if I:
> - move TX from IRQ to NAPI context,
> - double the NAPI weight, and
> - after booting up, wait a few minutes until the JFFS2 GC kernel thread
>   has stopped scanning the FS,
>
> then the "nf_conntrack: table full, dropping packet." msgs stop.
> Does this seem right to you guys?

No. As I said, something seems to be leaking packets. You should be
able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
If that *doesn't* show any signs of a leak, please run "conntrack -E"
to capture the conntrack events before the "table full" message
appears and post the output.
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-24 8:22 UTC
  To: Patrick McHardy; +Cc: avorontsov, netdev

Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 18:49:15:
> No. As I said, something seems to be leaking packets. You should be
> able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
> If that *doesn't* show any signs of a leak, please run "conntrack -E"
> to capture the conntrack events before the "table full" message
> appears and post the output.

The skbuff caches do not differ much, but others do.

Before ping:
skbuff_fclone_cache    0    0  352 11 1 : tunables  54 27 0 : slabdata   0   0 0
skbuff_head_cache     20   20  192 20 1 : tunables 120 60 0 : slabdata   1   1 0
size-64              731  767   64 59 1 : tunables 120 60 0 : slabdata  13  13 0
nf_conntrack          10   19  208 19 1 : tunables 120 60 0 : slabdata   1   1 0

During ping:
skbuff_fclone_cache    0    0  352 11 1 : tunables  54 27 0 : slabdata   0   0 0
skbuff_head_cache     40   40  192 20 1 : tunables 120 60 0 : slabdata   2   2 0
size-64             8909 8909   64 59 1 : tunables 120 60 0 : slabdata 151 151 0
nf_conntrack        5111 5111  208 19 1 : tunables 120 60 0 : slabdata 269 269 0

This feels more like the freeing of conntrack objects being delayed,
building up while ping flooding.

I don't have "conntrack -E" for my embedded board, so that will have to
wait a bit longer.

 Jocke
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Eric Dumazet @ 2009-03-24 9:12 UTC
  To: Joakim Tjernlund; +Cc: Patrick McHardy, avorontsov, netdev, Paul E. McKenney

Joakim Tjernlund wrote:
> The skbuff caches do not differ much, but others do.
>
> [slabinfo dumps snipped]
>
> This feels more like the freeing of conntrack objects being delayed,
> building up while ping flooding.

I don't understand how your ping can use so many conntrack entries...

Then, as I said yesterday, I believe you have an RCU delay, because of
a misbehaving driver or something...

grep RCU .config
grep CONFIG_SMP .config

You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 80)
as a workaround. It should force a quiescent state after 1000 freed
conntracks.
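[Note: for readers without the source at hand, the knobs Eric refers to
sit near the top of kernel/rcuclassic.c in this kernel; roughly, and
paraphrased rather than quoted verbatim:

static int blimit = 10;		/* callbacks freed per softirq batch */
static int qhimark = 10000;	/* queue length that triggers flush mode */
static int qlowmark = 100;	/* queue length to resume normal batching */

/* in the call_rcu() path, per CPU: */
if (unlikely(++rdp->qlen > qhimark)) {
	rdp->blimit = INT_MAX;	/* lift the per-batch limit */
	force_quiescent_state(rdp, &rcu_ctrlblk);
}

The suggested workaround is simply lowering the 10000 initializer to 1000,
so at most about 1000 dead conntracks can queue up per CPU before RCU
flushes them.]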
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-24 10:55 UTC
  To: Eric Dumazet; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney

Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53:
> I don't understand how your ping can use so many conntrack entries...
>
> Then, as I said yesterday, I believe you have an RCU delay, because of
> a misbehaving driver or something...
>
> grep RCU .config

grep RCU .config
# RCU Subsystem
CONFIG_CLASSIC_RCU=y
# CONFIG_TREE_RCU is not set
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set

> grep CONFIG_SMP .config

grep CONFIG_SMP .config
# CONFIG_SMP is not set

> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 80)
> as a workaround. It should force a quiescent state after 1000 freed
> conntracks.

Right, doing this almost killed all the conntrack messages; I had to
stress it pretty hard before I saw a handful of "nf_conntrack: table full,
dropping packet" messages.

RCU is not my cup of tea. Do you have any ideas where to look?

 Jocke
* [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Eric Dumazet @ 2009-03-24 12:07 UTC
  To: Joakim Tjernlund; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney

Joakim Tjernlund wrote:
> Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53:
>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c
>> (line 80) as a workaround. It should force a quiescent state after
>> 1000 freed conntracks.
>
> Right, doing this almost killed all the conntrack messages; I had to
> stress it pretty hard before I saw a handful of "nf_conntrack: table
> full, dropping packet" messages.
>
> RCU is not my cup of tea. Do you have any ideas where to look?

In a stress situation, you feed more deleted conntracks to call_rcu() than
the blimit (10 real freeings per RCU softirq invocation).

So with the default qhimark of 10000, about 10000 conntracks can sit in
RCU queues (per CPU) before being really freed.

Only when hitting 10000 does RCU enter a special mode that frees all
queued items, instead of a small batch of 10.

To solve your problem we can:

1) reduce qhimark from 10000 to 1000 (for example).
   Probably should be done anyway, to reduce some spikes in the RCU code
   when freeing a whole batch of 10000 elements...
OR
2) change the conntrack tunable (max conntrack entries on your machine).
OR
3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
   in nf_conntrack_free() instead of in the RCU callback.

[PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()

We use RCU to defer freeing of conntrack structures. In a DOS situation,
RCU might accumulate about 10,000 elements per CPU in its internal queues.
To get accurate conntrack counts (at the expense of slightly more RAM
used), we might consider the conntrack counter as not taking into account
the "about to be freed" elements waiting in RCU queues. We thus decrement
it in nf_conntrack_free(), not in the RCU callback.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f4935e3..6478dc7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -516,16 +516,17 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 static void nf_conntrack_free_rcu(struct rcu_head *head)
 {
 	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
-	struct net *net = nf_ct_net(ct);
 
 	nf_ct_ext_free(ct);
 	kmem_cache_free(nf_conntrack_cachep, ct);
-	atomic_dec(&net->ct.count);
 }
 
 void nf_conntrack_free(struct nf_conn *ct)
 {
+	struct net *net = nf_ct_net(ct);
+
 	nf_ct_ext_destroy(ct);
+	atomic_dec(&net->ct.count);
 	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_free);
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Eric Dumazet @ 2009-03-24 12:25 UTC
  To: Joakim Tjernlund; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney

Eric Dumazet wrote:
> [quoted context snipped]
>
>  void nf_conntrack_free(struct nf_conn *ct)
>  {
> +	struct net *net = nf_ct_net(ct);
> +
>  	nf_ct_ext_destroy(ct);
> +	atomic_dec(&net->ct.count);
>  	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_free);

I forgot to say this is what we do for 'struct file' freeing as well. We
decrement nr_files in file_free(), not in file_free_rcu():

static inline void file_free_rcu(struct rcu_head *head)
{
	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);

	put_cred(f->f_cred);
	kmem_cache_free(filp_cachep, f);
}

static inline void file_free(struct file *f)
{
	percpu_counter_dec(&nr_files);   <<<< HERE >>>>
	file_check_state(f);
	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Patrick McHardy @ 2009-03-24 12:43 UTC
  To: Eric Dumazet; +Cc: Joakim Tjernlund, avorontsov, netdev, Paul E. McKenney

Eric Dumazet wrote:
> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
>
> We use RCU to defer freeing of conntrack structures. In a DOS situation,
> RCU might accumulate about 10,000 elements per CPU in its internal
> queues. [...]
>
> I forgot to say this is what we do for 'struct file' freeing as well. We
> decrement nr_files in file_free(), not in file_free_rcu()

While temporarily exceeding the limit by up to 10000 entries is
quite a lot, I guess the important thing is that it can't grow
unbounded, so I think this patch is fine.
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Eric Dumazet @ 2009-03-24 13:32 UTC
  To: Patrick McHardy; +Cc: Joakim Tjernlund, avorontsov, netdev, Paul E. McKenney

Patrick McHardy wrote:
> Eric Dumazet wrote:
>> I forgot to say this is what we do for 'struct file' freeing as well. We
>> decrement nr_files in file_free(), not in file_free_rcu()
>
> While temporarily exceeding the limit by up to 10000 entries is
> quite a lot, I guess the important thing is that it can't grow
> unbounded, so I think this patch is fine.

Maybe we could use the SLAB_DESTROY_BY_RCU thing and have no more
call_rcu() queueing problem. That would make better use of CPU caches
as well...
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Patrick McHardy @ 2009-03-24 13:38 UTC
  To: Eric Dumazet; +Cc: Joakim Tjernlund, avorontsov, netdev, Paul E. McKenney

Eric Dumazet wrote:
> Maybe we could use the SLAB_DESTROY_BY_RCU thing and have no more
> call_rcu() queueing problem. That would make better use of CPU caches
> as well...

I'm not sure I understand the rules correctly, but we'd still
have to wait for the grace period before an object can be reused,
no?
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Eric Dumazet @ 2009-03-24 13:47 UTC
  To: Patrick McHardy; +Cc: Joakim Tjernlund, avorontsov, netdev, Paul E. McKenney

Patrick McHardy wrote:
> I'm not sure I understand the rules correctly, but we'd still
> have to wait for the grace period before an object can be reused,
> no?

No, we don't have to, but we must do additional checks after getting a
reference on an object found during lookup. (We must re-check the keys
used during the search.)

This re-check is not very expensive since everything is hot in the CPU
cache.

Check Documentation/RCU/rculist_nulls.txt for some documentation.

1) Lookup algo
--------------

rcu_read_lock()
begin:
obj = lockless_lookup(key);
if (obj) {
	if (!try_get_ref(obj)) // might fail for free objects
		goto begin;
	/*
	 * Because a writer could delete the object, and a writer could
	 * reuse this object before the RCU grace period, we
	 * must check the key after getting the reference on the object
	 */
	if (obj->key != key) { // not the object we expected
		put_ref(obj);
		goto begin;
	}
}
rcu_read_unlock();
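[Note: the try_get_ref() in the sketch above is where SLAB_DESTROY_BY_RCU
differs from plain RCU freeing. The slab page is never handed back to the
page allocator while readers may still hold RCU references, so a
concurrent lookup can always safely read the object's refcount, but the
object itself may already have been freed and reused for a new entry. In
conntrack terms the conditional grab would look roughly like this (a
sketch, not the final patch):

	/* fails (returns 0) if we raced with the entry being freed */
	if (!atomic_inc_not_zero(&ct->ct_general.use))
		goto begin;	/* object was dead; restart lookup */

	/* the memory may now hold a *different* conntrack: re-check key */
	if (!nf_ct_tuple_equal(tuple, &h->tuple)) {
		nf_ct_put(ct);
		goto begin;
	}

This is exactly the re-check that appears in nf_conntrack_find_get() in
the patch further down the thread.]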
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Eric Dumazet @ 2009-03-25 3:53 UTC
  To: Patrick McHardy; +Cc: Netfilter Developers, Linux Netdev List

Patrick McHardy wrote:
> Eric Dumazet wrote:
>> Patrick, I can work on this if you want, since this stuff is fresh
>> in my head; let me know if you already challenged it or not :)
>
> I'm still looking at the details; if you want to take care of this,
> that would be great :)

I have a little problem with __nf_conntrack_find() being exported.

The problem is that with SLAB_DESTROY_BY_RCU we must take a reference on
the object to recheck it. So ideally only nf_conntrack_find_get() should
be used, or callers of __nf_conntrack_find() should lock nf_conntrack_lock
(as is properly done, for example, in net/netfilter/nf_conntrack_netlink.c,
line 1292).

Here is a preliminary patch for review (not tested at all, it's 4h50 am
here :) ).

Could you help me by checking the __nf_conntrack_find() use in
net/netfilter/xt_connlimit.c, and line 1246 of
net/netfilter/nf_conntrack_netlink.c? This part is a little bit gray for
me. :)

Thank you.

(Patch against the nf-next-2.6 tree, of course.)

Eric

 include/net/netfilter/nf_conntrack.h                  |   14 -
 include/net/netfilter/nf_conntrack_tuple.h            |    6 
 include/net/netns/conntrack.h                         |    5 
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c |   16 -
 net/ipv4/netfilter/nf_nat_core.c                      |    2 
 net/netfilter/nf_conntrack_core.c                     |  121 +++++-----
 net/netfilter/nf_conntrack_expect.c                   |    2 
 net/netfilter/nf_conntrack_helper.c                   |    7 
 net/netfilter/nf_conntrack_netlink.c                  |    7 
 net/netfilter/nf_conntrack_standalone.c               |   16 -
 10 files changed, 109 insertions(+), 87 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 4dfb793..6c3f964 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -91,8 +91,7 @@ struct nf_conn_help {
 #include <net/netfilter/ipv4/nf_conntrack_ipv4.h>
 #include <net/netfilter/ipv6/nf_conntrack_ipv6.h>
 
-struct nf_conn
-{
+struct nf_conn {
 	/* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
 	   plus 1 for any connection(s) we are `master' for */
 	struct nf_conntrack ct_general;
@@ -126,7 +125,6 @@ struct nf_conn
 #ifdef CONFIG_NET_NS
 	struct net *ct_net;
 #endif
-	struct rcu_head rcu;
 };
 
 static inline struct nf_conn *
@@ -190,9 +188,13 @@ static inline void nf_ct_put(struct nf_conn *ct)
 extern int nf_ct_l3proto_try_module_get(unsigned short l3proto);
 extern void nf_ct_l3proto_module_put(unsigned short l3proto);
 
-extern struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced);
-extern void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced,
-				 unsigned int size);
+/*
+ * Allocate a hashtable of hlist_head (if nulls == 0),
+ * or hlist_nulls_head (if nulls == 1)
+ */
+extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls);
+
+extern void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size);
 
 extern struct nf_conntrack_tuple_hash *
 __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple);
diff --git a/include/net/netfilter/nf_conntrack_tuple.h b/include/net/netfilter/nf_conntrack_tuple.h
index f2f6aa7..2628c15 100644
--- a/include/net/netfilter/nf_conntrack_tuple.h
+++ b/include/net/netfilter/nf_conntrack_tuple.h
@@ -12,6 +12,7 @@
 
 #include <linux/netfilter/x_tables.h>
 #include <linux/netfilter/nf_conntrack_tuple_common.h>
+#include <linux/list_nulls.h>
 
 /* A `tuple' is a structure containing the information to uniquely
   identify a connection.  ie. if two packets have the same tuple, they
@@ -146,9 +147,8 @@ static inline void nf_ct_dump_tuple(const struct nf_conntrack_tuple *t)
 	((enum ip_conntrack_dir)(h)->tuple.dst.dir)
 
 /* Connections have two entries in the hash table: one for each way */
-struct nf_conntrack_tuple_hash
-{
-	struct hlist_node hnode;
+struct nf_conntrack_tuple_hash {
+	struct hlist_nulls_node hnnode;
 	struct nf_conntrack_tuple tuple;
 };
 
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index f4498a6..9dc5840 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -2,6 +2,7 @@
 #define __NETNS_CONNTRACK_H
 
 #include <linux/list.h>
+#include <linux/list_nulls.h>
 #include <asm/atomic.h>
 
 struct ctl_table_header;
@@ -10,9 +11,9 @@ struct nf_conntrack_ecache;
 struct netns_ct {
 	atomic_t		count;
 	unsigned int		expect_count;
-	struct hlist_head	*hash;
+	struct hlist_nulls_head	*hash;
 	struct hlist_head	*expect_hash;
-	struct hlist_head	unconfirmed;
+	struct hlist_nulls_head	unconfirmed;
 	struct ip_conntrack_stat *stat;
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
 	struct nf_conntrack_ecache *ecache;
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 6ba5c55..fcbcf62 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -25,30 +25,30 @@ struct ct_iter_state {
 	unsigned int bucket;
 };
 
-static struct hlist_node *ct_get_first(struct seq_file *seq)
+static struct hlist_nulls_node *ct_get_first(struct seq_file *seq)
 {
 	struct net *net = seq_file_net(seq);
 	struct ct_iter_state *st = seq->private;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 
 	for (st->bucket = 0;
 	     st->bucket < nf_conntrack_htable_size;
 	     st->bucket++) {
 		n = rcu_dereference(net->ct.hash[st->bucket].first);
-		if (n)
+		if (!is_a_nulls(n))
 			return n;
 	}
 	return NULL;
 }
 
-static struct hlist_node *ct_get_next(struct seq_file *seq,
-				      struct hlist_node *head)
+static struct hlist_nulls_node *ct_get_next(struct seq_file *seq,
+				      struct hlist_nulls_node *head)
 {
 	struct net *net = seq_file_net(seq);
 	struct ct_iter_state *st = seq->private;
 
 	head = rcu_dereference(head->next);
-	while (head == NULL) {
+	while (is_a_nulls(head)) {
 		if (++st->bucket >= nf_conntrack_htable_size)
 			return NULL;
 		head = rcu_dereference(net->ct.hash[st->bucket].first);
@@ -56,9 +56,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq,
 	return head;
 }
 
-static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos)
+static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos)
 {
-	struct hlist_node *head = ct_get_first(seq);
+	struct hlist_nulls_node *head = ct_get_first(seq);
 
 	if (head)
 		while (pos && (head = ct_get_next(seq, head)))
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index a65cf69..fe65187 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -679,7 +679,7 @@ nfnetlink_parse_nat_setup(struct nf_conn *ct,
 static int __net_init nf_nat_net_init(struct net *net)
 {
 	net->ipv4.nat_bysource = nf_ct_alloc_hashtable(&nf_nat_htable_size,
-						       &net->ipv4.nat_vmalloced);
+						       &net->ipv4.nat_vmalloced, 0);
 	if (!net->ipv4.nat_bysource)
 		return -ENOMEM;
 	return 0;
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 55befe5..9f714e9 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -29,6 +29,7 @@
 #include <linux/netdevice.h>
 #include <linux/socket.h>
 #include <linux/mm.h>
+#include <linux/rculist_nulls.h>
 
 #include <net/netfilter/nf_conntrack.h>
 #include <net/netfilter/nf_conntrack_l3proto.h>
@@ -163,8 +164,8 @@ static void clean_from_lists(struct nf_conn *ct)
 {
 	pr_debug("clean_from_lists(%p)\n", ct);
-	hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode);
-	hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode);
+	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
+	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode);
 
 	/* Destroy all pending expectations */
 	nf_ct_remove_expectations(ct);
@@ -204,8 +205,8 @@ destroy_conntrack(struct nf_conntrack *nfct)
 
 	/* We overload first tuple to link into unconfirmed list. */
 	if (!nf_ct_is_confirmed(ct)) {
-		BUG_ON(hlist_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode));
-		hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode);
+		BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
+		hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
 	}
 
 	NF_CT_STAT_INC(net, delete);
@@ -242,18 +243,26 @@ static void death_by_timeout(unsigned long ul_conntrack)
 	nf_ct_put(ct);
 }
 
+/*
+ * Warning :
+ * - Caller must take a reference on returned object
+ *   and recheck nf_ct_tuple_equal(tuple, &h->tuple)
+ * OR
+ * - Caller must lock nf_conntrack_lock before calling this function
+ */
 struct nf_conntrack_tuple_hash *
 __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_tuple_hash *h;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 	unsigned int hash = hash_conntrack(tuple);
 
 	/* Disable BHs the entire time since we normally need to disable them
 	 * at least once for the stats anyway.
 	 */
 	local_bh_disable();
-	hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) {
+begin:
+	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
 		if (nf_ct_tuple_equal(tuple, &h->tuple)) {
 			NF_CT_STAT_INC(net, found);
 			local_bh_enable();
@@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
 		}
 		NF_CT_STAT_INC(net, searched);
 	}
+	/*
+	 * if the nulls value we got at the end of this lookup is
+	 * not the expected one, we must restart lookup.
+	 * We probably met an item that was moved to another chain.
+	 */
+	if (get_nulls_value(n) != hash)
+		goto begin;
 	local_bh_enable();
 
 	return NULL;
@@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple)
 	struct nf_conn *ct;
 
 	rcu_read_lock();
+begin:
 	h = __nf_conntrack_find(net, tuple);
 	if (h) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
+		else {
+			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) {
+				nf_ct_put(ct);
+				goto begin;
+			}
+		}
 	}
 	rcu_read_unlock();
 
@@ -293,9 +316,9 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct,
 {
 	struct net *net = nf_ct_net(ct);
 
-	hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode,
+	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
 			   &net->ct.hash[hash]);
-	hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode,
+	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
 			   &net->ct.hash[repl_hash]);
 }
 
@@ -318,7 +341,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
 	struct nf_conn_help *help;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 	enum ip_conntrack_info ctinfo;
 	struct net *net;
 
@@ -350,17 +373,17 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	/* See if there's one in the list already, including reverse:
 	   NAT could have grabbed it without realizing, since we're
 	   not in the hash.  If there is, we lost race. */
-	hlist_for_each_entry(h, n, &net->ct.hash[hash], hnode)
+	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
 		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
 				      &h->tuple))
 			goto out;
-	hlist_for_each_entry(h, n, &net->ct.hash[repl_hash], hnode)
+	hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode)
 		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple,
 				      &h->tuple))
 			goto out;
 
 	/* Remove from unconfirmed list */
-	hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode);
+	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
 
 	__nf_conntrack_hash_insert(ct, hash, repl_hash);
 	/* Timer relative to confirmation time, not original
@@ -399,14 +422,14 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 {
 	struct net *net = nf_ct_net(ignored_conntrack);
 	struct nf_conntrack_tuple_hash *h;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 	unsigned int hash = hash_conntrack(tuple);
 
 	/* Disable BHs the entire time since we need to disable them at
 	 * least once for the stats anyway.
 	 */
 	rcu_read_lock_bh();
-	hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) {
+	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
 		if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack &&
 		    nf_ct_tuple_equal(tuple, &h->tuple)) {
 			NF_CT_STAT_INC(net, found);
@@ -430,14 +453,14 @@ static noinline int early_drop(struct net *net, unsigned int hash)
 	/* Use oldest entry, which is roughly LRU */
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct = NULL, *tmp;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 	unsigned int i, cnt = 0;
 	int dropped = 0;
 
 	rcu_read_lock();
 	for (i = 0; i < nf_conntrack_htable_size; i++) {
-		hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash],
-					 hnode) {
+		hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash],
+					 hnnode) {
 			tmp = nf_ct_tuplehash_to_ctrack(h);
 			if (!test_bit(IPS_ASSURED_BIT, &tmp->status))
 				ct = tmp;
@@ -508,27 +531,19 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 #ifdef CONFIG_NET_NS
 	ct->ct_net = net;
 #endif
-	INIT_RCU_HEAD(&ct->rcu);
 
 	return ct;
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 
-static void nf_conntrack_free_rcu(struct rcu_head *head)
-{
-	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
-
-	nf_ct_ext_free(ct);
-	kmem_cache_free(nf_conntrack_cachep, ct);
-}
-
 void nf_conntrack_free(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 
 	nf_ct_ext_destroy(ct);
 	atomic_dec(&net->ct.count);
-	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
+	nf_ct_ext_free(ct);
+	kmem_cache_free(nf_conntrack_cachep, ct);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_free);
 
@@ -594,7 +609,7 @@ init_conntrack(struct net *net,
 	}
 
 	/* Overload tuple linked list to put us in unconfirmed list. */
-	hlist_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode,
+	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
 		       &net->ct.unconfirmed);
 
 	spin_unlock_bh(&nf_conntrack_lock);
@@ -934,17 +949,17 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 {
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 
 	spin_lock_bh(&nf_conntrack_lock);
 	for (; *bucket < nf_conntrack_htable_size; (*bucket)++) {
-		hlist_for_each_entry(h, n, &net->ct.hash[*bucket], hnode) {
+		hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) {
 			ct = nf_ct_tuplehash_to_ctrack(h);
 			if (iter(ct, data))
 				goto found;
 		}
 	}
-	hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) {
+	hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (iter(ct, data))
 			set_bit(IPS_DYING_BIT, &ct->status);
@@ -992,7 +1007,7 @@ static int kill_all(struct nf_conn *i, void *data)
 	return 1;
 }
 
-void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, unsigned int size)
+void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size)
 {
 	if (vmalloced)
 		vfree(hash);
@@ -1060,26 +1075,28 @@ void nf_conntrack_cleanup(struct net *net)
 	}
 }
 
-struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced)
+void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls)
 {
-	struct hlist_head *hash;
-	unsigned int size, i;
+	struct hlist_nulls_head *hash;
+	unsigned int nr_slots, i;
+	size_t sz;
 
 	*vmalloced = 0;
 
-	size = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_head));
-	hash = (void*)__get_free_pages(GFP_KERNEL|__GFP_NOWARN,
-				       get_order(sizeof(struct hlist_head)
-						 * size));
+	BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct hlist_head));
+	nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_nulls_head));
+	sz = nr_slots * sizeof(struct hlist_nulls_head);
+	hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO,
+					get_order(sz));
 	if (!hash) {
 		*vmalloced = 1;
 		printk(KERN_WARNING "nf_conntrack: falling back to vmalloc.\n");
-		hash = vmalloc(sizeof(struct hlist_head) * size);
+		hash = __vmalloc(sz, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
 	}
 
-	if (hash)
-		for (i = 0; i < size; i++)
-			INIT_HLIST_HEAD(&hash[i]);
+	if (hash && nulls)
+		for (i = 0; i < nr_slots; i++)
+			INIT_HLIST_NULLS_HEAD(&hash[i], i);
 
 	return hash;
 }
@@ -1090,7 +1107,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	int i, bucket, vmalloced, old_vmalloced;
 	unsigned int hashsize, old_size;
 	int rnd;
-	struct hlist_head *hash, *old_hash;
+	struct hlist_nulls_head *hash, *old_hash;
 	struct nf_conntrack_tuple_hash *h;
 
 	/* On boot, we can set this without any fancy locking. */
@@ -1101,7 +1118,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	if (!hashsize)
 		return -EINVAL;
 
-	hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced);
+	hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced, 1);
 	if (!hash)
 		return -ENOMEM;
 
@@ -1116,12 +1133,12 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	 */
 	spin_lock_bh(&nf_conntrack_lock);
 	for (i = 0; i < nf_conntrack_htable_size; i++) {
-		while (!hlist_empty(&init_net.ct.hash[i])) {
-			h = hlist_entry(init_net.ct.hash[i].first,
-					struct nf_conntrack_tuple_hash, hnode);
-			hlist_del_rcu(&h->hnode);
+		while (!hlist_nulls_empty(&init_net.ct.hash[i])) {
+			h = hlist_nulls_entry(init_net.ct.hash[i].first,
+					struct nf_conntrack_tuple_hash, hnnode);
+			hlist_nulls_del_rcu(&h->hnnode);
 			bucket = __hash_conntrack(&h->tuple, hashsize, rnd);
-			hlist_add_head(&h->hnode, &hash[bucket]);
+			hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]);
 		}
 	}
 	old_size = nf_conntrack_htable_size;
@@ -1172,7 +1189,7 @@ static int nf_conntrack_init_init_net(void)
 
 	nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
 						sizeof(struct nf_conn),
-						0, 0, NULL);
+						0, SLAB_DESTROY_BY_RCU, NULL);
 	if (!nf_conntrack_cachep) {
 		printk(KERN_ERR "Unable to create nf_conn slab cache\n");
 		ret = -ENOMEM;
@@ -1212,7 +1229,7 @@ static int nf_conntrack_init_net(struct net *net)
 	if (ret < 0)
 		goto err_ecache;
 	net->ct.hash = nf_ct_alloc_hashtable(&nf_conntrack_htable_size,
-					     &net->ct.hash_vmalloc);
+					     &net->ct.hash_vmalloc, 1);
 	if (!net->ct.hash) {
 		ret = -ENOMEM;
 		printk(KERN_ERR "Unable to create nf_conntrack_hash\n");
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index 357ba39..3940f99 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -604,7 +604,7 @@ int nf_conntrack_expect_init(struct net *net)
 
 	net->ct.expect_count = 0;
 	net->ct.expect_hash = nf_ct_alloc_hashtable(&nf_ct_expect_hsize,
-						    &net->ct.expect_vmalloc);
+						    &net->ct.expect_vmalloc, 0);
 	if (net->ct.expect_hash == NULL)
 		goto err1;
 
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index a51bdac..6066144 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -158,6 +158,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conntrack_expect *exp;
 	const struct hlist_node *n, *next;
+	const struct hlist_nulls_node *nn;
 	unsigned int i;
 
 	/* Get rid of expectations */
@@ -174,10 +175,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 	}
 
 	/* Get rid of 
expecteds, set helpers to NULL. */ - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) + hlist_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) unhelp(h, me); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry(h, n, &net->ct.hash[i], hnode) + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) unhelp(h, me); } } @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, - &nf_ct_helper_vmalloc); + &nf_ct_helper_vmalloc, 0); if (!nf_ct_helper_hash) return -ENOMEM; diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 1b75c9e..6520c94 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -19,6 +19,7 @@ #include <linux/module.h> #include <linux/kernel.h> #include <linux/rculist.h> +#include <linux/rculist_nulls.h> #include <linux/types.h> #include <linux/timer.h> #include <linux/skbuff.h> @@ -536,7 +537,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) { struct nf_conn *ct, *last; struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = NLMSG_DATA(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; @@ -544,8 +545,8 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) last = (struct nf_conn *)cb->args[1]; for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) { restart: - hlist_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], + hnnode) { if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) continue; ct = nf_ct_tuplehash_to_ctrack(h); diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 4da54b0..f768368 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -44,30 +44,30 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { + while (is_a_nulls(head)) { if (++st->bucket >= nf_conntrack_htable_size) return NULL; head = rcu_dereference(net->ct.hash[st->bucket].first); @@ -75,9 +75,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at 
http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 44+ messages in thread
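The "nulls" trick the conversion above relies on is small enough to show in isolation. A minimal userspace sketch of the end-marker encoding, mirroring include/linux/list_nulls.h (simplified, no RCU; illustrative only):

	#include <stdio.h>

	struct hlist_nulls_node { struct hlist_nulls_node *next; };
	struct hlist_nulls_head { struct hlist_nulls_node *first; };

	/* An odd value can never be a real node address, so bit 0 marks
	 * "end of chain" and the remaining bits carry the bucket number. */
	#define NULLS_MARKER(v) \
		((struct hlist_nulls_node *)(1UL | (((unsigned long)(v)) << 1)))

	static int is_a_nulls(const struct hlist_nulls_node *p)
	{
		return (unsigned long)p & 1;
	}

	static unsigned long get_nulls_value(const struct hlist_nulls_node *p)
	{
		return (unsigned long)p >> 1;
	}

	int main(void)
	{
		struct hlist_nulls_head bucket = { .first = NULLS_MARKER(42) };

		/* A lookup that runs off the end of a chain learns *which*
		 * bucket's end it reached. If it is not the bucket the
		 * lookup started in, an entry was recycled into another
		 * chain while we walked and the lookup must restart --
		 * exactly the "if (get_nulls_value(n) != hash) goto begin;"
		 * test in the patch above. */
		if (is_a_nulls(bucket.first))
			printf("end of chain %lu\n", get_nulls_value(bucket.first));
		return 0;
	}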
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-25 3:53 ` Eric Dumazet @ 2009-03-25 13:39 ` Patrick McHardy 2009-03-25 13:44 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 13:39 UTC (permalink / raw) To: Eric Dumazet; +Cc: Netfilter Developers, Linux Netdev List Eric Dumazet wrote: > I have a litle problem on __nf_conntrack_find() being exported. > > Problem is that with SLAB_DESTROY_BY_RCU we must take a reference on object > to recheck it. So ideally only nf_conntrack_find_get() should be used, > or callers of __nf_conntrack_find() should lock nf_conntrack_lock > (as properly done for example in net/netfilter/nf_conntrack_netlink.c, line 1292) > > Here is preliminary patch for review (not tested at all, its 4h50 am here :) ) > > Could you help me, by checking __nf_conntrack_find() use in net/netfilter/xt_connlimit.c ? > and line 1246 of net/netfilter/nf_conntrack_netlink.c > > This part is a litle bit gray for me. :) In case of xt_connlimit, it seems fine to just take a reference. In case of ctnetlink, keeping the unreferenced lookup under the lock seems fine. We unfortunately have to export some internals like nf_conntrack lock for ctnetlink anyways, so I don't think it would be worth to change it to take references and unexport the lookup function. > +/* > + * Warning : > + * - Caller must take a reference on returned object > + * and recheck nf_ct_tuple_equal(tuple, &h->tuple) > + * OR > + * - Caller must lock nf_conntrack_lock before calling this function > + */ > struct nf_conntrack_tuple_hash * > __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) > { > struct nf_conntrack_tuple_hash *h; > - struct hlist_node *n; > + struct hlist_nulls_node *n; > unsigned int hash = hash_conntrack(tuple); > > /* Disable BHs the entire time since we normally need to disable them > * at least once for the stats anyway. > */ > local_bh_disable(); > - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { > +begin: > + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { > if (nf_ct_tuple_equal(tuple, &h->tuple)) { > NF_CT_STAT_INC(net, found); > local_bh_enable(); > @@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) > } > NF_CT_STAT_INC(net, searched); > } > + /* > + * if the nulls value we got at the end of this lookup is > + * not the expected one, we must restart lookup. > + * We probably met an item that was moved to another chain. > + */ > + if (get_nulls_value(n) != hash) > + goto begin; Are you sure this is enough? An entry might have been reused and added to the same chain I think, so I think we need to recheck the tuple. > local_bh_enable(); > > return NULL; > @@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple) > struct nf_conn *ct; > > rcu_read_lock(); > +begin: > h = __nf_conntrack_find(net, tuple); > if (h) { > ct = nf_ct_tuplehash_to_ctrack(h); > if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) > h = NULL; > + else { > + if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) { > + nf_ct_put(ct); > + goto begin; Ah I see, the hash comparison above is only an optimization? > + } > + } > } > rcu_read_unlock(); > ^ permalink raw reply [flat|nested] 44+ messages in thread
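To spell out the reuse race behind Patrick's question (an illustrative interleaving, not code from the patch): with SLAB_DESTROY_BY_RCU, kmem_cache_free() may hand the object straight to the next kmem_cache_alloc(), even inside the reader's RCU section:

	CPU 0 (lookup, under rcu_read_lock)    CPU 1
	-----------------------------------    ------------------------------
	h matches tuple T in the chain
	                                       last nf_ct_put() on that entry
	                                       nf_conntrack_free() -> slab
	                                       nf_conntrack_alloc() reuses the
	                                       same object for tuple T', which
	                                       may hash into the *same* chain
	atomic_inc_not_zero(&ct->ct_general.use)
	  succeeds -- the use count is live
	  again, but the entry now describes T'
	nf_ct_tuple_equal(tuple, &h->tuple)
	  fails -> nf_ct_put() and goto begin

The nulls check in __nf_conntrack_find() only catches entries that migrated to a different chain; reuse within the same chain is what the tuple re-check in nf_conntrack_find_get() covers, as Eric confirms a few messages below.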
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-25 13:39 ` Patrick McHardy @ 2009-03-25 13:44 ` Eric Dumazet 0 siblings, 0 replies; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 13:44 UTC (permalink / raw) To: Patrick McHardy; +Cc: Netfilter Developers, Linux Netdev List Patrick McHardy a écrit : > Eric Dumazet wrote: >> I have a litle problem on __nf_conntrack_find() being exported. >> >> Problem is that with SLAB_DESTROY_BY_RCU we must take a reference on >> object >> to recheck it. So ideally only nf_conntrack_find_get() should be used, >> or callers of __nf_conntrack_find() should lock nf_conntrack_lock >> (as properly done for example in net/netfilter/nf_conntrack_netlink.c, >> line 1292) >> >> Here is preliminary patch for review (not tested at all, its 4h50 am >> here :) ) >> >> Could you help me, by checking __nf_conntrack_find() use in >> net/netfilter/xt_connlimit.c ? >> and line 1246 of net/netfilter/nf_conntrack_netlink.c >> >> This part is a litle bit gray for me. :) > > In case of xt_connlimit, it seems fine to just take a reference. > In case of ctnetlink, keeping the unreferenced lookup under the > lock seems fine. We unfortunately have to export some internals > like nf_conntrack lock for ctnetlink anyways, so I don't think > it would be worth to change it to take references and unexport > the lookup function. > >> +/* >> + * Warning : >> + * - Caller must take a reference on returned object >> + * and recheck nf_ct_tuple_equal(tuple, &h->tuple) >> + * OR >> + * - Caller must lock nf_conntrack_lock before calling this function >> + */ >> struct nf_conntrack_tuple_hash * >> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple >> *tuple) >> { >> struct nf_conntrack_tuple_hash *h; >> - struct hlist_node *n; >> + struct hlist_nulls_node *n; >> unsigned int hash = hash_conntrack(tuple); >> >> /* Disable BHs the entire time since we normally need to disable >> them >> * at least once for the stats anyway. >> */ >> local_bh_disable(); >> - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { >> +begin: >> + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { >> if (nf_ct_tuple_equal(tuple, &h->tuple)) { >> NF_CT_STAT_INC(net, found); >> local_bh_enable(); >> @@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct >> nf_conntrack_tuple *tuple) >> } >> NF_CT_STAT_INC(net, searched); >> } >> + /* >> + * if the nulls value we got at the end of this lookup is >> + * not the expected one, we must restart lookup. >> + * We probably met an item that was moved to another chain. >> + */ >> + if (get_nulls_value(n) != hash) >> + goto begin; > > Are you sure this is enough? An entry might have been reused and added > to the same chain I think, so I think we need to recheck the tuple. Yes, done in caller > >> local_bh_enable(); >> >> return NULL; >> @@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const >> struct nf_conntrack_tuple *tuple) >> struct nf_conn *ct; >> >> rcu_read_lock(); >> +begin: >> h = __nf_conntrack_find(net, tuple); >> if (h) { >> ct = nf_ct_tuplehash_to_ctrack(h); >> if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) >> h = NULL; >> + else { >> + if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) { >> + nf_ct_put(ct); >> + goto begin; > > Ah I see, the hash comparison above is only an optimization? 
> >> + } >> + } >> } >> rcu_read_unlock(); >> > > check net/ipv4/udp.c for an example (__udp4_lib_lookup()) In case of UDP, key check is not returning true/false, but a score. So UDP case is a litle bit more complex than conntrack case. rcu_read_lock(); begin: result = NULL; badness = -1; sk_nulls_for_each_rcu(sk, node, &hslot->head) { score = compute_score(sk, net, saddr, hnum, sport, daddr, dport, dif); if (score > badness) { result = sk; badness = score; } } /* * if the nulls value we got at the end of this lookup is * not the expected one, we must restart lookup. * We probably met an item that was moved to another chain. */ if (get_nulls_value(node) != hash) goto begin; if (result) { if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt))) result = NULL; else if (unlikely(compute_score(result, net, saddr, hnum, sport, daddr, dport, dif) < badness)) { sock_put(result); goto begin; } } rcu_read_unlock(); return result; ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 12:07 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Eric Dumazet 2009-03-24 12:25 ` Eric Dumazet @ 2009-03-24 13:20 ` Joakim Tjernlund 2009-03-24 13:28 ` Patrick McHardy 2009-03-24 13:29 ` Eric Dumazet 2009-03-24 15:17 ` Maxime Bizon 2 siblings, 2 replies; 44+ messages in thread From: Joakim Tjernlund @ 2009-03-24 13:20 UTC (permalink / raw) To: Eric Dumazet; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 13:07:16: > > Joakim Tjernlund a écrit : > > Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53: > >> Joakim Tjernlund a écrit : > >>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 18:49:15: > >>>> Joakim Tjernlund wrote: > >>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33: > >>>>> > >>>>> > >>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a > >>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet > >>>>>>> to the board I see: > >>>>>>> > >>>>>> That means that something is leaking conntrack references, most > >>> likely > >>>>>> by leaking skbs. Since I haven't seen any other reports, my guess > >>> would > >>>>>> be the ucc_geth driver. > >>>>>> > >>>>> Mucking around with the ucc_geth driver I found that if I: > >>>>> - Move TX from IRQ to NAPI context > >>>>> - double the weight. > >>>>> - after booting up, wait a few mins until the JFFS2 GC kernel > > thread > >>> has > >>>>> stopped > >>>>> scanning the FS > >>>>> > >>>>> Then the "nf_conntrack: table full, dropping packet." msgs stops. > >>>>> Does this seem right to you guys? > >>>> No. As I said, something seems to be leaking packets. You should be > >>>> able to confirm that by checking the sk_buff slabs in /proc/slabinfo. > >>>> If that *doesn't* show any signs of a leak, please run "conntrack -E" > >>>> to capture the conntrack events before the "table full" message > >>>> appears and post the output. > >>> skbuff does not differ much, but others do > >>> > >>> Before ping: > >>> skbuff_fclone_cache 0 0 352 11 1 : tunables 54 27 > > 0 > >>> : slabdata 0 0 0 > >>> skbuff_head_cache 20 20 192 20 1 : tunables 120 60 > > 0 > >>> : slabdata 1 1 0 > >>> size-64 731 767 64 59 1 : tunables 120 60 > > 0 > >>> : slabdata 13 13 0 > >>> nf_conntrack 10 19 208 19 1 : tunables 120 60 > > 0 > >>> : slabdata 1 1 0 > >>> > >>> During ping: > >>> skbuff_fclone_cache 0 0 352 11 1 : tunables 54 27 > > 0 > >>> : slabdata 0 0 0 > >>> skbuff_head_cache 40 40 192 20 1 : tunables 120 60 > > 0 > >>> : slabdata 2 2 0 > >>> size-64 8909 8909 64 59 1 : tunables 120 60 > > 0 > >>> : slabdata 151 151 0 > >>> nf_conntrack 5111 5111 208 19 1 : tunables 120 60 > > 0 > >>> : slabdata 269 269 0 > >>> > >>> This feels more like the freeing of conntrack objects are delayed and > >>> builds up when ping flooding. > >>> > >>> Don't have "conntrack -E" for my embedded board so that will have to > > wait > >>> a bit longer. > >> I dont understand how your ping can use so many conntrack entries... > >> > >> Then, as I said yesterday, I believe you have a RCU delay, because of > >> a misbehaving driver or something... 
> >> > >> grep RCU .config > > grep RCU .config > > # RCU Subsystem > > CONFIG_CLASSIC_RCU=y > > # CONFIG_TREE_RCU is not set > > # CONFIG_PREEMPT_RCU is not set > > # CONFIG_TREE_RCU_TRACE is not set > > # CONFIG_PREEMPT_RCU_TRACE is not set > > # CONFIG_RCU_TORTURE_TEST is not set > > # CONFIG_RCU_CPU_STALL_DETECTOR is not set > > > >> grep CONFIG_SMP .config > > grep CONFIG_SMP .config > > # CONFIG_SMP is not set > > > >> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line > > 80) > >> as a workaround. It should force a quiescent state after 1000 freed > > conntracks. > > > > right, doing this almost killed all conntrack messages, had to stress it > > pretty > > hard before I saw handful "nf_conntrack: table full, dropping packet" > > > > RCU is not my cup of tea, do you have any ideas were to look? > > In a stress situation, you feed more deleted conntracks to call_rcu() than > the blimit (10 real freeing per RCU softirq invocation). > > So with default qhimark being 10000, this means about 10000 conntracks > can sit in RCU (per CPU) before being really freed. > > Only when hitting 10000, RCU enters a special mode to free all queued items, instead > of a small batch of 10 > > To solve your problem we can : > > 1) reduce qhimark from 10000 to 1000 (for example) > Probably should be done to reduce some spikes in RCU code when freeing > whole 10000 elements... > OR > 2) change conntrack tunable (max conntrack entries on your machine) > OR > 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count > in nf_conntrack_free() instead of callback. > > [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() The patch fixes the problem and the system feels a bit more responsive too, thanks. I guess I should probably do both 1) and 3) as my board is pretty slow too. Been trying to figure out a good value for NAPI weigth too. Currently my HW RX and TX queues are 16 pkgs deep and weigth is 16 too. If I move TX processing to NAPI context AND increase weigth to 32, the system is a lot more responsive during ping flooding. Does weigth 32 make sense when the HW TX and RX queues are 16? Jocke ^ permalink raw reply [flat|nested] 44+ messages in thread
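On the weight question, one plausible rule of thumb (an assumption about this driver experiment, not an upstream recommendation): if TX completions are cleaned in ucc_geth_poll() and counted against the NAPI budget, one poll should be allowed to drain a full RX ring plus a full TX ring, which for two 16-deep rings is exactly the weight of 32 that behaved well:

	/* hypothetical probe-path change matching the experiment above;
	 * UCC_GETH_DEV_WEIGHT is the driver's existing constant (16) */
	netif_napi_add(dev, &ugeth->napi, ucc_geth_poll,
		       2 * UCC_GETH_DEV_WEIGHT);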
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 13:20 ` Joakim Tjernlund @ 2009-03-24 13:28 ` Patrick McHardy 2009-03-24 13:29 ` Eric Dumazet 1 sibling, 0 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-24 13:28 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: Eric Dumazet, avorontsov, netdev, Paul E. McKenney Joakim Tjernlund wrote: > Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 13:07:16: >> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() > > The patch fixes the problem and the system feels a bit more responsive > too, thanks. Applied, thanks everyone. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 13:20 ` Joakim Tjernlund 2009-03-24 13:28 ` Patrick McHardy @ 2009-03-24 13:29 ` Eric Dumazet 2009-03-24 13:41 ` Joakim Tjernlund 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-24 13:29 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney Joakim Tjernlund a écrit : > Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 13:07:16: >> Joakim Tjernlund a écrit : >>> Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53: >>>> Joakim Tjernlund a écrit : >>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 18:49:15: >>>>>> Joakim Tjernlund wrote: >>>>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33: >>>>>>> >>>>>>> >>>>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a >>>>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet >>>>>>>>> to the board I see: >>>>>>>>> >>>>>>>> That means that something is leaking conntrack references, most >>>>> likely >>>>>>>> by leaking skbs. Since I haven't seen any other reports, my guess > >>>>> would >>>>>>>> be the ucc_geth driver. >>>>>>>> >>>>>>> Mucking around with the ucc_geth driver I found that if I: >>>>>>> - Move TX from IRQ to NAPI context >>>>>>> - double the weight. >>>>>>> - after booting up, wait a few mins until the JFFS2 GC kernel >>> thread >>>>> has >>>>>>> stopped >>>>>>> scanning the FS >>>>>>> >>>>>>> Then the "nf_conntrack: table full, dropping packet." msgs stops. >>>>>>> Does this seem right to you guys? >>>>>> No. As I said, something seems to be leaking packets. You should be >>>>>> able to confirm that by checking the sk_buff slabs in > /proc/slabinfo. >>>>>> If that *doesn't* show any signs of a leak, please run "conntrack > -E" >>>>>> to capture the conntrack events before the "table full" message >>>>>> appears and post the output. >>>>> skbuff does not differ much, but others do >>>>> >>>>> Before ping: >>>>> skbuff_fclone_cache 0 0 352 11 1 : tunables 54 > 27 >>> 0 >>>>> : slabdata 0 0 0 >>>>> skbuff_head_cache 20 20 192 20 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 1 1 0 >>>>> size-64 731 767 64 59 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 13 13 0 >>>>> nf_conntrack 10 19 208 19 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 1 1 0 >>>>> >>>>> During ping: >>>>> skbuff_fclone_cache 0 0 352 11 1 : tunables 54 > 27 >>> 0 >>>>> : slabdata 0 0 0 >>>>> skbuff_head_cache 40 40 192 20 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 2 2 0 >>>>> size-64 8909 8909 64 59 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 151 151 0 >>>>> nf_conntrack 5111 5111 208 19 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 269 269 0 >>>>> >>>>> This feels more like the freeing of conntrack objects are delayed > and >>>>> builds up when ping flooding. >>>>> >>>>> Don't have "conntrack -E" for my embedded board so that will have to > >>> wait >>>>> a bit longer. >>>> I dont understand how your ping can use so many conntrack entries... >>>> >>>> Then, as I said yesterday, I believe you have a RCU delay, because of >>>> a misbehaving driver or something... 
>>>> >>>> grep RCU .config >>> grep RCU .config >>> # RCU Subsystem >>> CONFIG_CLASSIC_RCU=y >>> # CONFIG_TREE_RCU is not set >>> # CONFIG_PREEMPT_RCU is not set >>> # CONFIG_TREE_RCU_TRACE is not set >>> # CONFIG_PREEMPT_RCU_TRACE is not set >>> # CONFIG_RCU_TORTURE_TEST is not set >>> # CONFIG_RCU_CPU_STALL_DETECTOR is not set >>> >>>> grep CONFIG_SMP .config >>> grep CONFIG_SMP .config >>> # CONFIG_SMP is not set >>> >>>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c > (line >>> 80) >>>> as a workaround. It should force a quiescent state after 1000 freed >>> conntracks. >>> >>> right, doing this almost killed all conntrack messages, had to stress > it >>> pretty >>> hard before I saw handful "nf_conntrack: table full, dropping packet" >>> >>> RCU is not my cup of tea, do you have any ideas were to look? >> In a stress situation, you feed more deleted conntracks to call_rcu() > than >> the blimit (10 real freeing per RCU softirq invocation). >> >> So with default qhimark being 10000, this means about 10000 conntracks >> can sit in RCU (per CPU) before being really freed. >> >> Only when hitting 10000, RCU enters a special mode to free all queued > items, instead >> of a small batch of 10 >> >> To solve your problem we can : >> >> 1) reduce qhimark from 10000 to 1000 (for example) >> Probably should be done to reduce some spikes in RCU code when > freeing >> whole 10000 elements... >> OR >> 2) change conntrack tunable (max conntrack entries on your machine) >> OR >> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count >> in nf_conntrack_free() instead of callback. >> >> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() > > The patch fixes the problem and the system feels a bit more responsive > too, thanks. > I guess I should probably do both 1) and 3) as my board is pretty slow > too. > > Been trying to figure out a good value for NAPI weigth too. Currently my > HW RX and TX queues are 16 pkgs deep and weigth is 16 too. If I move TX > processing > to NAPI context AND increase weigth to 32, the system is a lot more > responsive during > ping flooding. Does weigth 32 make sense when the HW TX and RX queues are > 16? If you only have one NIC, I dont understand why changing weight should make a difference. Are you referring to dev_weight or netdev_budget ? # cat /proc/sys/net/core/dev_weight 64 # cat /proc/sys/net/core/netdev_budget 300 ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 13:29 ` Eric Dumazet @ 2009-03-24 13:41 ` Joakim Tjernlund 0 siblings, 0 replies; 44+ messages in thread From: Joakim Tjernlund @ 2009-03-24 13:41 UTC (permalink / raw) To: Eric Dumazet; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 14:29:29: [SNIP] > >>>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c > > (line > >>> 80) > >>>> as a workaround. It should force a quiescent state after 1000 freed > >>> conntracks. > >>> > >>> right, doing this almost killed all conntrack messages, had to stress > > it > >>> pretty > >>> hard before I saw handful "nf_conntrack: table full, dropping packet" > >>> > >>> RCU is not my cup of tea, do you have any ideas were to look? > >> In a stress situation, you feed more deleted conntracks to call_rcu() > > than > >> the blimit (10 real freeing per RCU softirq invocation). > >> > >> So with default qhimark being 10000, this means about 10000 conntracks > >> can sit in RCU (per CPU) before being really freed. > >> > >> Only when hitting 10000, RCU enters a special mode to free all queued > > items, instead > >> of a small batch of 10 > >> > >> To solve your problem we can : > >> > >> 1) reduce qhimark from 10000 to 1000 (for example) > >> Probably should be done to reduce some spikes in RCU code when > > freeing > >> whole 10000 elements... > >> OR > >> 2) change conntrack tunable (max conntrack entries on your machine) > >> OR > >> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count > >> in nf_conntrack_free() instead of callback. > >> > >> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() > > > > The patch fixes the problem and the system feels a bit more responsive > > too, thanks. > > I guess I should probably do both 1) and 3) as my board is pretty slow > > too. > > > > Been trying to figure out a good value for NAPI weigth too. Currently my > > HW RX and TX queues are 16 pkgs deep and weigth is 16 too. If I move TX > > processing > > to NAPI context AND increase weigth to 32, the system is a lot more > > responsive during > > ping flooding. Does weigth 32 make sense when the HW TX and RX queues are > > 16? > > If you only have one NIC, I dont understand why changing weight should make > a difference. Are you referring to dev_weight or netdev_budget ? > > # cat /proc/sys/net/core/dev_weight > 64 > # cat /proc/sys/net/core/netdev_budget > 300 I mean this call in ucc_geth: netif_napi_add(dev, &ugeth->napi, ucc_geth_poll, UCC_GETH_DEV_WEIGHT); UCC_GETH_DEV_WEIGHT is 16 Noticed that rcuclassic.c has a module_param(qhimark, int, 0); But I can't figure out hot to set this qhimark from the cmdline. rcuclassic.c is not a module(I don't use modules at all) Jocke Jocke ^ permalink raw reply [flat|nested] 44+ messages in thread
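A likely answer to the cmdline question (per Documentation/kernel-parameters.txt, not verified on this board): parameters declared with module_param() in code built into the kernel are passed on the kernel command line as <filename>.<parameter>; the permission value of 0 only means no /sys/module entry is created, it does not disable the command-line form. For kernel/rcuclassic.c that should be:

	rcuclassic.qhimark=1000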
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 12:07 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Eric Dumazet 2009-03-24 12:25 ` Eric Dumazet 2009-03-24 13:20 ` Joakim Tjernlund @ 2009-03-24 15:17 ` Maxime Bizon 2009-03-24 15:21 ` Patrick McHardy ` (2 more replies) 2 siblings, 3 replies; 44+ messages in thread From: Maxime Bizon @ 2009-03-24 15:17 UTC (permalink / raw) To: Eric Dumazet Cc: Joakim Tjernlund, avorontsov, Patrick McHardy, netdev, Paul E. McKenney On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote: Hi Eric, > We use RCU to defer freeing of conntrack structures. In DOS situation, > RCU might accumulate about 10.000 elements per CPU in its internal > queues. To get accurate conntrack counts (at the expense of slightly > more RAM used), we might consider conntrack counter not taking into > account "about to be freed elements, waiting in RCU queues". We thus > decrement it in nf_conntrack_free(), not in the RCU callback. Your patch fixes the problem on my board too (embedded mips router 250Mhz), thanks. Yet I'm concerned about what you said concerning RAM usage. I have a very small amount on memory left on my board (less than 4M), and I tuned ip route cache size and nf_conntrack_max to make sure I won't go OOM. With your patch, does it mean 10000 conntrack entries can be allocated while nf_conntrack_max is say only 2048 ? Regards, -- Maxime ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 15:17 ` Maxime Bizon @ 2009-03-24 15:21 ` Patrick McHardy 2009-03-24 15:27 ` Eric Dumazet 2009-03-24 18:29 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Joakim Tjernlund 2 siblings, 0 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-24 15:21 UTC (permalink / raw) To: mbizon; +Cc: Eric Dumazet, Joakim Tjernlund, avorontsov, netdev, Paul E. McKenney Maxime Bizon wrote: > On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote: > >> We use RCU to defer freeing of conntrack structures. In DOS situation, >> RCU might accumulate about 10.000 elements per CPU in its internal >> queues. To get accurate conntrack counts (at the expense of slightly >> more RAM used), we might consider conntrack counter not taking into >> account "about to be freed elements, waiting in RCU queues". We thus >> decrement it in nf_conntrack_free(), not in the RCU callback. > > Your patch fixes the problem on my board too (embedded mips router > 250Mhz), thanks. > > Yet I'm concerned about what you said concerning RAM usage. I have a > very small amount on memory left on my board (less than 4M), and I tuned > ip route cache size and nf_conntrack_max to make sure I won't go OOM. > > With your patch, does it mean 10000 conntrack entries can be allocated > while nf_conntrack_max is say only 2048 ? Temporarily under worst-case circumstances, yes. Eric is already working on his proposed improvement though :) ^ permalink raw reply [flat|nested] 44+ messages in thread
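Rough numbers for that worst case, using figures already posted in this thread (208 bytes per nf_conn from Joakim's slabinfo, Maxime's nf_conntrack_max of 2048, the default qhimark of 10000, one CPU):

	peak RAM ~ (nf_conntrack_max + qhimark * nr_cpus) * sizeof(struct nf_conn)
	         ~ (2048 + 10000 * 1) * 208 bytes
	         ~ 2.4 MB

So on a box with less than 4 MB free the transient really can matter, which is what the SLAB_DESTROY_BY_RCU rework below avoids by never parking freed conntracks in RCU queues.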
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 15:17 ` Maxime Bizon 2009-03-24 15:21 ` Patrick McHardy @ 2009-03-24 15:27 ` Eric Dumazet 2009-03-24 19:54 ` [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Eric Dumazet 2009-03-25 17:53 ` [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs Eric Dumazet 2009-03-24 18:29 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Joakim Tjernlund 2 siblings, 2 replies; 44+ messages in thread From: Eric Dumazet @ 2009-03-24 15:27 UTC (permalink / raw) To: mbizon, Paul E. McKenney Cc: Joakim Tjernlund, avorontsov, Patrick McHardy, netdev Maxime Bizon a écrit : > On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote: > > Hi Eric, > >> We use RCU to defer freeing of conntrack structures. In DOS situation, >> RCU might accumulate about 10.000 elements per CPU in its internal >> queues. To get accurate conntrack counts (at the expense of slightly >> more RAM used), we might consider conntrack counter not taking into >> account "about to be freed elements, waiting in RCU queues". We thus >> decrement it in nf_conntrack_free(), not in the RCU callback. > > Your patch fixes the problem on my board too (embedded mips router > 250Mhz), thanks. > > Yet I'm concerned about what you said concerning RAM usage. I have a > very small amount on memory left on my board (less than 4M), and I tuned > ip route cache size and nf_conntrack_max to make sure I won't go OOM. > > With your patch, does it mean 10000 conntrack entries can be allocated > while nf_conntrack_max is say only 2048 ? Well... yes, RCU can have this 'interesting' OOM property. For small machines, you really want to lower RCU parameters, because as you said, we also push route cache entries in RCU queue, my patch being applied or not (But using call_rcu_bh(), so we have lower latencies I think) We are working on a SLAB_DESTROY_BY_RCU implementation so that conntrack wont use call_rcu() anymore, give us a couple of days :) Paul, could we have /sys knobs to be able to tune qhimark, blimit & qlowmark ? Thanks ^ permalink raw reply [flat|nested] 44+ messages in thread
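For readers following the thread: SLAB_DESTROY_BY_RCU does not delay individual frees at all -- kmem_cache_free() may hand the object to the very next kmem_cache_alloc(). What it guarantees is that the cache's backing pages are not returned to the page allocator until a grace period has elapsed, so a pointer found under rcu_read_lock() always points at *some* valid nf_conn-shaped memory. A sketch of the contract (kernel context; the cache setup line matches the patch Eric posts below):

	static struct kmem_cache *nf_conntrack_cachep;

	/* objects may be reused immediately; only the backing pages are
	 * RCU-deferred, so readers must re-validate whatever they found
	 * (refcount + key re-check) instead of assuming stability */
	nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
						sizeof(struct nf_conn),
						0, SLAB_DESTROY_BY_RCU, NULL);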
* [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() 2009-03-24 15:27 ` Eric Dumazet @ 2009-03-24 19:54 ` Eric Dumazet 2009-03-25 16:26 ` Patrick McHardy 2009-03-25 17:53 ` [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs Eric Dumazet 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-24 19:54 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev Eric Dumazet a écrit : > > We are working on a SLAB_DESTROY_BY_RCU implementation so that > conntrack wont use call_rcu() anymore, give us a couple of days :) > While working on this stuff, I found one suspect use of hlist_add_head() Its not a hot path, I believe following patch would make sure nothing wrong happens. If a chain contains element A and B, then we might build a new table with a new chain containing B and A (in this reverse order), and a cpu could see A->next = B (new pointer), B->next = A (old pointer) Thanks [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Using hlist_add_head() in nf_conntrack_set_hashsize() is quite dangerous. Without any barrier, one CPU could see a loop while doing its lookup. Its true new table cannot be seen by another cpu, but previous table is still readable. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 55befe5..54e983f 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1121,7 +1121,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) struct nf_conntrack_tuple_hash, hnode); hlist_del_rcu(&h->hnode); bucket = __hash_conntrack(&h->tuple, hashsize, rnd); - hlist_add_head(&h->hnode, &hash[bucket]); + hlist_add_head_rcu(&h->hnode, &hash[bucket]); } } old_size = nf_conntrack_htable_size; ^ permalink raw reply related [flat|nested] 44+ messages in thread
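The whole difference between the unsafe and the safe variant is one write barrier; hlist_add_head_rcu() as it looked in that era (a sketch mirroring include/linux/rculist.h):

	static inline void hlist_add_head_rcu(struct hlist_node *n,
					      struct hlist_head *h)
	{
		struct hlist_node *first = h->first;

		n->next = first;	/* fully link the node first ... */
		n->pprev = &h->first;
		/* ... then publish it: rcu_assign_pointer() implies
		 * smp_wmb(), so a concurrent reader that sees the new
		 * h->first is guaranteed to also see n->next already
		 * set, and cannot chase the stale-ordering loop Eric
		 * describes above. */
		rcu_assign_pointer(h->first, n);
		if (first)
			first->pprev = &n->next;
	}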
* Re: [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() 2009-03-24 19:54 ` [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Eric Dumazet @ 2009-03-25 16:26 ` Patrick McHardy 0 siblings, 0 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 16:26 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev Eric Dumazet wrote: > While working on this stuff, I found one suspect use of hlist_add_head() > > Its not a hot path, I believe following patch would make sure nothing > wrong happens. > > If a chain contains element A and B, then we might build a new table > with a new chain containing B and A (in this reverse order), and > a cpu could see A->next = B (new pointer), B->next = A (old pointer) I think you're right. > [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() > > Using hlist_add_head() in nf_conntrack_set_hashsize() is quite dangerous. > Without any barrier, one CPU could see a loop while doing its lookup. > Its true new table cannot be seen by another cpu, but previous table is still > readable. Applied, thanks Eric. ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-24 15:27 ` Eric Dumazet 2009-03-24 19:54 ` [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Eric Dumazet @ 2009-03-25 17:53 ` Eric Dumazet 2009-03-25 18:05 ` Patrick McHardy 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 17:53 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Hi Patrick Here is the patch I had the time to test this time... No problem so far on my machine. I did a UDP flood stress. Thank you [PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP. This permits an easy conversion from call_rcu() based hash lists to a SLAB_DESTROY_BY_RCU one. Avoiding call_rcu() delay at nf_conn freeing time has numerous gains. First, it doesnt fill RCU queues (up to 10000 elements per cpu). This reduces OOM possibility, if queued elements are not taken into account This reduces latency problems when RCU queue size hits hilimit and triggers emergency mode. - It allows fast reuse of just freed elements, permitting better use of CPU cache. - We delete rcu_head from "struct nf_conn", shrinking size of this structure by 8 or 16 bytes. This patch only takes care of "struct nf_conn". call_rcu() is still used for less critical conntrack parts, that may be converted later if necessary. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/netfilter/nf_conntrack.h | 14 - include/net/netfilter/nf_conntrack_tuple.h | 6 include/net/netns/conntrack.h | 5 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 16 - net/ipv4/netfilter/nf_nat_core.c | 2 net/netfilter/nf_conntrack_core.c | 123 +++++----- net/netfilter/nf_conntrack_expect.c | 2 net/netfilter/nf_conntrack_helper.c | 7 net/netfilter/nf_conntrack_netlink.c | 10 net/netfilter/nf_conntrack_standalone.c | 16 - net/netfilter/xt_connlimit.c | 4 11 files changed, 114 insertions(+), 91 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 4dfb793..6c3f964 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -91,8 +91,7 @@ struct nf_conn_help { #include <net/netfilter/ipv4/nf_conntrack_ipv4.h> #include <net/netfilter/ipv6/nf_conntrack_ipv6.h> -struct nf_conn -{ +struct nf_conn { /* Usage count in here is 1 for hash table/destruct timer, 1 per skb, plus 1 for any connection(s) we are `master' for */ struct nf_conntrack ct_general; @@ -126,7 +125,6 @@ struct nf_conn #ifdef CONFIG_NET_NS struct net *ct_net; #endif - struct rcu_head rcu; }; static inline struct nf_conn * @@ -190,9 +188,13 @@ static inline void nf_ct_put(struct nf_conn *ct) extern int nf_ct_l3proto_try_module_get(unsigned short l3proto); extern void nf_ct_l3proto_module_put(unsigned short l3proto); -extern struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced); -extern void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, - unsigned int size); +/* + * Allocate a hashtable of hlist_head (if nulls == 0), + * or hlist_nulls_head (if nulls == 1) + */ +extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls); + +extern void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size); extern struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple); diff --git 
a/include/net/netfilter/nf_conntrack_tuple.h b/include/net/netfilter/nf_conntrack_tuple.h index f2f6aa7..2628c15 100644 --- a/include/net/netfilter/nf_conntrack_tuple.h +++ b/include/net/netfilter/nf_conntrack_tuple.h @@ -12,6 +12,7 @@ #include <linux/netfilter/x_tables.h> #include <linux/netfilter/nf_conntrack_tuple_common.h> +#include <linux/list_nulls.h> /* A `tuple' is a structure containing the information to uniquely identify a connection. ie. if two packets have the same tuple, they @@ -146,9 +147,8 @@ static inline void nf_ct_dump_tuple(const struct nf_conntrack_tuple *t) ((enum ip_conntrack_dir)(h)->tuple.dst.dir) /* Connections have two entries in the hash table: one for each way */ -struct nf_conntrack_tuple_hash -{ - struct hlist_node hnode; +struct nf_conntrack_tuple_hash { + struct hlist_nulls_node hnnode; struct nf_conntrack_tuple tuple; }; diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index f4498a6..9dc5840 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -2,6 +2,7 @@ #define __NETNS_CONNTRACK_H #include <linux/list.h> +#include <linux/list_nulls.h> #include <asm/atomic.h> struct ctl_table_header; @@ -10,9 +11,9 @@ struct nf_conntrack_ecache; struct netns_ct { atomic_t count; unsigned int expect_count; - struct hlist_head *hash; + struct hlist_nulls_head *hash; struct hlist_head *expect_hash; - struct hlist_head unconfirmed; + struct hlist_nulls_head unconfirmed; struct ip_conntrack_stat *stat; #ifdef CONFIG_NF_CONNTRACK_EVENTS struct nf_conntrack_ecache *ecache; diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c index 6ba5c55..fcbcf62 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c @@ -25,30 +25,30 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { + while (is_a_nulls(head)) { if (++st->bucket >= nf_conntrack_htable_size) return NULL; head = rcu_dereference(net->ct.hash[st->bucket].first); @@ -56,9 +56,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c index a65cf69..fe65187 100644 --- a/net/ipv4/netfilter/nf_nat_core.c +++ b/net/ipv4/netfilter/nf_nat_core.c @@ -679,7 +679,7 @@ nfnetlink_parse_nat_setup(struct nf_conn *ct, static int __net_init nf_nat_net_init(struct net *net) { 
net->ipv4.nat_bysource = nf_ct_alloc_hashtable(&nf_nat_htable_size, - &net->ipv4.nat_vmalloced); + &net->ipv4.nat_vmalloced, 0); if (!net->ipv4.nat_bysource) return -ENOMEM; return 0; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 54e983f..9ed7a6b 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -29,6 +29,7 @@ #include <linux/netdevice.h> #include <linux/socket.h> #include <linux/mm.h> +#include <linux/rculist_nulls.h> #include <net/netfilter/nf_conntrack.h> #include <net/netfilter/nf_conntrack_l3proto.h> @@ -163,8 +164,8 @@ static void clean_from_lists(struct nf_conn *ct) { pr_debug("clean_from_lists(%p)\n", ct); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode); /* Destroy all pending expectations */ nf_ct_remove_expectations(ct); @@ -204,8 +205,8 @@ destroy_conntrack(struct nf_conntrack *nfct) /* We overload first tuple to link into unconfirmed list. */ if (!nf_ct_is_confirmed(ct)) { - BUG_ON(hlist_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode)); - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); } NF_CT_STAT_INC(net, delete); @@ -242,18 +243,26 @@ static void death_by_timeout(unsigned long ul_conntrack) nf_ct_put(ct); } +/* + * Warning : + * - Caller must take a reference on returned object + * and recheck nf_ct_tuple_equal(tuple, &h->tuple) + * OR + * - Caller must lock nf_conntrack_lock before calling this function + */ struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) { struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we normally need to disable them * at least once for the stats anyway. */ local_bh_disable(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { +begin: + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); local_bh_enable(); @@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) } NF_CT_STAT_INC(net, searched); } + /* + * if the nulls value we got at the end of this lookup is + * not the expected one, we must restart lookup. + * We probably met an item that was moved to another chain. 
+ */ + if (get_nulls_value(n) != hash) + goto begin; local_bh_enable(); return NULL; @@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple) struct nf_conn *ct; rcu_read_lock(); +begin: h = __nf_conntrack_find(net, tuple); if (h) { ct = nf_ct_tuplehash_to_ctrack(h); if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) h = NULL; + else { + if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) { + nf_ct_put(ct); + goto begin; + } + } } rcu_read_unlock(); @@ -293,9 +316,9 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct, { struct net *net = nf_ct_net(ct); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.hash[hash]); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode, &net->ct.hash[repl_hash]); } @@ -318,7 +341,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct nf_conn_help *help; - struct hlist_node *n; + struct hlist_nulls_node *n; enum ip_conntrack_info ctinfo; struct net *net; @@ -350,17 +373,17 @@ __nf_conntrack_confirm(struct sk_buff *skb) /* See if there's one in the list already, including reverse: NAT could have grabbed it without realizing, since we're not in the hash. If there is, we lost race. */ - hlist_for_each_entry(h, n, &net->ct.hash[hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, &h->tuple)) goto out; - hlist_for_each_entry(h, n, &net->ct.hash[repl_hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, &h->tuple)) goto out; /* Remove from unconfirmed list */ - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); __nf_conntrack_hash_insert(ct, hash, repl_hash); /* Timer relative to confirmation time, not original @@ -399,14 +422,14 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, { struct net *net = nf_ct_net(ignored_conntrack); struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we need to disable them at * least once for the stats anyway. 
*/ rcu_read_lock_bh(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack && nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); @@ -430,14 +453,14 @@ static noinline int early_drop(struct net *net, unsigned int hash) /* Use oldest entry, which is roughly LRU */ struct nf_conntrack_tuple_hash *h; struct nf_conn *ct = NULL, *tmp; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int i, cnt = 0; int dropped = 0; rcu_read_lock(); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], + hnnode) { tmp = nf_ct_tuplehash_to_ctrack(h); if (!test_bit(IPS_ASSURED_BIT, &tmp->status)) ct = tmp; @@ -508,27 +531,19 @@ struct nf_conn *nf_conntrack_alloc(struct net *net, #ifdef CONFIG_NET_NS ct->ct_net = net; #endif - INIT_RCU_HEAD(&ct->rcu); return ct; } EXPORT_SYMBOL_GPL(nf_conntrack_alloc); -static void nf_conntrack_free_rcu(struct rcu_head *head) -{ - struct nf_conn *ct = container_of(head, struct nf_conn, rcu); - - nf_ct_ext_free(ct); - kmem_cache_free(nf_conntrack_cachep, ct); -} - void nf_conntrack_free(struct nf_conn *ct) { struct net *net = nf_ct_net(ct); nf_ct_ext_destroy(ct); atomic_dec(&net->ct.count); - call_rcu(&ct->rcu, nf_conntrack_free_rcu); + nf_ct_ext_free(ct); + kmem_cache_free(nf_conntrack_cachep, ct); } EXPORT_SYMBOL_GPL(nf_conntrack_free); @@ -594,7 +609,7 @@ init_conntrack(struct net *net, } /* Overload tuple linked list to put us in unconfirmed list. */ - hlist_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.unconfirmed); spin_unlock_bh(&nf_conntrack_lock); @@ -934,17 +949,17 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), { struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; - struct hlist_node *n; + struct hlist_nulls_node *n; spin_lock_bh(&nf_conntrack_lock); for (; *bucket < nf_conntrack_htable_size; (*bucket)++) { - hlist_for_each_entry(h, n, &net->ct.hash[*bucket], hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) goto found; } } - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) set_bit(IPS_DYING_BIT, &ct->status); @@ -992,7 +1007,7 @@ static int kill_all(struct nf_conn *i, void *data) return 1; } -void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, unsigned int size) +void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size) { if (vmalloced) vfree(hash); @@ -1060,26 +1075,28 @@ void nf_conntrack_cleanup(struct net *net) } } -struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced) +void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls) { - struct hlist_head *hash; - unsigned int size, i; + struct hlist_nulls_head *hash; + unsigned int nr_slots, i; + size_t sz; *vmalloced = 0; - size = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_head)); - hash = (void*)__get_free_pages(GFP_KERNEL|__GFP_NOWARN, - get_order(sizeof(struct hlist_head) - * size)); + BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct hlist_head)); + nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct 
hlist_nulls_head)); + sz = nr_slots * sizeof(struct hlist_nulls_head); + hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, + get_order(sz)); if (!hash) { *vmalloced = 1; printk(KERN_WARNING "nf_conntrack: falling back to vmalloc.\n"); - hash = vmalloc(sizeof(struct hlist_head) * size); + hash = __vmalloc(sz, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL); } - if (hash) - for (i = 0; i < size; i++) - INIT_HLIST_HEAD(&hash[i]); + if (hash && nulls) + for (i = 0; i < nr_slots; i++) + INIT_HLIST_NULLS_HEAD(&hash[i], i); return hash; } @@ -1090,7 +1107,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) int i, bucket, vmalloced, old_vmalloced; unsigned int hashsize, old_size; int rnd; - struct hlist_head *hash, *old_hash; + struct hlist_nulls_head *hash, *old_hash; struct nf_conntrack_tuple_hash *h; /* On boot, we can set this without any fancy locking. */ @@ -1101,7 +1118,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) if (!hashsize) return -EINVAL; - hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced); + hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced, 1); if (!hash) return -ENOMEM; @@ -1116,12 +1133,12 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) */ spin_lock_bh(&nf_conntrack_lock); for (i = 0; i < nf_conntrack_htable_size; i++) { - while (!hlist_empty(&init_net.ct.hash[i])) { - h = hlist_entry(init_net.ct.hash[i].first, - struct nf_conntrack_tuple_hash, hnode); - hlist_del_rcu(&h->hnode); + while (!hlist_nulls_empty(&init_net.ct.hash[i])) { + h = hlist_nulls_entry(init_net.ct.hash[i].first, + struct nf_conntrack_tuple_hash, hnnode); + hlist_nulls_del_rcu(&h->hnode); bucket = __hash_conntrack(&h->tuple, hashsize, rnd); - hlist_add_head_rcu(&h->hnode, &hash[bucket]); + hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]); } } old_size = nf_conntrack_htable_size; @@ -1172,7 +1189,7 @@ static int nf_conntrack_init_init_net(void) nf_conntrack_cachep = kmem_cache_create("nf_conntrack", sizeof(struct nf_conn), - 0, 0, NULL); + 0, SLAB_DESTROY_BY_RCU, NULL); if (!nf_conntrack_cachep) { printk(KERN_ERR "Unable to create nf_conn slab cache\n"); ret = -ENOMEM; @@ -1202,7 +1219,7 @@ static int nf_conntrack_init_net(struct net *net) int ret; atomic_set(&net->ct.count, 0); - INIT_HLIST_HEAD(&net->ct.unconfirmed); + INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, 0); net->ct.stat = alloc_percpu(struct ip_conntrack_stat); if (!net->ct.stat) { ret = -ENOMEM; @@ -1212,7 +1229,7 @@ static int nf_conntrack_init_net(struct net *net) if (ret < 0) goto err_ecache; net->ct.hash = nf_ct_alloc_hashtable(&nf_conntrack_htable_size, - &net->ct.hash_vmalloc); + &net->ct.hash_vmalloc, 1); if (!net->ct.hash) { ret = -ENOMEM; printk(KERN_ERR "Unable to create nf_conntrack_hash\n"); diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index 357ba39..3940f99 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -604,7 +604,7 @@ int nf_conntrack_expect_init(struct net *net) net->ct.expect_count = 0; net->ct.expect_hash = nf_ct_alloc_hashtable(&nf_ct_expect_hsize, - &net->ct.expect_vmalloc); + &net->ct.expect_vmalloc, 0); if (net->ct.expect_hash == NULL) goto err1; diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c index a51bdac..6066144 100644 --- a/net/netfilter/nf_conntrack_helper.c +++ b/net/netfilter/nf_conntrack_helper.c @@ -158,6 +158,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, 
struct nf_conntrack_tuple_hash *h; struct nf_conntrack_expect *exp; const struct hlist_node *n, *next; + const struct hlist_nulls_node *nn; unsigned int i; /* Get rid of expectations */ @@ -174,10 +175,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, } /* Get rid of expecteds, set helpers to NULL. */ - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) + hlist_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) unhelp(h, me); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry(h, n, &net->ct.hash[i], hnode) + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) unhelp(h, me); } } @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, - &nf_ct_helper_vmalloc); + &nf_ct_helper_vmalloc, 0); if (!nf_ct_helper_hash) return -ENOMEM; diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 1b75c9e..e73272b 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -19,6 +19,7 @@ #include <linux/module.h> #include <linux/kernel.h> #include <linux/rculist.h> +#include <linux/rculist_nulls.h> #include <linux/types.h> #include <linux/timer.h> #include <linux/skbuff.h> @@ -536,7 +537,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) { struct nf_conn *ct, *last; struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = NLMSG_DATA(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; @@ -544,8 +545,8 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) last = (struct nf_conn *)cb->args[1]; for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) { restart: - hlist_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], + hnnode) { if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) continue; ct = nf_ct_tuplehash_to_ctrack(h); @@ -1242,13 +1243,12 @@ ctnetlink_create_conntrack(struct nlattr *cda[], if (err < 0) goto err2; - master_h = __nf_conntrack_find(&init_net, &master); + master_h = nf_conntrack_find_get(&init_net, &master); if (master_h == NULL) { err = -ENOENT; goto err2; } master_ct = nf_ct_tuplehash_to_ctrack(master_h); - nf_conntrack_get(&master_ct->ct_general); __set_bit(IPS_EXPECTED_BIT, &ct->status); ct->master = master_ct; } diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 4da54b0..f768368 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -44,30 +44,30 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while 
(head == NULL) { + while (is_a_nulls(head)) { if (++st->bucket >= nf_conntrack_htable_size) return NULL; head = rcu_dereference(net->ct.hash[st->bucket].first); @@ -75,9 +75,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c index 7f404cc..3bf47d8 100644 --- a/net/netfilter/xt_connlimit.c +++ b/net/netfilter/xt_connlimit.c @@ -123,7 +123,7 @@ static int count_them(struct xt_connlimit_data *data, /* check the saved connections */ list_for_each_entry_safe(conn, tmp, hash, list) { - found = __nf_conntrack_find(&init_net, &conn->tuple); + found = nf_conntrack_find_get(&init_net, &conn->tuple); found_ct = NULL; if (found != NULL) @@ -151,6 +151,7 @@ static int count_them(struct xt_connlimit_data *data, * we do not care about connections which are * closed already -> ditch it */ + nf_ct_put(found_ct); list_del(&conn->list); kfree(conn); continue; @@ -160,6 +161,7 @@ static int count_them(struct xt_connlimit_data *data, match->family)) /* same source network -> be counted! */ ++matches; + nf_ct_put(found_ct); } rcu_read_unlock(); ^ permalink raw reply related [flat|nested] 44+ messages in thread
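The hlist_nulls conversion above hinges on a detail that is easy to miss in the diff: chains no longer end in NULL but in a tagged "nulls" value that encodes the bucket number, which is what INIT_HLIST_NULLS_HEAD(&hash[i], i) stores into each slot. A minimal user-space model of that encoding follows, as a sketch only; the helper names mirror the kernel's is_a_nulls()/get_nulls_value(), but nothing here is copied from kernel headers:

#include <stdio.h>

struct nulls_node {
	struct nulls_node *next;	/* real pointer, or a tagged "nulls" value */
};

/* Terminate a chain with its bucket id; odd values can never be valid
 * pointers to aligned nodes, so readers can tell the two apart. */
static struct nulls_node *make_nulls(unsigned long bucket)
{
	return (struct nulls_node *)((bucket << 1) | 1UL);
}

static int is_a_nulls(const struct nulls_node *ptr)
{
	return ((unsigned long)ptr & 1UL) != 0;
}

static unsigned long get_nulls_value(const struct nulls_node *ptr)
{
	return (unsigned long)ptr >> 1;
}

int main(void)
{
	struct nulls_node a, b;
	struct nulls_node *n;

	b.next = make_nulls(7);		/* bucket 7 holds: a -> b -> nulls(7) */
	a.next = &b;

	for (n = &a; !is_a_nulls(n); n = n->next)
		printf("visiting node %p\n", (void *)n);
	printf("end of chain, bucket id %lu\n", get_nulls_value(n));
	return 0;
}

Because the terminator carries data instead of being NULL, a lockless reader that is dragged onto another chain by a concurrent delete-and-reinsert can notice: the nulls value at the end of its walk will not match the bucket it started from. That is the property the lookup and dump code in the rest of this thread relies on.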
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 17:53 ` [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs Eric Dumazet @ 2009-03-25 18:05 ` Patrick McHardy 2009-03-25 18:06 ` Patrick McHardy 2009-03-25 18:15 ` Eric Dumazet 0 siblings, 2 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 18:05 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Eric Dumazet wrote: > Hi Patrick > > Here is the patch I had the time to test this time... > No problem so far on my machine. > I did a UDP flood stress. Thanks Eric. Most parts look good, just two questions below: > diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > index 6ba5c55..fcbcf62 100644 > --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > @@ -25,30 +25,30 @@ struct ct_iter_state { > unsigned int bucket; > }; > > -static struct hlist_node *ct_get_first(struct seq_file *seq) > +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) > { > struct net *net = seq_file_net(seq); > struct ct_iter_state *st = seq->private; > - struct hlist_node *n; > + struct hlist_nulls_node *n; > > for (st->bucket = 0; > st->bucket < nf_conntrack_htable_size; > st->bucket++) { > n = rcu_dereference(net->ct.hash[st->bucket].first); > - if (n) > + if (!is_a_nulls(n)) > return n; > } > return NULL; > } Don't we need to make sure the entry is not reused while dumping it? > diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c > index a51bdac..6066144 100644 > --- a/net/netfilter/nf_conntrack_helper.c > +++ b/net/netfilter/nf_conntrack_helper.c > @@ -158,6 +158,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, > struct nf_conntrack_tuple_hash *h; > struct nf_conntrack_expect *exp; > const struct hlist_node *n, *next; > + const struct hlist_nulls_node *nn; > unsigned int i; > > /* Get rid of expectations */ > @@ -174,10 +175,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, > } > > /* Get rid of expecteds, set helpers to NULL. */ > - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) > + hlist_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) > unhelp(h, me); > for (i = 0; i < nf_conntrack_htable_size; i++) { > - hlist_for_each_entry(h, n, &net->ct.hash[i], hnode) > + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) > unhelp(h, me); > } > } > @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) > > nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ > nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, > - &nf_ct_helper_vmalloc); > + &nf_ct_helper_vmalloc, 0); This should be "1" I think since it wants a hlist_nulls hash. ^ permalink raw reply [flat|nested] 44+ messages in thread
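On the nulls argument itself: the two head types are each exactly one pointer wide, which is what the BUILD_BUG_ON() in nf_ct_alloc_hashtable() asserts before one allocator hands out buffers for both. Tables created with nulls == 0 (the expectation, helper and NAT-bysource hashes) keep ordinary NULL-terminated chains and, since the pages now come back zeroed via __GFP_ZERO, need no init loop at all; only nulls == 1 tables must stamp every slot with its bucket id. A reduced compile-time sketch of that assumption (struct shapes abbreviated, not copied from the kernel headers):

/* Reduced shapes of both list heads: each is a single pointer. */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

struct hlist_nulls_node { struct hlist_nulls_node *next, **pprev; };
struct hlist_nulls_head { struct hlist_nulls_node *first; };

/* The compile-time check the patch adds before sharing one allocator
 * (negative array size fails the build, like BUILD_BUG_ON). */
typedef char heads_are_same_size[sizeof(struct hlist_head) ==
				 sizeof(struct hlist_nulls_head) ? 1 : -1];

/* nulls == 0: zeroed pages already read as empty NULL-terminated chains.
 * nulls == 1: each slot must be re-terminated with its own bucket id
 * before lockless readers may walk it. */
static void init_table(struct hlist_nulls_head *hash, unsigned int n, int nulls)
{
	unsigned int i;

	if (nulls)
		for (i = 0; i < n; i++)
			hash[i].first = (struct hlist_nulls_node *)
					(((unsigned long)i << 1) | 1UL);
}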
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 18:05 ` Patrick McHardy @ 2009-03-25 18:06 ` Patrick McHardy 2009-03-25 18:15 ` Eric Dumazet 1 sibling, 0 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 18:06 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Patrick McHardy wrote: >> @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) >> >> nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ >> nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, >> - &nf_ct_helper_vmalloc); >> + &nf_ct_helper_vmalloc, 0); > > This should be "1" I think since it wants a hlist_nulls hash. OK I just realized my mistake, please ignore :) ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 18:05 ` Patrick McHardy 2009-03-25 18:06 ` Patrick McHardy @ 2009-03-25 18:15 ` Eric Dumazet 2009-03-25 18:24 ` Patrick McHardy 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 18:15 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Patrick McHardy a écrit : > Eric Dumazet wrote: >> Hi Patrick >> >> Here is the patch I had the time to test this time... >> No problem so far on my machine. >> I did a UDP flood stress. > > Thanks Eric. Most parts look good, just two questions below: > >> diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> index 6ba5c55..fcbcf62 100644 >> --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> @@ -25,30 +25,30 @@ struct ct_iter_state { >> unsigned int bucket; >> }; >> >> -static struct hlist_node *ct_get_first(struct seq_file *seq) >> +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) >> { >> struct net *net = seq_file_net(seq); >> struct ct_iter_state *st = seq->private; >> - struct hlist_node *n; >> + struct hlist_nulls_node *n; >> >> for (st->bucket = 0; >> st->bucket < nf_conntrack_htable_size; >> st->bucket++) { >> n = rcu_dereference(net->ct.hash[st->bucket].first); >> - if (n) >> + if (!is_a_nulls(n)) >> return n; >> } >> return NULL; >> } > > Don't we need to make sure the entry is not reused while dumping > it? > Ah yes, I forgot that for UDP/TCP I had to change locking on this part. Because messing with reference count was crazy... But in UDP/TCP we have different spinlock for each chain, so hold time was small enough. So I guess that with central conntrack lock, we need to take references on entries while dumping them. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 18:15 ` Eric Dumazet @ 2009-03-25 18:24 ` Patrick McHardy 2009-03-25 18:53 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 18:24 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Eric Dumazet wrote: > Patrick McHardy a écrit : >>> +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) >>> { >>> struct net *net = seq_file_net(seq); >>> struct ct_iter_state *st = seq->private; >>> - struct hlist_node *n; >>> + struct hlist_nulls_node *n; >>> >>> for (st->bucket = 0; >>> st->bucket < nf_conntrack_htable_size; >>> st->bucket++) { >>> n = rcu_dereference(net->ct.hash[st->bucket].first); >>> - if (n) >>> + if (!is_a_nulls(n)) >>> return n; >>> } >>> return NULL; >>> } >> Don't we need to make sure the entry is not reused while dumping >> it? >> > > Ah yes, I forgot that for UDP/TCP I had to change locking on this part. > Because messing with reference count was crazy... > But in UDP/TCP we have different spinlock for each chain, so hold time > was small enough. > > So I guess that with central conntrack lock, we need to take references on entries > while dumping them. Yes, I think so too. ^ permalink raw reply [flat|nested] 44+ messages in thread
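The discipline both are converging on here is the usual one for lockless dumps: an entry found under RCU may be freed (and, with SLAB_DESTROY_BY_RCU, even recycled) at any instant, so the dumper must first try to take a reference and must treat a refcount of zero as "already dead, skip it". A small user-space model of that primitive using C11 atomics; atomic_inc_not_zero() is the kernel helper the next version of the patch uses, everything else here is illustrative:

#include <stdatomic.h>
#include <stdbool.h>

struct entry {
	atomic_int use;		/* 0 means dead and possibly being freed */
	/* ... conntrack payload ... */
};

/* Model of atomic_inc_not_zero(): take a reference only if the entry
 * is still alive; never resurrect a refcount that already hit zero. */
static bool get_ref_if_live(struct entry *e)
{
	int old = atomic_load(&e->use);

	while (old != 0)
		if (atomic_compare_exchange_weak(&e->use, &old, old + 1))
			return true;	/* reference taken */
	return false;			/* lost the race: entry is dying */
}

static void put_ref(struct entry *e)
{
	atomic_fetch_sub(&e->use, 1);	/* the real code frees at zero */
}

/* Dump-side usage: pin the entry across the print, skip it otherwise. */
static void dump_entry(struct entry *e)
{
	if (!get_ref_if_live(e))
		return;			/* already dead, nothing to show */
	/* ... print e's fields: it cannot be recycled while we hold the ref ... */
	put_ref(e);
}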
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 18:24 ` Patrick McHardy @ 2009-03-25 18:53 ` Eric Dumazet 2009-03-25 19:00 ` Patrick McHardy 0 siblings, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 18:53 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Patrick McHardy a écrit : > Eric Dumazet wrote: >> Patrick McHardy a écrit : >>>> +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) >>>> { >>>> struct net *net = seq_file_net(seq); >>>> struct ct_iter_state *st = seq->private; >>>> - struct hlist_node *n; >>>> + struct hlist_nulls_node *n; >>>> >>>> for (st->bucket = 0; >>>> st->bucket < nf_conntrack_htable_size; >>>> st->bucket++) { >>>> n = rcu_dereference(net->ct.hash[st->bucket].first); >>>> - if (n) >>>> + if (!is_a_nulls(n)) >>>> return n; >>>> } >>>> return NULL; >>>> } >>> Don't we need to make sure the entry is not reused while dumping >>> it? >>> >> >> Ah yes, I forgot that for UDP/TCP I had to change locking on this part. >> Because messing with reference count was crazy... >> But in UDP/TCP we have different spinlock for each chain, so hold time >> was small enough. >> >> So I guess that with central conntrack lock, we need to take >> references on entries >> while dumping them. > > Yes, I think so too. Here is take 2 of the patch with proper ref counting on dumping. Thank you [PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP. This permits an easy conversion from call_rcu() based hash lists to a SLAB_DESTROY_BY_RCU one. Avoiding call_rcu() delay at nf_conn freeing time has numerous gains. - First, it doesn't fill RCU queues (up to 10000 elements per cpu). This reduces OOM possibility if queued elements are not taken into account, and reduces latency problems when RCU queue size hits its high limit and triggers emergency mode. - It allows fast reuse of just-freed elements, permitting better use of CPU cache. - We delete rcu_head from "struct nf_conn", shrinking size of this structure by 8 or 16 bytes. This patch only takes care of "struct nf_conn". call_rcu() is still used for less critical conntrack parts, which may be converted later if necessary.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/netfilter/nf_conntrack.h | 14 - include/net/netfilter/nf_conntrack_tuple.h | 6 include/net/netns/conntrack.h | 5 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 57 ++-- net/ipv4/netfilter/nf_nat_core.c | 2 net/netfilter/nf_conntrack_core.c | 123 +++++----- net/netfilter/nf_conntrack_expect.c | 2 net/netfilter/nf_conntrack_helper.c | 7 net/netfilter/nf_conntrack_netlink.c | 20 - net/netfilter/nf_conntrack_standalone.c | 51 ++-- net/netfilter/xt_connlimit.c | 6 11 files changed, 166 insertions(+), 127 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 4dfb793..6c3f964 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -91,8 +91,7 @@ struct nf_conn_help { #include <net/netfilter/ipv4/nf_conntrack_ipv4.h> #include <net/netfilter/ipv6/nf_conntrack_ipv6.h> -struct nf_conn -{ +struct nf_conn { /* Usage count in here is 1 for hash table/destruct timer, 1 per skb, plus 1 for any connection(s) we are `master' for */ struct nf_conntrack ct_general; @@ -126,7 +125,6 @@ struct nf_conn #ifdef CONFIG_NET_NS struct net *ct_net; #endif - struct rcu_head rcu; }; static inline struct nf_conn * @@ -190,9 +188,13 @@ static inline void nf_ct_put(struct nf_conn *ct) extern int nf_ct_l3proto_try_module_get(unsigned short l3proto); extern void nf_ct_l3proto_module_put(unsigned short l3proto); -extern struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced); -extern void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, - unsigned int size); +/* + * Allocate a hashtable of hlist_head (if nulls == 0), + * or hlist_nulls_head (if nulls == 1) + */ +extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls); + +extern void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size); extern struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple); diff --git a/include/net/netfilter/nf_conntrack_tuple.h b/include/net/netfilter/nf_conntrack_tuple.h index f2f6aa7..2628c15 100644 --- a/include/net/netfilter/nf_conntrack_tuple.h +++ b/include/net/netfilter/nf_conntrack_tuple.h @@ -12,6 +12,7 @@ #include <linux/netfilter/x_tables.h> #include <linux/netfilter/nf_conntrack_tuple_common.h> +#include <linux/list_nulls.h> /* A `tuple' is a structure containing the information to uniquely identify a connection. ie. 
if two packets have the same tuple, they @@ -146,9 +147,8 @@ static inline void nf_ct_dump_tuple(const struct nf_conntrack_tuple *t) ((enum ip_conntrack_dir)(h)->tuple.dst.dir) /* Connections have two entries in the hash table: one for each way */ -struct nf_conntrack_tuple_hash -{ - struct hlist_node hnode; +struct nf_conntrack_tuple_hash { + struct hlist_nulls_node hnnode; struct nf_conntrack_tuple tuple; }; diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index f4498a6..9dc5840 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -2,6 +2,7 @@ #define __NETNS_CONNTRACK_H #include <linux/list.h> +#include <linux/list_nulls.h> #include <asm/atomic.h> struct ctl_table_header; @@ -10,9 +11,9 @@ struct nf_conntrack_ecache; struct netns_ct { atomic_t count; unsigned int expect_count; - struct hlist_head *hash; + struct hlist_nulls_head *hash; struct hlist_head *expect_hash; - struct hlist_head unconfirmed; + struct hlist_nulls_head unconfirmed; struct ip_conntrack_stat *stat; #ifdef CONFIG_NF_CONNTRACK_EVENTS struct nf_conntrack_ecache *ecache; diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c index 6ba5c55..0b870b9 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c @@ -25,30 +25,30 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { + while (is_a_nulls(head)) { if (++st->bucket >= nf_conntrack_htable_size) return NULL; head = rcu_dereference(net->ct.hash[st->bucket].first); @@ -56,9 +56,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) @@ -87,69 +87,76 @@ static void ct_seq_stop(struct seq_file *s, void *v) static int ct_seq_show(struct seq_file *s, void *v) { - const struct nf_conntrack_tuple_hash *hash = v; - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); + struct nf_conntrack_tuple_hash *hash = v; + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); const struct nf_conntrack_l3proto *l3proto; const struct nf_conntrack_l4proto *l4proto; + int ret = 0; NF_CT_ASSERT(ct); + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) + return 0; + /* we only want to print DIR_ORIGINAL */ if (NF_CT_DIRECTION(hash)) - return 0; + goto release; if (nf_ct_l3num(ct) != AF_INET) - return 0; + goto release; l3proto = __nf_ct_l3proto_find(nf_ct_l3num(ct)); NF_CT_ASSERT(l3proto); l4proto = 
__nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); NF_CT_ASSERT(l4proto); + ret = -ENOSPC; if (seq_printf(s, "%-8s %u %ld ", l4proto->name, nf_ct_protonum(ct), timer_pending(&ct->timeout) ? (long)(ct->timeout.expires - jiffies)/HZ : 0) != 0) - return -ENOSPC; + goto release; if (l4proto->print_conntrack && l4proto->print_conntrack(s, ct)) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_ORIGINAL)) - return -ENOSPC; + goto release; if (!(test_bit(IPS_SEEN_REPLY_BIT, &ct->status))) if (seq_printf(s, "[UNREPLIED] ")) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_REPLY)) - return -ENOSPC; + goto release; if (test_bit(IPS_ASSURED_BIT, &ct->status)) if (seq_printf(s, "[ASSURED] ")) - return -ENOSPC; + goto release; #ifdef CONFIG_NF_CONNTRACK_MARK if (seq_printf(s, "mark=%u ", ct->mark)) - return -ENOSPC; + goto release; #endif #ifdef CONFIG_NF_CONNTRACK_SECMARK if (seq_printf(s, "secmark=%u ", ct->secmark)) - return -ENOSPC; + goto release; #endif if (seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use))) - return -ENOSPC; - - return 0; + goto release; + ret = 0; +release: + nf_ct_put(ct); + return ret; } static const struct seq_operations ct_seq_ops = { diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c index a65cf69..fe65187 100644 --- a/net/ipv4/netfilter/nf_nat_core.c +++ b/net/ipv4/netfilter/nf_nat_core.c @@ -679,7 +679,7 @@ nfnetlink_parse_nat_setup(struct nf_conn *ct, static int __net_init nf_nat_net_init(struct net *net) { net->ipv4.nat_bysource = nf_ct_alloc_hashtable(&nf_nat_htable_size, - &net->ipv4.nat_vmalloced); + &net->ipv4.nat_vmalloced, 0); if (!net->ipv4.nat_bysource) return -ENOMEM; return 0; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 54e983f..c55bbdc 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -29,6 +29,7 @@ #include <linux/netdevice.h> #include <linux/socket.h> #include <linux/mm.h> +#include <linux/rculist_nulls.h> #include <net/netfilter/nf_conntrack.h> #include <net/netfilter/nf_conntrack_l3proto.h> @@ -163,8 +164,8 @@ static void clean_from_lists(struct nf_conn *ct) { pr_debug("clean_from_lists(%p)\n", ct); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode); /* Destroy all pending expectations */ nf_ct_remove_expectations(ct); @@ -204,8 +205,8 @@ destroy_conntrack(struct nf_conntrack *nfct) /* We overload first tuple to link into unconfirmed list. 
*/ if (!nf_ct_is_confirmed(ct)) { - BUG_ON(hlist_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode)); - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); } NF_CT_STAT_INC(net, delete); @@ -242,18 +243,26 @@ static void death_by_timeout(unsigned long ul_conntrack) nf_ct_put(ct); } +/* + * Warning : + * - Caller must take a reference on returned object + * and recheck nf_ct_tuple_equal(tuple, &h->tuple) + * OR + * - Caller must lock nf_conntrack_lock before calling this function + */ struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) { struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we normally need to disable them * at least once for the stats anyway. */ local_bh_disable(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { +begin: + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); local_bh_enable(); @@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) } NF_CT_STAT_INC(net, searched); } + /* + * if the nulls value we got at the end of this lookup is + * not the expected one, we must restart lookup. + * We probably met an item that was moved to another chain. + */ + if (get_nulls_value(n) != hash) + goto begin; local_bh_enable(); return NULL; @@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple) struct nf_conn *ct; rcu_read_lock(); +begin: h = __nf_conntrack_find(net, tuple); if (h) { ct = nf_ct_tuplehash_to_ctrack(h); if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) h = NULL; + else { + if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) { + nf_ct_put(ct); + goto begin; + } + } } rcu_read_unlock(); @@ -293,9 +316,9 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct, { struct net *net = nf_ct_net(ct); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.hash[hash]); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode, &net->ct.hash[repl_hash]); } @@ -318,7 +341,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct nf_conn_help *help; - struct hlist_node *n; + struct hlist_nulls_node *n; enum ip_conntrack_info ctinfo; struct net *net; @@ -350,17 +373,17 @@ __nf_conntrack_confirm(struct sk_buff *skb) /* See if there's one in the list already, including reverse: NAT could have grabbed it without realizing, since we're not in the hash. If there is, we lost race. 
*/ - hlist_for_each_entry(h, n, &net->ct.hash[hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, &h->tuple)) goto out; - hlist_for_each_entry(h, n, &net->ct.hash[repl_hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, &h->tuple)) goto out; /* Remove from unconfirmed list */ - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); __nf_conntrack_hash_insert(ct, hash, repl_hash); /* Timer relative to confirmation time, not original @@ -399,14 +422,14 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, { struct net *net = nf_ct_net(ignored_conntrack); struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we need to disable them at * least once for the stats anyway. */ rcu_read_lock_bh(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack && nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); @@ -430,14 +453,14 @@ static noinline int early_drop(struct net *net, unsigned int hash) /* Use oldest entry, which is roughly LRU */ struct nf_conntrack_tuple_hash *h; struct nf_conn *ct = NULL, *tmp; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int i, cnt = 0; int dropped = 0; rcu_read_lock(); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], + hnnode) { tmp = nf_ct_tuplehash_to_ctrack(h); if (!test_bit(IPS_ASSURED_BIT, &tmp->status)) ct = tmp; @@ -508,27 +531,19 @@ struct nf_conn *nf_conntrack_alloc(struct net *net, #ifdef CONFIG_NET_NS ct->ct_net = net; #endif - INIT_RCU_HEAD(&ct->rcu); return ct; } EXPORT_SYMBOL_GPL(nf_conntrack_alloc); -static void nf_conntrack_free_rcu(struct rcu_head *head) -{ - struct nf_conn *ct = container_of(head, struct nf_conn, rcu); - - nf_ct_ext_free(ct); - kmem_cache_free(nf_conntrack_cachep, ct); -} - void nf_conntrack_free(struct nf_conn *ct) { struct net *net = nf_ct_net(ct); nf_ct_ext_destroy(ct); atomic_dec(&net->ct.count); - call_rcu(&ct->rcu, nf_conntrack_free_rcu); + nf_ct_ext_free(ct); + kmem_cache_free(nf_conntrack_cachep, ct); } EXPORT_SYMBOL_GPL(nf_conntrack_free); @@ -594,7 +609,7 @@ init_conntrack(struct net *net, } /* Overload tuple linked list to put us in unconfirmed list. 
*/ - hlist_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.unconfirmed); spin_unlock_bh(&nf_conntrack_lock); @@ -934,17 +949,17 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), { struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; - struct hlist_node *n; + struct hlist_nulls_node *n; spin_lock_bh(&nf_conntrack_lock); for (; *bucket < nf_conntrack_htable_size; (*bucket)++) { - hlist_for_each_entry(h, n, &net->ct.hash[*bucket], hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) goto found; } } - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) set_bit(IPS_DYING_BIT, &ct->status); @@ -992,7 +1007,7 @@ static int kill_all(struct nf_conn *i, void *data) return 1; } -void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, unsigned int size) +void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size) { if (vmalloced) vfree(hash); @@ -1060,26 +1075,28 @@ void nf_conntrack_cleanup(struct net *net) } } -struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced) +void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls) { - struct hlist_head *hash; - unsigned int size, i; + struct hlist_nulls_head *hash; + unsigned int nr_slots, i; + size_t sz; *vmalloced = 0; - size = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_head)); - hash = (void*)__get_free_pages(GFP_KERNEL|__GFP_NOWARN, - get_order(sizeof(struct hlist_head) - * size)); + BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct hlist_head)); + nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_nulls_head)); + sz = nr_slots * sizeof(struct hlist_nulls_head); + hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, + get_order(sz)); if (!hash) { *vmalloced = 1; printk(KERN_WARNING "nf_conntrack: falling back to vmalloc.\n"); - hash = vmalloc(sizeof(struct hlist_head) * size); + hash = __vmalloc(sz, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL); } - if (hash) - for (i = 0; i < size; i++) - INIT_HLIST_HEAD(&hash[i]); + if (hash && nulls) + for (i = 0; i < nr_slots; i++) + INIT_HLIST_NULLS_HEAD(&hash[i], i); return hash; } @@ -1090,7 +1107,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) int i, bucket, vmalloced, old_vmalloced; unsigned int hashsize, old_size; int rnd; - struct hlist_head *hash, *old_hash; + struct hlist_nulls_head *hash, *old_hash; struct nf_conntrack_tuple_hash *h; /* On boot, we can set this without any fancy locking. 
*/ @@ -1101,7 +1118,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) if (!hashsize) return -EINVAL; - hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced); + hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced, 1); if (!hash) return -ENOMEM; @@ -1116,12 +1133,12 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) */ spin_lock_bh(&nf_conntrack_lock); for (i = 0; i < nf_conntrack_htable_size; i++) { - while (!hlist_empty(&init_net.ct.hash[i])) { - h = hlist_entry(init_net.ct.hash[i].first, - struct nf_conntrack_tuple_hash, hnode); - hlist_del_rcu(&h->hnode); + while (!hlist_nulls_empty(&init_net.ct.hash[i])) { + h = hlist_nulls_entry(init_net.ct.hash[i].first, + struct nf_conntrack_tuple_hash, hnnode); + hlist_nulls_del_rcu(&h->hnnode); bucket = __hash_conntrack(&h->tuple, hashsize, rnd); - hlist_add_head_rcu(&h->hnode, &hash[bucket]); + hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]); } } old_size = nf_conntrack_htable_size; @@ -1172,7 +1189,7 @@ static int nf_conntrack_init_init_net(void) nf_conntrack_cachep = kmem_cache_create("nf_conntrack", sizeof(struct nf_conn), - 0, 0, NULL); + 0, SLAB_DESTROY_BY_RCU, NULL); if (!nf_conntrack_cachep) { printk(KERN_ERR "Unable to create nf_conn slab cache\n"); ret = -ENOMEM; @@ -1202,7 +1219,7 @@ static int nf_conntrack_init_net(struct net *net) int ret; atomic_set(&net->ct.count, 0); - INIT_HLIST_HEAD(&net->ct.unconfirmed); + INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, 0); net->ct.stat = alloc_percpu(struct ip_conntrack_stat); if (!net->ct.stat) { ret = -ENOMEM; @@ -1212,7 +1229,7 @@ static int nf_conntrack_init_net(struct net *net) if (ret < 0) goto err_ecache; net->ct.hash = nf_ct_alloc_hashtable(&nf_conntrack_htable_size, - &net->ct.hash_vmalloc); + &net->ct.hash_vmalloc, 1); if (!net->ct.hash) { ret = -ENOMEM; printk(KERN_ERR "Unable to create nf_conntrack_hash\n"); diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index 357ba39..3940f99 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -604,7 +604,7 @@ int nf_conntrack_expect_init(struct net *net) net->ct.expect_count = 0; net->ct.expect_hash = nf_ct_alloc_hashtable(&nf_ct_expect_hsize, - &net->ct.expect_vmalloc); + &net->ct.expect_vmalloc, 0); if (net->ct.expect_hash == NULL) goto err1; diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c index a51bdac..6066144 100644 --- a/net/netfilter/nf_conntrack_helper.c +++ b/net/netfilter/nf_conntrack_helper.c @@ -158,6 +158,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, struct nf_conntrack_tuple_hash *h; struct nf_conntrack_expect *exp; const struct hlist_node *n, *next; + const struct hlist_nulls_node *nn; unsigned int i; /* Get rid of expectations */ @@ -174,10 +175,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, } /* Get rid of expecteds, set helpers to NULL. 
*/ - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) + hlist_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) unhelp(h, me); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry(h, n, &net->ct.hash[i], hnode) + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) unhelp(h, me); } } @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, - &nf_ct_helper_vmalloc); + &nf_ct_helper_vmalloc, 0); if (!nf_ct_helper_hash) return -ENOMEM; diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 1b75c9e..349bbef 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -19,6 +19,7 @@ #include <linux/module.h> #include <linux/kernel.h> #include <linux/rculist.h> +#include <linux/rculist_nulls.h> #include <linux/types.h> #include <linux/timer.h> #include <linux/skbuff.h> @@ -536,7 +537,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) { struct nf_conn *ct, *last; struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = NLMSG_DATA(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; @@ -544,27 +545,27 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) last = (struct nf_conn *)cb->args[1]; for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) { restart: - hlist_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], + hnnode) { if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) continue; ct = nf_ct_tuplehash_to_ctrack(h); + if (!atomic_inc_not_zero(&ct->ct_general.use)) + continue; /* Dump entries of a given L3 protocol number. * If it is not specified, ie. l3proto == 0, * then dump everything. 
*/ if (l3proto && nf_ct_l3num(ct) != l3proto) - continue; + goto releasect; if (cb->args[1]) { if (ct != last) - continue; + goto releasect; cb->args[1] = 0; } if (ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).pid, cb->nlh->nlmsg_seq, IPCTNL_MSG_CT_NEW, 1, ct) < 0) { - if (!atomic_inc_not_zero(&ct->ct_general.use)) - continue; cb->args[1] = (unsigned long)ct; goto out; } @@ -577,6 +578,8 @@ restart: if (acct) memset(acct, 0, sizeof(struct nf_conn_counter[IP_CT_DIR_MAX])); } +releasect: + nf_ct_put(ct); } if (cb->args[1]) { cb->args[1] = 0; @@ -1242,13 +1245,12 @@ ctnetlink_create_conntrack(struct nlattr *cda[], if (err < 0) goto err2; - master_h = __nf_conntrack_find(&init_net, &master); + master_h = nf_conntrack_find_get(&init_net, &master); if (master_h == NULL) { err = -ENOENT; goto err2; } master_ct = nf_ct_tuplehash_to_ctrack(master_h); - nf_conntrack_get(&master_ct->ct_general); __set_bit(IPS_EXPECTED_BIT, &ct->status); ct->master = master_ct; } diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 4da54b0..3f10a1e 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -44,30 +44,30 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { + while (is_a_nulls(head)) { if (++st->bucket >= nf_conntrack_htable_size) return NULL; head = rcu_dereference(net->ct.hash[st->bucket].first); @@ -75,9 +75,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) @@ -107,67 +107,74 @@ static void ct_seq_stop(struct seq_file *s, void *v) /* return 0 on success, 1 in case of error */ static int ct_seq_show(struct seq_file *s, void *v) { - const struct nf_conntrack_tuple_hash *hash = v; - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); + struct nf_conntrack_tuple_hash *hash = v; + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); const struct nf_conntrack_l3proto *l3proto; const struct nf_conntrack_l4proto *l4proto; + int ret = 0; NF_CT_ASSERT(ct); + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) + return 0; /* we only want to print DIR_ORIGINAL */ if (NF_CT_DIRECTION(hash)) - return 0; + goto release; l3proto = __nf_ct_l3proto_find(nf_ct_l3num(ct)); NF_CT_ASSERT(l3proto); l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); NF_CT_ASSERT(l4proto); + ret = -ENOSPC; if (seq_printf(s, "%-8s %u %-8s %u %ld ", l3proto->name, nf_ct_l3num(ct), l4proto->name, nf_ct_protonum(ct), 
timer_pending(&ct->timeout) ? (long)(ct->timeout.expires - jiffies)/HZ : 0) != 0) - return -ENOSPC; + goto release; if (l4proto->print_conntrack && l4proto->print_conntrack(s, ct)) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_ORIGINAL)) - return -ENOSPC; + goto release; if (!(test_bit(IPS_SEEN_REPLY_BIT, &ct->status))) if (seq_printf(s, "[UNREPLIED] ")) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_REPLY)) - return -ENOSPC; + goto release; if (test_bit(IPS_ASSURED_BIT, &ct->status)) if (seq_printf(s, "[ASSURED] ")) - return -ENOSPC; + goto release; #if defined(CONFIG_NF_CONNTRACK_MARK) if (seq_printf(s, "mark=%u ", ct->mark)) - return -ENOSPC; + goto release; #endif #ifdef CONFIG_NF_CONNTRACK_SECMARK if (seq_printf(s, "secmark=%u ", ct->secmark)) - return -ENOSPC; + goto release; #endif if (seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use))) - return -ENOSPC; + goto release; + ret = 0; +release: + nf_ct_put(ct); - return 0; + return ret; } diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c index 7f404cc..6809809 100644 --- a/net/netfilter/xt_connlimit.c +++ b/net/netfilter/xt_connlimit.c @@ -108,7 +108,7 @@ static int count_them(struct xt_connlimit_data *data, const struct nf_conntrack_tuple_hash *found; struct xt_connlimit_conn *conn; struct xt_connlimit_conn *tmp; - const struct nf_conn *found_ct; + struct nf_conn *found_ct; struct list_head *hash; bool addit = true; int matches = 0; @@ -123,7 +123,7 @@ static int count_them(struct xt_connlimit_data *data, /* check the saved connections */ list_for_each_entry_safe(conn, tmp, hash, list) { - found = __nf_conntrack_find(&init_net, &conn->tuple); + found = nf_conntrack_find_get(&init_net, &conn->tuple); found_ct = NULL; if (found != NULL) @@ -151,6 +151,7 @@ static int count_them(struct xt_connlimit_data *data, * we do not care about connections which are * closed already -> ditch it */ + nf_ct_put(found_ct); list_del(&conn->list); kfree(conn); continue; @@ -160,6 +161,7 @@ static int count_them(struct xt_connlimit_data *data, * same source network -> be counted! */ ++matches; + nf_ct_put(found_ct); } rcu_read_unlock(); ^ permalink raw reply related [flat|nested] 44+ messages in thread
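The subtle point of take 2 sits in nf_conntrack_find_get(): SLAB_DESTROY_BY_RCU only guarantees that the memory stays a valid struct nf_conn while readers run, not that it still describes the same connection, so a successful refcount grab has to be followed by re-checking the tuple, and the lookup restarted on mismatch. A self-contained user-space sketch of that pattern; a flat array stands in for the hash chain, and all names here are illustrative, not the kernel API:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct tuple { unsigned int src, dst; };

struct conn {
	atomic_int use;		/* refcount, 0 == dead */
	struct tuple tuple;	/* key; may change when the slot is recycled */
};

static bool tuple_equal(const struct tuple *a, const struct tuple *b)
{
	return a->src == b->src && a->dst == b->dst;
}

/* same "inc only if not zero" primitive as in the earlier sketch */
static bool get_ref_if_live(struct conn *c)
{
	int old = atomic_load(&c->use);

	while (old != 0)
		if (atomic_compare_exchange_weak(&c->use, &old, old + 1))
			return true;
	return false;
}

static void put_ref(struct conn *c)
{
	atomic_fetch_sub(&c->use, 1);
}

/* Lookup discipline: match, pin, then re-check the key, because the
 * slot may have been freed and recycled for a different connection
 * between the match and the refcount increment. */
static struct conn *find_get(struct conn *table, size_t n, const struct tuple *t)
{
	size_t i;
begin:
	for (i = 0; i < n; i++) {
		struct conn *c = &table[i];

		if (!tuple_equal(&c->tuple, t))
			continue;
		if (!get_ref_if_live(c))
			return NULL;	/* matching entry is being destroyed */
		if (!tuple_equal(&c->tuple, t)) {
			put_ref(c);	/* recycled under us: restart the walk */
			goto begin;
		}
		return c;		/* caller now owns a reference */
	}
	return NULL;
}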
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 18:53 ` Eric Dumazet @ 2009-03-25 19:00 ` Patrick McHardy 2009-03-25 19:17 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 19:00 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Eric Dumazet wrote: > Here is take 2 of the patch with proper ref counting on dumping. Thanks, one final question about the seq-file handling: > diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > index 6ba5c55..0b870b9 100644 > --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > @@ -25,30 +25,30 @@ struct ct_iter_state { > unsigned int bucket; > }; > > -static struct hlist_node *ct_get_first(struct seq_file *seq) > +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) > { > struct net *net = seq_file_net(seq); > struct ct_iter_state *st = seq->private; > - struct hlist_node *n; > + struct hlist_nulls_node *n; > > for (st->bucket = 0; > st->bucket < nf_conntrack_htable_size; > st->bucket++) { > n = rcu_dereference(net->ct.hash[st->bucket].first); > - if (n) > + if (!is_a_nulls(n)) > return n; > } > return NULL; > } > > -static struct hlist_node *ct_get_next(struct seq_file *seq, > - struct hlist_node *head) > +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, > + struct hlist_nulls_node *head) > { > struct net *net = seq_file_net(seq); > struct ct_iter_state *st = seq->private; > > head = rcu_dereference(head->next); > - while (head == NULL) { > + while (is_a_nulls(head)) { > if (++st->bucket >= nf_conntrack_htable_size) > return NULL; > head = rcu_dereference(net->ct.hash[st->bucket].first); > @@ -56,9 +56,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, > return head; > } > > -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) > +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) > { > - struct hlist_node *head = ct_get_first(seq); > + struct hlist_nulls_node *head = ct_get_first(seq); > > if (head) > while (pos && (head = ct_get_next(seq, head))) > @@ -87,69 +87,76 @@ static void ct_seq_stop(struct seq_file *s, void *v) > > static int ct_seq_show(struct seq_file *s, void *v) > { > - const struct nf_conntrack_tuple_hash *hash = v; > - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); > + struct nf_conntrack_tuple_hash *hash = v; > + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); > const struct nf_conntrack_l3proto *l3proto; > const struct nf_conntrack_l4proto *l4proto; > + int ret = 0; > > NF_CT_ASSERT(ct); > + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) > + return 0; Can we assume the next pointer still points to the next entry in the same chain after the refcount dropped to zero? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 19:00 ` Patrick McHardy @ 2009-03-25 19:17 ` Eric Dumazet 2009-03-25 19:41 ` Patrick McHardy 0 siblings, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 19:17 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Patrick McHardy a écrit : > Eric Dumazet wrote: >> Here is take 2 of the patch with proper ref counting on dumping. > > Thanks, one final question about the seq-file handling: > >> diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> index 6ba5c55..0b870b9 100644 >> --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> @@ -25,30 +25,30 @@ struct ct_iter_state { >> unsigned int bucket; >> }; >> >> -static struct hlist_node *ct_get_first(struct seq_file *seq) >> +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) >> { >> struct net *net = seq_file_net(seq); >> struct ct_iter_state *st = seq->private; >> - struct hlist_node *n; >> + struct hlist_nulls_node *n; >> >> for (st->bucket = 0; >> st->bucket < nf_conntrack_htable_size; >> st->bucket++) { >> n = rcu_dereference(net->ct.hash[st->bucket].first); >> - if (n) >> + if (!is_a_nulls(n)) >> return n; >> } >> return NULL; >> } >> >> -static struct hlist_node *ct_get_next(struct seq_file *seq, >> - struct hlist_node *head) >> +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, >> + struct hlist_nulls_node *head) >> { >> struct net *net = seq_file_net(seq); >> struct ct_iter_state *st = seq->private; >> >> head = rcu_dereference(head->next); >> - while (head == NULL) { >> + while (is_a_nulls(head)) { >> if (++st->bucket >= nf_conntrack_htable_size) >> return NULL; >> head = rcu_dereference(net->ct.hash[st->bucket].first); >> @@ -56,9 +56,9 @@ static struct hlist_node *ct_get_next(struct >> seq_file *seq, >> return head; >> } >> >> -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) >> +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, >> loff_t pos) >> { >> - struct hlist_node *head = ct_get_first(seq); >> + struct hlist_nulls_node *head = ct_get_first(seq); >> >> if (head) >> while (pos && (head = ct_get_next(seq, head))) >> @@ -87,69 +87,76 @@ static void ct_seq_stop(struct seq_file *s, void *v) >> >> static int ct_seq_show(struct seq_file *s, void *v) >> { >> - const struct nf_conntrack_tuple_hash *hash = v; >> - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); >> + struct nf_conntrack_tuple_hash *hash = v; >> + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); >> const struct nf_conntrack_l3proto *l3proto; >> const struct nf_conntrack_l4proto *l4proto; >> + int ret = 0; >> >> NF_CT_ASSERT(ct); >> + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) >> + return 0; > > Can we assume the next pointer still points to the next entry > in the same chain after the refcount dropped to zero? > > > We are looking at chain N. If we cannot atomic_inc() refcount, we got some deleted entry. If we could atomic_inc, we can meet an entry that just moved to another chain X. When hitting its end, we continue the search to the N+1 chain so we only skip the end of previous chain (N). We can 'forget' some entries, we can print several times one given entry.
We could solve this by : 1) Checking hash value : if not one expected -> Going back to head of chain N, (potentially re-printing already handled entries) So it is not a *perfect* solution. 2) Use a locking to forbid writers (as done in UDP/TCP), but it is expensive and won't solve other problem : We won't avoid emitting same entry several times anyway (this is a flaw of current seq_file handling, since we 'count' entries to be skipped, and this is wrong if some entries were deleted or inserted meanwhile) We have same problem on /proc/net/udp & /proc/net/tcp, I am not sure we should care... Also, current resizing code can give to a /proc/net/ip_conntrack reader a problem, since hash table can switch while it's doing its dumping : many entries might be lost or regiven... ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 19:17 ` Eric Dumazet @ 2009-03-25 19:41 ` Patrick McHardy 2009-03-25 19:58 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 19:41 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Eric Dumazet wrote: > Patrick McHardy a écrit : >>> NF_CT_ASSERT(ct); >>> + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) >>> + return 0; >> Can we assume the next pointer still points to the next entry >> in the same chain after the refcount dropped to zero? >> > > We are looking at chain N. > If we cannot atomic_inc() refcount, we got some deleted entry. > If we could atomic_inc, we can meet an entry that just moved to another chain X. > > When hitting its end, we continue the search to the N+1 chain so we only > skip the end of previous chain (N). We can 'forget' some entries, we can print > several times one given entry. > > > We could solve this by : > > 1) Checking hash value : if not one expected -> > Going back to head of chain N, (potentially re-printing already handled entries) > So it is not a *perfect* solution. > > 2) Use a locking to forbid writers (as done in UDP/TCP), but it is expensive and > won't solve other problem : > > We won't avoid emitting same entry several times anyway (this is a flaw of > current seq_file handling, since we 'count' entries to be skipped, and this is > wrong if some entries were deleted or inserted meanwhile) > > We have same problem on /proc/net/udp & /proc/net/tcp, I am not sure we should care... I think double entries are not a problem, as you say, there are already other cases where this can happen. But I think we should try our best that every entry present at the start and still present at the end of a dump is also contained in the dump, otherwise the guarantees seem too weak to still be useful. Your first proposal would do exactly that, right? > Also, current resizing code can give to a /proc/net/ip_conntrack reader a problem, since > hash table can switch while it's doing its dumping : many entries might be lost or regiven... That's true. But it's a very rare operation, so I think it's mainly a documentation issue. ^ permalink raw reply [flat|nested] 44+ messages in thread
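The "first proposal" Patrick is endorsing is cheap precisely because of the nulls encoding: every chain already ends in a terminator naming its bucket, so a dump that runs off the end of a foreign chain can detect the mismatch and rescan the bucket it was supposed to be walking, which is what take 3 below does with get_nulls_value(). A sketch of such an iterator step, reusing the tagged-pointer model from the earlier example (struct and function names are illustrative, not the kernel API):

struct nulls_node {
	struct nulls_node *next;	/* real pointer, or tagged bucket id */
};

struct table {
	struct nulls_node **hash;	/* hash[i] chain ends with nulls(i) */
	unsigned int size;
};

struct iter {
	unsigned int bucket;
};

static int is_a_nulls(const struct nulls_node *ptr)
{
	return ((unsigned long)ptr & 1UL) != 0;
}

static unsigned long get_nulls_value(const struct nulls_node *ptr)
{
	return (unsigned long)ptr >> 1;
}

/* Advance to the next entry. If the chain end names our bucket, the
 * walk was complete and we move on; if it names another bucket, the
 * node we followed was recycled onto a foreign chain, so rescan the
 * bucket we are supposed to be in. (Kernel code reads head->next and
 * hash[] through rcu_dereference().) */
static struct nulls_node *iter_next(const struct table *tb, struct iter *it,
				    struct nulls_node *head)
{
	head = head->next;
	while (is_a_nulls(head)) {
		if (get_nulls_value(head) == it->bucket) {
			if (++it->bucket >= tb->size)
				return NULL;
		}
		head = tb->hash[it->bucket];
	}
	return head;
}

As Eric notes, this trades completeness for possible duplicates: entries already printed may be printed again after a rescan, but no entry that stays in the table for the whole dump is skipped.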
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 19:41 ` Patrick McHardy @ 2009-03-25 19:58 ` Eric Dumazet 2009-03-25 20:10 ` Patrick McHardy 0 siblings, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 19:58 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Patrick McHardy a écrit : > Eric Dumazet wrote: >> Patrick McHardy a écrit : >>>> NF_CT_ASSERT(ct); >>>> + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) >>>> + return 0; >>> Can we assume the next pointer still points to the next entry >>> in the same chain after the refcount dropped to zero? >>> >> >> We are looking at chain N. >> If we cannot atomic_inc() refcount, we got some deleted entry. >> If we could atomic_inc, we can meet an entry that just moved to >> another chain X. >> >> When hitting its end, we continue the search to the N+1 chain so we >> only skip the end of previous chain (N). We can 'forget' some entries, >> we can print >> several times one given entry. >> >> >> We could solve this by : >> >> 1) Checking hash value : if not one expected -> Going back to head >> of chain N, (potentially re-printing already handled entries) >> So it is not a *perfect* solution. >> >> 2) Use a locking to forbid writers (as done in UDP/TCP), but it is >> expensive and >> won't solve other problem : >> >> We won't avoid emitting same entry several times anyway (this is a flaw >> of current seq_file handling, since we 'count' entries to be skipped, >> and this is >> wrong if some entries were deleted or inserted meanwhile) >> >> We have same problem on /proc/net/udp & /proc/net/tcp, I am not sure >> we should care... > > I think double entries are not a problem, as you say, there > are already other cases where this can happen. But I think we > should try our best that every entry present at the start and > still present at the end of a dump is also contained in the > dump, otherwise the guarantees seem too weak to still be useful. > Your first proposal would do exactly that, right? If your concern is to not forget entries, and we are allowed to print some entries several times, then we can just check the final "nulls" value, and if we find a different value than expected for chain N, go back to the beginning of chain N. No need to check hash value (this could help avoid printing the same entry several times, we don't care that much) + while (is_a_nulls(head)) { + if (likely(get_nulls_value(head) == st->bucket)) { + if (++st->bucket >= nf_conntrack_htable_size) + return NULL; + } Thank you [PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP. This permits an easy conversion from call_rcu() based hash lists to a SLAB_DESTROY_BY_RCU one. Avoiding call_rcu() delay at nf_conn freeing time has numerous gains. - First, it doesn't fill RCU queues (up to 10000 elements per cpu). This reduces OOM possibility if queued elements are not taken into account, and reduces latency problems when RCU queue size hits its high limit and triggers emergency mode. - It allows fast reuse of just-freed elements, permitting better use of CPU cache. - We delete rcu_head from "struct nf_conn", shrinking size of this structure by 8 or 16 bytes. This patch only takes care of "struct nf_conn". call_rcu() is still used for less critical conntrack parts, which may be converted later if necessary.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/netfilter/nf_conntrack.h | 14 - include/net/netfilter/nf_conntrack_tuple.h | 6 include/net/netns/conntrack.h | 5 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 63 ++--- net/ipv4/netfilter/nf_nat_core.c | 2 net/netfilter/nf_conntrack_core.c | 123 +++++----- net/netfilter/nf_conntrack_expect.c | 2 net/netfilter/nf_conntrack_helper.c | 7 net/netfilter/nf_conntrack_netlink.c | 20 - net/netfilter/nf_conntrack_standalone.c | 57 ++-- net/netfilter/xt_connlimit.c | 6 11 files changed, 174 insertions(+), 131 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 4dfb793..6c3f964 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -91,8 +91,7 @@ struct nf_conn_help { #include <net/netfilter/ipv4/nf_conntrack_ipv4.h> #include <net/netfilter/ipv6/nf_conntrack_ipv6.h> -struct nf_conn -{ +struct nf_conn { /* Usage count in here is 1 for hash table/destruct timer, 1 per skb, plus 1 for any connection(s) we are `master' for */ struct nf_conntrack ct_general; @@ -126,7 +125,6 @@ struct nf_conn #ifdef CONFIG_NET_NS struct net *ct_net; #endif - struct rcu_head rcu; }; static inline struct nf_conn * @@ -190,9 +188,13 @@ static inline void nf_ct_put(struct nf_conn *ct) extern int nf_ct_l3proto_try_module_get(unsigned short l3proto); extern void nf_ct_l3proto_module_put(unsigned short l3proto); -extern struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced); -extern void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, - unsigned int size); +/* + * Allocate a hashtable of hlist_head (if nulls == 0), + * or hlist_nulls_head (if nulls == 1) + */ +extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls); + +extern void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size); extern struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple); diff --git a/include/net/netfilter/nf_conntrack_tuple.h b/include/net/netfilter/nf_conntrack_tuple.h index f2f6aa7..2628c15 100644 --- a/include/net/netfilter/nf_conntrack_tuple.h +++ b/include/net/netfilter/nf_conntrack_tuple.h @@ -12,6 +12,7 @@ #include <linux/netfilter/x_tables.h> #include <linux/netfilter/nf_conntrack_tuple_common.h> +#include <linux/list_nulls.h> /* A `tuple' is a structure containing the information to uniquely identify a connection. ie. 
if two packets have the same tuple, they @@ -146,9 +147,8 @@ static inline void nf_ct_dump_tuple(const struct nf_conntrack_tuple *t) ((enum ip_conntrack_dir)(h)->tuple.dst.dir) /* Connections have two entries in the hash table: one for each way */ -struct nf_conntrack_tuple_hash -{ - struct hlist_node hnode; +struct nf_conntrack_tuple_hash { + struct hlist_nulls_node hnnode; struct nf_conntrack_tuple tuple; }; diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index f4498a6..9dc5840 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -2,6 +2,7 @@ #define __NETNS_CONNTRACK_H #include <linux/list.h> +#include <linux/list_nulls.h> #include <asm/atomic.h> struct ctl_table_header; @@ -10,9 +11,9 @@ struct nf_conntrack_ecache; struct netns_ct { atomic_t count; unsigned int expect_count; - struct hlist_head *hash; + struct hlist_nulls_head *hash; struct hlist_head *expect_hash; - struct hlist_head unconfirmed; + struct hlist_nulls_head unconfirmed; struct ip_conntrack_stat *stat; #ifdef CONFIG_NF_CONNTRACK_EVENTS struct nf_conntrack_ecache *ecache; diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c index 6ba5c55..8668a3d 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c @@ -25,40 +25,42 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { - if (++st->bucket >= nf_conntrack_htable_size) - return NULL; + while (is_a_nulls(head)) { + if (likely(get_nulls_value(head) == st->bucket)) { + if (++st->bucket >= nf_conntrack_htable_size) + return NULL; + } head = rcu_dereference(net->ct.hash[st->bucket].first); } return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) @@ -87,69 +89,76 @@ static void ct_seq_stop(struct seq_file *s, void *v) static int ct_seq_show(struct seq_file *s, void *v) { - const struct nf_conntrack_tuple_hash *hash = v; - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); + struct nf_conntrack_tuple_hash *hash = v; + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); const struct nf_conntrack_l3proto *l3proto; const struct nf_conntrack_l4proto *l4proto; + int ret = 0; NF_CT_ASSERT(ct); + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) + return 0; + /* we only want to print DIR_ORIGINAL */ if (NF_CT_DIRECTION(hash)) - return 0; + goto release; if (nf_ct_l3num(ct) != AF_INET) - return 0; + goto release; l3proto = 
__nf_ct_l3proto_find(nf_ct_l3num(ct)); NF_CT_ASSERT(l3proto); l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); NF_CT_ASSERT(l4proto); + ret = -ENOSPC; if (seq_printf(s, "%-8s %u %ld ", l4proto->name, nf_ct_protonum(ct), timer_pending(&ct->timeout) ? (long)(ct->timeout.expires - jiffies)/HZ : 0) != 0) - return -ENOSPC; + goto release; if (l4proto->print_conntrack && l4proto->print_conntrack(s, ct)) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_ORIGINAL)) - return -ENOSPC; + goto release; if (!(test_bit(IPS_SEEN_REPLY_BIT, &ct->status))) if (seq_printf(s, "[UNREPLIED] ")) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_REPLY)) - return -ENOSPC; + goto release; if (test_bit(IPS_ASSURED_BIT, &ct->status)) if (seq_printf(s, "[ASSURED] ")) - return -ENOSPC; + goto release; #ifdef CONFIG_NF_CONNTRACK_MARK if (seq_printf(s, "mark=%u ", ct->mark)) - return -ENOSPC; + goto release; #endif #ifdef CONFIG_NF_CONNTRACK_SECMARK if (seq_printf(s, "secmark=%u ", ct->secmark)) - return -ENOSPC; + goto release; #endif if (seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use))) - return -ENOSPC; - - return 0; + goto release; + ret = 0; +release: + nf_ct_put(ct); + return ret; } static const struct seq_operations ct_seq_ops = { diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c index a65cf69..fe65187 100644 --- a/net/ipv4/netfilter/nf_nat_core.c +++ b/net/ipv4/netfilter/nf_nat_core.c @@ -679,7 +679,7 @@ nfnetlink_parse_nat_setup(struct nf_conn *ct, static int __net_init nf_nat_net_init(struct net *net) { net->ipv4.nat_bysource = nf_ct_alloc_hashtable(&nf_nat_htable_size, - &net->ipv4.nat_vmalloced); + &net->ipv4.nat_vmalloced, 0); if (!net->ipv4.nat_bysource) return -ENOMEM; return 0; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 54e983f..c55bbdc 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -29,6 +29,7 @@ #include <linux/netdevice.h> #include <linux/socket.h> #include <linux/mm.h> +#include <linux/rculist_nulls.h> #include <net/netfilter/nf_conntrack.h> #include <net/netfilter/nf_conntrack_l3proto.h> @@ -163,8 +164,8 @@ static void clean_from_lists(struct nf_conn *ct) { pr_debug("clean_from_lists(%p)\n", ct); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode); /* Destroy all pending expectations */ nf_ct_remove_expectations(ct); @@ -204,8 +205,8 @@ destroy_conntrack(struct nf_conntrack *nfct) /* We overload first tuple to link into unconfirmed list. 
*/ if (!nf_ct_is_confirmed(ct)) { - BUG_ON(hlist_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode)); - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); } NF_CT_STAT_INC(net, delete); @@ -242,18 +243,26 @@ static void death_by_timeout(unsigned long ul_conntrack) nf_ct_put(ct); } +/* + * Warning : + * - Caller must take a reference on returned object + * and recheck nf_ct_tuple_equal(tuple, &h->tuple) + * OR + * - Caller must lock nf_conntrack_lock before calling this function + */ struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) { struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we normally need to disable them * at least once for the stats anyway. */ local_bh_disable(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { +begin: + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); local_bh_enable(); @@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) } NF_CT_STAT_INC(net, searched); } + /* + * if the nulls value we got at the end of this lookup is + * not the expected one, we must restart lookup. + * We probably met an item that was moved to another chain. + */ + if (get_nulls_value(n) != hash) + goto begin; local_bh_enable(); return NULL; @@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple) struct nf_conn *ct; rcu_read_lock(); +begin: h = __nf_conntrack_find(net, tuple); if (h) { ct = nf_ct_tuplehash_to_ctrack(h); if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) h = NULL; + else { + if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) { + nf_ct_put(ct); + goto begin; + } + } } rcu_read_unlock(); @@ -293,9 +316,9 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct, { struct net *net = nf_ct_net(ct); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.hash[hash]); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode, &net->ct.hash[repl_hash]); } @@ -318,7 +341,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct nf_conn_help *help; - struct hlist_node *n; + struct hlist_nulls_node *n; enum ip_conntrack_info ctinfo; struct net *net; @@ -350,17 +373,17 @@ __nf_conntrack_confirm(struct sk_buff *skb) /* See if there's one in the list already, including reverse: NAT could have grabbed it without realizing, since we're not in the hash. If there is, we lost race. 
*/ - hlist_for_each_entry(h, n, &net->ct.hash[hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, &h->tuple)) goto out; - hlist_for_each_entry(h, n, &net->ct.hash[repl_hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, &h->tuple)) goto out; /* Remove from unconfirmed list */ - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); __nf_conntrack_hash_insert(ct, hash, repl_hash); /* Timer relative to confirmation time, not original @@ -399,14 +422,14 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, { struct net *net = nf_ct_net(ignored_conntrack); struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we need to disable them at * least once for the stats anyway. */ rcu_read_lock_bh(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack && nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); @@ -430,14 +453,14 @@ static noinline int early_drop(struct net *net, unsigned int hash) /* Use oldest entry, which is roughly LRU */ struct nf_conntrack_tuple_hash *h; struct nf_conn *ct = NULL, *tmp; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int i, cnt = 0; int dropped = 0; rcu_read_lock(); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], + hnnode) { tmp = nf_ct_tuplehash_to_ctrack(h); if (!test_bit(IPS_ASSURED_BIT, &tmp->status)) ct = tmp; @@ -508,27 +531,19 @@ struct nf_conn *nf_conntrack_alloc(struct net *net, #ifdef CONFIG_NET_NS ct->ct_net = net; #endif - INIT_RCU_HEAD(&ct->rcu); return ct; } EXPORT_SYMBOL_GPL(nf_conntrack_alloc); -static void nf_conntrack_free_rcu(struct rcu_head *head) -{ - struct nf_conn *ct = container_of(head, struct nf_conn, rcu); - - nf_ct_ext_free(ct); - kmem_cache_free(nf_conntrack_cachep, ct); -} - void nf_conntrack_free(struct nf_conn *ct) { struct net *net = nf_ct_net(ct); nf_ct_ext_destroy(ct); atomic_dec(&net->ct.count); - call_rcu(&ct->rcu, nf_conntrack_free_rcu); + nf_ct_ext_free(ct); + kmem_cache_free(nf_conntrack_cachep, ct); } EXPORT_SYMBOL_GPL(nf_conntrack_free); @@ -594,7 +609,7 @@ init_conntrack(struct net *net, } /* Overload tuple linked list to put us in unconfirmed list. 
*/ - hlist_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.unconfirmed); spin_unlock_bh(&nf_conntrack_lock); @@ -934,17 +949,17 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), { struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; - struct hlist_node *n; + struct hlist_nulls_node *n; spin_lock_bh(&nf_conntrack_lock); for (; *bucket < nf_conntrack_htable_size; (*bucket)++) { - hlist_for_each_entry(h, n, &net->ct.hash[*bucket], hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) goto found; } } - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) set_bit(IPS_DYING_BIT, &ct->status); @@ -992,7 +1007,7 @@ static int kill_all(struct nf_conn *i, void *data) return 1; } -void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, unsigned int size) +void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size) { if (vmalloced) vfree(hash); @@ -1060,26 +1075,28 @@ void nf_conntrack_cleanup(struct net *net) } } -struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced) +void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls) { - struct hlist_head *hash; - unsigned int size, i; + struct hlist_nulls_head *hash; + unsigned int nr_slots, i; + size_t sz; *vmalloced = 0; - size = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_head)); - hash = (void*)__get_free_pages(GFP_KERNEL|__GFP_NOWARN, - get_order(sizeof(struct hlist_head) - * size)); + BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct hlist_head)); + nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_nulls_head)); + sz = nr_slots * sizeof(struct hlist_nulls_head); + hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, + get_order(sz)); if (!hash) { *vmalloced = 1; printk(KERN_WARNING "nf_conntrack: falling back to vmalloc.\n"); - hash = vmalloc(sizeof(struct hlist_head) * size); + hash = __vmalloc(sz, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL); } - if (hash) - for (i = 0; i < size; i++) - INIT_HLIST_HEAD(&hash[i]); + if (hash && nulls) + for (i = 0; i < nr_slots; i++) + INIT_HLIST_NULLS_HEAD(&hash[i], i); return hash; } @@ -1090,7 +1107,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) int i, bucket, vmalloced, old_vmalloced; unsigned int hashsize, old_size; int rnd; - struct hlist_head *hash, *old_hash; + struct hlist_nulls_head *hash, *old_hash; struct nf_conntrack_tuple_hash *h; /* On boot, we can set this without any fancy locking. 
*/ @@ -1101,7 +1118,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) if (!hashsize) return -EINVAL; - hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced); + hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced, 1); if (!hash) return -ENOMEM; @@ -1116,12 +1133,12 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) */ spin_lock_bh(&nf_conntrack_lock); for (i = 0; i < nf_conntrack_htable_size; i++) { - while (!hlist_empty(&init_net.ct.hash[i])) { - h = hlist_entry(init_net.ct.hash[i].first, - struct nf_conntrack_tuple_hash, hnode); - hlist_del_rcu(&h->hnode); + while (!hlist_nulls_empty(&init_net.ct.hash[i])) { + h = hlist_nulls_entry(init_net.ct.hash[i].first, + struct nf_conntrack_tuple_hash, hnnode); + hlist_nulls_del_rcu(&h->hnnode); bucket = __hash_conntrack(&h->tuple, hashsize, rnd); - hlist_add_head_rcu(&h->hnode, &hash[bucket]); + hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]); } } old_size = nf_conntrack_htable_size; @@ -1172,7 +1189,7 @@ static int nf_conntrack_init_init_net(void) nf_conntrack_cachep = kmem_cache_create("nf_conntrack", sizeof(struct nf_conn), - 0, 0, NULL); + 0, SLAB_DESTROY_BY_RCU, NULL); if (!nf_conntrack_cachep) { printk(KERN_ERR "Unable to create nf_conn slab cache\n"); ret = -ENOMEM; @@ -1202,7 +1219,7 @@ static int nf_conntrack_init_net(struct net *net) int ret; atomic_set(&net->ct.count, 0); - INIT_HLIST_HEAD(&net->ct.unconfirmed); + INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, 0); net->ct.stat = alloc_percpu(struct ip_conntrack_stat); if (!net->ct.stat) { ret = -ENOMEM; @@ -1212,7 +1229,7 @@ static int nf_conntrack_init_net(struct net *net) if (ret < 0) goto err_ecache; net->ct.hash = nf_ct_alloc_hashtable(&nf_conntrack_htable_size, - &net->ct.hash_vmalloc); + &net->ct.hash_vmalloc, 1); if (!net->ct.hash) { ret = -ENOMEM; printk(KERN_ERR "Unable to create nf_conntrack_hash\n"); diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index 357ba39..3940f99 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -604,7 +604,7 @@ int nf_conntrack_expect_init(struct net *net) net->ct.expect_count = 0; net->ct.expect_hash = nf_ct_alloc_hashtable(&nf_ct_expect_hsize, - &net->ct.expect_vmalloc); + &net->ct.expect_vmalloc, 0); if (net->ct.expect_hash == NULL) goto err1; diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c index a51bdac..6066144 100644 --- a/net/netfilter/nf_conntrack_helper.c +++ b/net/netfilter/nf_conntrack_helper.c @@ -158,6 +158,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, struct nf_conntrack_tuple_hash *h; struct nf_conntrack_expect *exp; const struct hlist_node *n, *next; + const struct hlist_nulls_node *nn; unsigned int i; /* Get rid of expectations */ @@ -174,10 +175,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, } /* Get rid of expecteds, set helpers to NULL. 
*/ - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) + hlist_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) unhelp(h, me); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry(h, n, &net->ct.hash[i], hnode) + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) unhelp(h, me); } } @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, - &nf_ct_helper_vmalloc); + &nf_ct_helper_vmalloc, 0); if (!nf_ct_helper_hash) return -ENOMEM; diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 1b75c9e..349bbef 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -19,6 +19,7 @@ #include <linux/module.h> #include <linux/kernel.h> #include <linux/rculist.h> +#include <linux/rculist_nulls.h> #include <linux/types.h> #include <linux/timer.h> #include <linux/skbuff.h> @@ -536,7 +537,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) { struct nf_conn *ct, *last; struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = NLMSG_DATA(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; @@ -544,27 +545,27 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) last = (struct nf_conn *)cb->args[1]; for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) { restart: - hlist_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], + hnnode) { if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) continue; ct = nf_ct_tuplehash_to_ctrack(h); + if (!atomic_inc_not_zero(&ct->ct_general.use)) + continue; /* Dump entries of a given L3 protocol number. * If it is not specified, ie. l3proto == 0, * then dump everything. 
*/ if (l3proto && nf_ct_l3num(ct) != l3proto) - continue; + goto releasect; if (cb->args[1]) { if (ct != last) - continue; + goto releasect; cb->args[1] = 0; } if (ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).pid, cb->nlh->nlmsg_seq, IPCTNL_MSG_CT_NEW, 1, ct) < 0) { - if (!atomic_inc_not_zero(&ct->ct_general.use)) - continue; cb->args[1] = (unsigned long)ct; goto out; } @@ -577,6 +578,8 @@ restart: if (acct) memset(acct, 0, sizeof(struct nf_conn_counter[IP_CT_DIR_MAX])); } +releasect: + nf_ct_put(ct); } if (cb->args[1]) { cb->args[1] = 0; @@ -1242,13 +1245,12 @@ ctnetlink_create_conntrack(struct nlattr *cda[], if (err < 0) goto err2; - master_h = __nf_conntrack_find(&init_net, &master); + master_h = nf_conntrack_find_get(&init_net, &master); if (master_h == NULL) { err = -ENOENT; goto err2; } master_ct = nf_ct_tuplehash_to_ctrack(master_h); - nf_conntrack_get(&master_ct->ct_general); __set_bit(IPS_EXPECTED_BIT, &ct->status); ct->master = master_ct; } diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 4da54b0..1935153 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -44,40 +44,42 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { - if (++st->bucket >= nf_conntrack_htable_size) - return NULL; + while (is_a_nulls(head)) { + if (likely(get_nulls_value(head) == st->bucket)) { + if (++st->bucket >= nf_conntrack_htable_size) + return NULL; + } head = rcu_dereference(net->ct.hash[st->bucket].first); } return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) @@ -107,67 +109,74 @@ static void ct_seq_stop(struct seq_file *s, void *v) /* return 0 on success, 1 in case of error */ static int ct_seq_show(struct seq_file *s, void *v) { - const struct nf_conntrack_tuple_hash *hash = v; - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); + struct nf_conntrack_tuple_hash *hash = v; + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); const struct nf_conntrack_l3proto *l3proto; const struct nf_conntrack_l4proto *l4proto; + int ret = 0; NF_CT_ASSERT(ct); + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) + return 0; /* we only want to print DIR_ORIGINAL */ if (NF_CT_DIRECTION(hash)) - return 0; + goto release; l3proto = __nf_ct_l3proto_find(nf_ct_l3num(ct)); NF_CT_ASSERT(l3proto); l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); NF_CT_ASSERT(l4proto); + ret = -ENOSPC; if (seq_printf(s, "%-8s %u %-8s %u %ld ", l3proto->name, nf_ct_l3num(ct), 
l4proto->name, nf_ct_protonum(ct), timer_pending(&ct->timeout) ? (long)(ct->timeout.expires - jiffies)/HZ : 0) != 0) - return -ENOSPC; + goto release; if (l4proto->print_conntrack && l4proto->print_conntrack(s, ct)) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_ORIGINAL)) - return -ENOSPC; + goto release; if (!(test_bit(IPS_SEEN_REPLY_BIT, &ct->status))) if (seq_printf(s, "[UNREPLIED] ")) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_REPLY)) - return -ENOSPC; + goto release; if (test_bit(IPS_ASSURED_BIT, &ct->status)) if (seq_printf(s, "[ASSURED] ")) - return -ENOSPC; + goto release; #if defined(CONFIG_NF_CONNTRACK_MARK) if (seq_printf(s, "mark=%u ", ct->mark)) - return -ENOSPC; + goto release; #endif #ifdef CONFIG_NF_CONNTRACK_SECMARK if (seq_printf(s, "secmark=%u ", ct->secmark)) - return -ENOSPC; + goto release; #endif if (seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use))) - return -ENOSPC; + goto release; + ret = 0; +release: + nf_ct_put(ct); return 0; } diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c index 7f404cc..6809809 100644 --- a/net/netfilter/xt_connlimit.c +++ b/net/netfilter/xt_connlimit.c @@ -108,7 +108,7 @@ static int count_them(struct xt_connlimit_data *data, const struct nf_conntrack_tuple_hash *found; struct xt_connlimit_conn *conn; struct xt_connlimit_conn *tmp; - const struct nf_conn *found_ct; + struct nf_conn *found_ct; struct list_head *hash; bool addit = true; int matches = 0; @@ -123,7 +123,7 @@ static int count_them(struct xt_connlimit_data *data, /* check the saved connections */ list_for_each_entry_safe(conn, tmp, hash, list) { - found = __nf_conntrack_find(&init_net, &conn->tuple); + found = nf_conntrack_find_get(&init_net, &conn->tuple); found_ct = NULL; if (found != NULL) @@ -151,6 +151,7 @@ static int count_them(struct xt_connlimit_data *data, * we do not care about connections which are * closed already -> ditch it */ + nf_ct_put(found_ct); list_del(&conn->list); kfree(conn); continue; @@ -160,6 +161,7 @@ static int count_them(struct xt_connlimit_data *data, match->family)) /* same source network -> be counted! */ ++matches; + nf_ct_put(found_ct); } rcu_read_unlock(); -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 44+ messages in thread
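The "nulls" trick the patch leans on is easy to miss in the diff noise. Below is a minimal user-space sketch of the idea (toy structures and names, not the kernel's list_nulls API from linux/list_nulls.h): each hash chain ends in a fake pointer whose low bit is set and whose remaining bits encode the bucket number, so a lockless reader that lands on a terminator for the wrong bucket knows the entry it was traversing was moved to another chain (its memory reused, as SLAB_DESTROY_BY_RCU allows) and must restart the walk.

#include <stdint.h>
#include <stdio.h>

/* Toy model of the kernel's list_nulls scheme; not the real API.
 * Terminator: low bit set, upper bits carry the bucket number. */
#define NULLS_MARKER(b) ((struct node *)(((uintptr_t)(b) << 1) | 1))
#define IS_NULLS(p)     ((uintptr_t)(p) & 1)
#define NULLS_VALUE(p)  ((unsigned long)((uintptr_t)(p) >> 1))

struct node {
	struct node *next;
	int key;
};

static struct node *bucket[2];

static void insert(int b, struct node *n)
{
	n->next = bucket[b];
	bucket[b] = n;
}

int main(void)
{
	struct node a = { .key = 1 }, b = { .key = 2 };

	bucket[0] = NULLS_MARKER(0);
	bucket[1] = NULLS_MARKER(1);
	insert(0, &b);
	insert(0, &a);			/* bucket 0: a -> b -> nulls(0) */

	struct node *p = &b;		/* a lockless reader paused on b ... */

	a.next = NULLS_MARKER(0);	/* ... while a writer unlinks b ... */
	insert(1, &b);			/* ... and reinserts it in bucket 1 */

	while (!IS_NULLS(p))		/* the reader resumes walking "bucket 0" */
		p = p->next;
	if (NULLS_VALUE(p) != 0)	/* kernel: get_nulls_value(n) != hash */
		printf("ended on nulls %lu, expected 0 -> restart the walk\n",
		       NULLS_VALUE(p));
	return 0;
}

This is exactly the test __nf_conntrack_find() and ct_get_next() gain above: one comparison at chain end on the fast path, and the check is only ever wrong in the safe direction, a spurious restart.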
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 19:58 ` Eric Dumazet @ 2009-03-25 20:10 ` Patrick McHardy 0 siblings, 0 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 20:10 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Eric Dumazet wrote: > Patrick McHardy a écrit : >> I think double entries are not a problem, as you say, there >> are already other cases where this can happen. But I think we >> should try our best that every entry present at the start and >> still present at the end of a dump is also contained in the >> dump, otherwise the guarantees seem too weak to still be useful. >> Your first proposal would do exactly that, right? > > If your concern is to not forget entries, and we are allowed to print some entries several times, > then we can just check the final "nulls" value, and if we find a different value than expected for > chain N, go back to the beginning of chain N. > > No need to check the hash value (that would help avoid printing the same entry several times, but we don't care that much): > > + while (is_a_nulls(head)) { > + if (likely(get_nulls_value(head) == st->bucket)) { > + if (++st->bucket >= nf_conntrack_htable_size) > + return NULL; > + } Looks perfect :) > [PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() > > Use the "hlist_nulls" infrastructure we added in 2.6.29 for the RCUification of UDP & TCP. > > This permits an easy conversion from call_rcu() based hash lists to > SLAB_DESTROY_BY_RCU ones. > > Avoiding the call_rcu() delay at nf_conn freeing time has numerous gains: > > - It doesn't fill RCU queues (up to 10000 elements per CPU). > This reduces the OOM possibility when queued elements are not taken into account, > and reduces latency problems when the RCU queue size hits its high limit and triggers > emergency mode. > > - It allows fast reuse of just-freed elements, permitting better use of > the CPU cache. > > - We delete rcu_head from "struct nf_conn", shrinking the size of this structure > by 8 or 16 bytes. > > This patch only takes care of "struct nf_conn". > call_rcu() is still used for less critical conntrack parts, which may > be converted later if necessary. Applied, thanks a lot. ^ permalink raw reply [flat|nested] 44+ messages in thread
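Worth spelling out, because it is the subtle half of the SLAB_DESTROY_BY_RCU contract applied here: after a lockless lookup finds an entry, the reader must take a reference only if the refcount is still non-zero, and must then re-check the identity, since the slab object may have been freed and reused for a different connection in between. A compressed user-space sketch of the pattern nf_conntrack_find_get() uses in the applied patch (illustrative names; the kernel's atomic_inc_not_zero() is the CAS loop below):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* User-space toy of the lookup protocol; names are illustrative. */
struct conn {
	atomic_uint use;	/* refcount: 0 means the object is dying */
	char tuple[64];		/* identity; may change if the slot is reused */
};

/* Take a reference only if the object is still live. */
static bool get_ref(struct conn *c)
{
	unsigned int old = atomic_load(&c->use);

	while (old != 0)
		if (atomic_compare_exchange_weak(&c->use, &old, old + 1))
			return true;
	return false;		/* refcount already hit zero: skip the entry */
}

static void put_ref(struct conn *c)
{
	atomic_fetch_sub(&c->use, 1);	/* the real code frees at zero */
}

static struct conn *find_get(struct conn *c, const char *tuple)
{
	if (!get_ref(c))
		return NULL;
	if (strcmp(c->tuple, tuple) != 0) {
		/* The slot was freed and reused for another connection
		 * between the hash walk and the refcount grab: drop the
		 * reference and (in the kernel code) goto begin to retry. */
		put_ref(c);
		return NULL;
	}
	return c;
}

int main(void)
{
	static struct conn c;

	atomic_init(&c.use, 1);
	strcpy(c.tuple, "tcp 192.168.1.15:56445 -> 192.168.1.17:23");
	printf("lookup: %s\n",
	       find_get(&c, "tcp 192.168.1.15:56445 -> 192.168.1.17:23")
	       ? "found, reference taken" : "missed");
	return 0;
}

The seq_file dumpers follow the same discipline: the atomic_inc_not_zero() added at the top of ct_seq_show() pins the entry for the duration of the print, and nf_ct_put() at the release label drops it.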
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 15:17 ` Maxime Bizon 2009-03-24 15:21 ` Patrick McHardy 2009-03-24 15:27 ` Eric Dumazet @ 2009-03-24 18:29 ` Joakim Tjernlund 2 siblings, 0 replies; 44+ messages in thread From: Joakim Tjernlund @ 2009-03-24 18:29 UTC (permalink / raw) To: mbizon; +Cc: avorontsov, Eric Dumazet, Patrick McHardy, netdev, Paul E. McKenney Maxime Bizon <mbizon@freebox.fr> wrote on 24/03/2009 16:17:30: > > > On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote: > > Hi Eric, > > > We use RCU to defer freeing of conntrack structures. In a DoS situation, > > RCU might accumulate about 10,000 elements per CPU in its internal > > queues. To get accurate conntrack counts (at the expense of slightly > > more RAM used), we might consider that the conntrack counter does not take into > > account "about to be freed elements, waiting in RCU queues". We thus > > decrement it in nf_conntrack_free(), not in the RCU callback. > > Your patch fixes the problem on my board too (embedded MIPS router, > 250 MHz), thanks. > > Yet I'm concerned about what you said concerning RAM usage. I have a > very small amount of memory left on my board (less than 4M), and I tuned > the ip route cache size and nf_conntrack_max to make sure I won't go OOM. > > With your patch, does it mean 10,000 conntrack entries can be allocated > while nf_conntrack_max is, say, only 2048? Just add "rcuclassic.qhimark=2048" to your cmdline. Jocke ^ permalink raw reply [flat|nested] 44+ messages in thread
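The accounting change debated here is small enough to capture in a toy model (hypothetical names, nothing from the kernel): the conntrack counter used to drop only when the deferred RCU callback finally ran, so under a flood the table could report "full" while thousands of entries were merely waiting for a grace period. Decrementing at free time fixes the false "table full", and Joakim's rcuclassic.qhimark hint lowers the per-CPU queue length at which classic RCU starts forcing grace periods, bounding the deferred backlog in the first place.

#include <stdio.h>

#define CONNTRACK_MAX 2048

static int count;	/* checked against CONNTRACK_MAX on allocation */
static int deferred;	/* frees queued, waiting for an RCU grace period */

/* Old scheme: count drops only when the RCU callback runs. */
static void free_old(void) { deferred++; }

/* Patched scheme: count drops immediately in nf_conntrack_free();
 * only the memory itself keeps waiting for the grace period. */
static void free_new(void) { deferred++; count--; }

static const char *try_alloc(void)
{
	if (count >= CONNTRACK_MAX)
		return "table full, dropping packet";
	count++;
	return "ok";
}

int main(void)
{
	count = CONNTRACK_MAX; deferred = 0;	/* table is full */
	for (int i = 0; i < 100; i++)
		free_old();			/* 100 connections die */
	printf("old: alloc -> %s (count=%d)\n", try_alloc(), count);

	count = CONNTRACK_MAX; deferred = 0;
	for (int i = 0; i < 100; i++)
		free_new();
	printf("new: alloc -> %s (count=%d)\n", try_alloc(), count);
	return 0;
}

Maxime's concern maps onto the deferred counter: with the patch, up to qhimark freed-but-unreclaimed objects per CPU can coexist with nf_conntrack_max live ones, which is why lowering qhimark matters on a 4M board.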
* Re: ucc_geth: nf_conntrack: table full, dropping packet. 2009-03-23 17:42 ` Joakim Tjernlund 2009-03-23 17:49 ` Patrick McHardy @ 2009-03-23 17:49 ` Eric Dumazet 2009-03-23 18:04 ` Joakim Tjernlund 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-23 17:49 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: Patrick McHardy, avorontsov, netdev Joakim Tjernlund a écrit : > Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33: > >> Joakim Tjernlund wrote: >>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:15:45: >>>> Joakim Tjernlund wrote: >>>>> doing a "ping -f -l 3" on my host towards my board on linus tree as > of >>>>> Friday results in lots of: >>>>> nf_conntrack: table full, dropping packet. >>>>> nf_conntrack: table full, dropping packet. >>>>> nf_conntrack: table full, dropping packet. >>>>> __ratelimit: 11 callbacks suppressed >>>>> nf_conntrack: table full, dropping packet. >>>>> nf_conntrack: table full, dropping packet. >>>>> nf_conntrack: table full, dropping packet. >>>>> nf_conntrack: table full, dropping packet. >>>>> >>>>> for ucc_geth on a MPC832x. >>>>> This really looks strange to me, ideas? >>>> What does /proc/net/netfilter/nf_conntrack show? >>> There is no /proc/net/netfilter/nf_conntrack. There is a >>> /proc/net/nf_conntrack though and it is empty. If I telnet >>> to the board I see: >> That means that something is leaking conntrack references, most likely >> by leaking skbs. Since I haven't seen any other reports, my guess would >> be the ucc_geth driver. >> > > Mucking around with the ucc_geth driver I found that if I: > - Move TX from IRQ to NAPI context > - double the weight. > - after booting up, wait a few mins until the JFFS2 GC kernel thread has > stopped > scanning the FS > > Then the "nf_conntrack: table full, dropping packet." msgs stops. > Does this seem right to you guys? > How many cpus do you have ? What kernel version do you use ? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: ucc_geth: nf_conntrack: table full, dropping packet. 2009-03-23 17:49 ` ucc_geth: nf_conntrack: table full, dropping packet Eric Dumazet @ 2009-03-23 18:04 ` Joakim Tjernlund 2009-03-23 18:08 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Joakim Tjernlund @ 2009-03-23 18:04 UTC (permalink / raw) To: Eric Dumazet; +Cc: avorontsov, Patrick McHardy, netdev Eric Dumazet <dada1@cosmosbay.com> wrote on 23/03/2009 18:49:38: > > Joakim Tjernlund a écrit : > > Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33: > > > >> Joakim Tjernlund wrote: > >>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:15:45: > >>>> Joakim Tjernlund wrote: > >>>>> doing a "ping -f -l 3" on my host towards my board on linus tree as > > of > >>>>> Friday results in lots of: > >>>>> nf_conntrack: table full, dropping packet. > >>>>> nf_conntrack: table full, dropping packet. > >>>>> nf_conntrack: table full, dropping packet. > >>>>> __ratelimit: 11 callbacks suppressed > >>>>> nf_conntrack: table full, dropping packet. > >>>>> nf_conntrack: table full, dropping packet. > >>>>> nf_conntrack: table full, dropping packet. > >>>>> nf_conntrack: table full, dropping packet. > >>>>> > >>>>> for ucc_geth on a MPC832x. > >>>>> This really looks strange to me, ideas? > >>>> What does /proc/net/netfilter/nf_conntrack show? > >>> There is no /proc/net/netfilter/nf_conntrack. There is a > >>> /proc/net/nf_conntrack though and it is empty. If I telnet > >>> to the board I see: > >> That means that something is leaking conntrack references, most likely > >> by leaking skbs. Since I haven't seen any other reports, my guess would > >> be the ucc_geth driver. > >> > > > > Mucking around with the ucc_geth driver I found that if I: > > - Move TX from IRQ to NAPI context > > - double the weight. > > - after booting up, wait a few mins until the JFFS2 GC kernel thread has > > stopped > > scanning the FS > > > > Then the "nf_conntrack: table full, dropping packet." msgs stops. > > Does this seem right to you guys? > > > > How many cpus do you have ? Just one, it is an embedded board running at 266 MHz > > What kernel version do you use ? Linus tree as of Friday ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: ucc_geth: nf_conntrack: table full, dropping packet. 2009-03-23 18:04 ` Joakim Tjernlund @ 2009-03-23 18:08 ` Eric Dumazet 0 siblings, 0 replies; 44+ messages in thread From: Eric Dumazet @ 2009-03-23 18:08 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: avorontsov, Patrick McHardy, netdev Joakim Tjernlund a écrit : > Eric Dumazet <dada1@cosmosbay.com> wrote on 23/03/2009 18:49:38: > >> Joakim Tjernlund a écrit : >>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33: >>> >>>> Joakim Tjernlund wrote: >>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:15:45: >>>>>> Joakim Tjernlund wrote: >>>>>>> doing a "ping -f -l 3" on my host towards my board on linus tree > as >>> of >>>>>>> Friday results in lots of: >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> __ratelimit: 11 callbacks suppressed >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> >>>>>>> for ucc_geth on a MPC832x. >>>>>>> This really looks strange to me, ideas? >>>>>> What does /proc/net/netfilter/nf_conntrack show? >>>>> There is no /proc/net/netfilter/nf_conntrack. There is a >>>>> /proc/net/nf_conntrack though and it is empty. If I telnet >>>>> to the board I see: >>>> That means that something is leaking conntrack references, most > likely >>>> by leaking skbs. Since I haven't seen any other reports, my guess > would >>>> be the ucc_geth driver. >>>> >>> Mucking around with the ucc_geth driver I found that if I: >>> - Move TX from IRQ to NAPI context >>> - double the weight. >>> - after booting up, wait a few mins until the JFFS2 GC kernel thread > has >>> stopped >>> scanning the FS >>> >>> Then the "nf_conntrack: table full, dropping packet." msgs stops. >>> Does this seem right to you guys? >>> >> How many cpus do you have ? > > Just one, it is an embedded board running at 266 MHz > >> What kernel version do you use ? > > Linus tree as of Friday > > I suspect RCU problem. Maybe the GC kernel threads blocks synchronize_rcu() ? ^ permalink raw reply [flat|nested] 44+ messages in thread
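For anyone reproducing this, the mismatch being debugged (the counter pinned at its limit while /proc/net/nf_conntrack looks nearly empty) is visible from user space by comparing two sysctls. A minimal sketch, assuming the usual nf_conntrack sysctl paths are present on the running kernel:

#include <stdio.h>

static long read_long(const char *path)
{
	FILE *f = fopen(path, "r");
	long v = -1;

	if (f) {
		if (fscanf(f, "%ld", &v) != 1)
			v = -1;
		fclose(f);
	}
	return v;
}

int main(void)
{
	/* Paths assume nf_conntrack sysctls are enabled on this kernel. */
	long count = read_long("/proc/sys/net/netfilter/nf_conntrack_count");
	long max   = read_long("/proc/sys/net/netfilter/nf_conntrack_max");

	/* A count stuck near max alongside an almost-empty table means
	 * entries are pinned by leaked skb references or are parked in
	 * RCU queues, as discussed in this thread. */
	printf("conntrack: %ld of %ld entries accounted for\n", count, max);
	return 0;
}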
Thread overview: 44+ messages
2009-03-23 10:42 ucc_geth: nf_conntrack: table full, dropping packet Joakim Tjernlund
2009-03-23 12:15 ` Patrick McHardy
2009-03-23 12:25 ` Joakim Tjernlund
2009-03-23 12:29 ` Patrick McHardy
2009-03-23 12:59 ` Joakim Tjernlund
[not found] ` <OF387EC803.F810F72A-ONC1257582.00468C6E-C1257582.00475783@LocalDomain>
2009-03-23 13:09 ` Joakim Tjernlund
2009-03-23 17:42 ` Joakim Tjernlund
2009-03-23 17:49 ` Patrick McHardy
2009-03-24 8:22 ` Joakim Tjernlund
2009-03-24 9:12 ` Eric Dumazet
2009-03-24 10:55 ` Joakim Tjernlund
2009-03-24 12:07 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Eric Dumazet
2009-03-24 12:25 ` Eric Dumazet
2009-03-24 12:43 ` Patrick McHardy
2009-03-24 13:32 ` Eric Dumazet
2009-03-24 13:38 ` Patrick McHardy
2009-03-24 13:47 ` Eric Dumazet
[not found] ` <49C8F871.9070600@cosmosbay.com>
[not found] ` <49C8F8E0.9050502@trash.net>
2009-03-25 3:53 ` Eric Dumazet
2009-03-25 13:39 ` Patrick McHardy
2009-03-25 13:44 ` Eric Dumazet
2009-03-24 13:20 ` Joakim Tjernlund
2009-03-24 13:28 ` Patrick McHardy
2009-03-24 13:29 ` Eric Dumazet
2009-03-24 13:41 ` Joakim Tjernlund
2009-03-24 15:17 ` Maxime Bizon
2009-03-24 15:21 ` Patrick McHardy
2009-03-24 15:27 ` Eric Dumazet
2009-03-24 19:54 ` [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Eric Dumazet
2009-03-25 16:26 ` Patrick McHardy
2009-03-25 17:53 ` [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs Eric Dumazet
2009-03-25 18:05 ` Patrick McHardy
2009-03-25 18:06 ` Patrick McHardy
2009-03-25 18:15 ` Eric Dumazet
2009-03-25 18:24 ` Patrick McHardy
2009-03-25 18:53 ` Eric Dumazet
2009-03-25 19:00 ` Patrick McHardy
2009-03-25 19:17 ` Eric Dumazet
2009-03-25 19:41 ` Patrick McHardy
2009-03-25 19:58 ` Eric Dumazet
2009-03-25 20:10 ` Patrick McHardy
2009-03-24 18:29 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Joakim Tjernlund
2009-03-23 17:49 ` ucc_geth: nf_conntrack: table full, dropping packet Eric Dumazet
2009-03-23 18:04 ` Joakim Tjernlund
2009-03-23 18:08 ` Eric Dumazet