* ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-23 10:42 UTC
  To: netdev, avorontsov

Doing a "ping -f -l 3" on my host towards my board, on Linus' tree as of
Friday, results in lots of:

nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
__ratelimit: 11 callbacks suppressed
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.

This is with ucc_geth on an MPC832x.
This really looks strange to me. Ideas?

 Jocke
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Patrick McHardy @ 2009-03-23 12:15 UTC
  To: Joakim Tjernlund; +Cc: netdev, avorontsov

Joakim Tjernlund wrote:
> Doing a "ping -f -l 3" on my host towards my board, on Linus' tree as of
> Friday, results in lots of:
> nf_conntrack: table full, dropping packet.
> [...]
>
> This is with ucc_geth on an MPC832x.
> This really looks strange to me. Ideas?

What does /proc/net/netfilter/nf_conntrack show?
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-23 12:25 UTC
  To: Patrick McHardy; +Cc: avorontsov, netdev

Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:15:45:
> Joakim Tjernlund wrote:
> > Doing a "ping -f -l 3" on my host towards my board, on Linus' tree as of
> > Friday, results in lots of:
> > nf_conntrack: table full, dropping packet.
> > [...]
> >
> > This is with ucc_geth on an MPC832x.
> > This really looks strange to me. Ideas?
>
> What does /proc/net/netfilter/nf_conntrack show?

There is no /proc/net/netfilter/nf_conntrack. There is a
/proc/net/nf_conntrack though, and it is empty. If I telnet
to the board I see:

ipv4 2 tcp 6 431990 ESTABLISHED src=192.168.1.15 dst=192.168.1.17 sport=56445 dport=23 src=192.168.1.17 dst=192.168.1.15 sport=23 dport=56445 [ASSURED] use=1
ipv4 2 udp 17 7 src=192.168.1.228 dst=192.168.1.255 sport=138 dport=138 [UNREPLIED] src=192.168.1.255 dst=192.168.1.228 sport=138 dport=138 use=1
ipv4 2 udp 17 20 src=127.0.0.1 dst=127.0.0.1 sport=34261 dport=53 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=53 dport=34261 use=1
ipv4 2 udp 17 2 src=192.168.1.199 dst=192.168.1.255 sport=138 dport=138 [UNREPLIED] src=192.168.1.255 dst=192.168.1.199 sport=138 dport=138 use=1
ipv4 2 udp 17 20 src=127.0.0.1 dst=127.0.0.1 sport=40417 dport=53 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=53 dport=40417 use=1

 Jocke
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Patrick McHardy @ 2009-03-23 12:29 UTC
  To: Joakim Tjernlund; +Cc: avorontsov, netdev

Joakim Tjernlund wrote:
> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:15:45:
>> What does /proc/net/netfilter/nf_conntrack show?
>
> There is no /proc/net/netfilter/nf_conntrack. There is a
> /proc/net/nf_conntrack though, and it is empty. If I telnet
> to the board I see:

That means that something is leaking conntrack references, most likely
by leaking skbs. Since I haven't seen any other reports, my guess would
be the ucc_geth driver.
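[Note: the reference counting Patrick is describing works roughly as
follows; this is a paraphrased sketch of the 2.6.29-era helper, not a
verbatim copy. Every skb that conntrack has processed holds one reference
on its conntrack entry via skb->nfct, and the entry is only destroyed once
the last reference drops:

static inline void nf_conntrack_put(struct nf_conntrack *nfct)
{
	if (nfct && atomic_dec_and_test(&nfct->use))
		nf_conntrack_destroy(nfct);	/* last reference gone */
}

kfree_skb() drops that reference while releasing the skb's state, so an
skb that is leaked, i.e. never freed, pins its conntrack entry forever.
That is why leaked skbs would show up as a permanently full conntrack
table.]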
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-23 12:59 UTC
  To: Patrick McHardy; +Cc: avorontsov, netdev

Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33:
> Joakim Tjernlund wrote:
> > There is no /proc/net/netfilter/nf_conntrack. There is a
> > /proc/net/nf_conntrack though, and it is empty. If I telnet
> > to the board I see:
>
> That means that something is leaking conntrack references, most likely
> by leaking skbs. Since I haven't seen any other reports, my guess would
> be the ucc_geth driver.

Hmm, I cannot see what in the ucc_geth driver could possibly be "leaking".
One thing I do notice is that the board becomes almost unresponsive during
the ping flood. Perhaps it is building up a backlog of conntracks during
the ping flood?

 Jocke
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-23 13:09 UTC
  To: leoli; +Cc: avorontsov, Patrick McHardy, netdev

Joakim Tjernlund/Transmode wrote on 23/03/2009 13:59:15:
> Hmm, I cannot see what in the ucc_geth driver could possibly be "leaking".
> One thing I do notice is that the board becomes almost unresponsive during
> the ping flood. Perhaps it is building up a backlog of conntracks during
> the ping flood?

Hmm, ethtool -S eth0 shows:
...
rx-mismatch-drop-frames: 13
...

13 matches the number of lost frames after I stop the ping flood.
The MPC832x manual says this (if I found the correct counter):

  REBASE+2C MisMatchDrop 32
  Counts number of frames dropped due to MAC filtering process, (e.g.
  Address Mismatch, Type mismatch) and that would otherwise considered
  good frame that would be transferred to upper layers

Why would a ping flood result in an Address Mismatch or Type mismatch?
Leo?

 Jocke
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-23 17:42 UTC
  To: Patrick McHardy; +Cc: avorontsov, netdev

Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33:
> That means that something is leaking conntrack references, most likely
> by leaking skbs. Since I haven't seen any other reports, my guess would
> be the ucc_geth driver.

Mucking around with the ucc_geth driver, I found that if I:
 - move TX from IRQ to NAPI context,
 - double the NAPI weight, and
 - after booting up, wait a few minutes until the JFFS2 GC kernel thread
   has stopped scanning the FS,

then the "nf_conntrack: table full, dropping packet." msgs stop.
Does this seem right to you guys?

 Jocke
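[Note: "move TX from IRQ to NAPI context" means doing the TX-completion
cleanup in the driver's NAPI poll callback instead of in the hard
interrupt handler. A generic sketch of that pattern, with hypothetical
foo_* helpers standing in for the driver's own ring routines; this is not
the actual ucc_geth change:

/* hard IRQ: only mask device interrupts and schedule polling */
static irqreturn_t foo_interrupt(int irq, void *dev_id)
{
	struct foo_priv *priv = dev_id;

	foo_mask_irqs(priv);		/* hypothetical helper */
	napi_schedule(&priv->napi);
	return IRQ_HANDLED;
}

/* softirq: reclaim completed TX skbs, then process RX within budget */
static int foo_poll(struct napi_struct *napi, int budget)
{
	struct foo_priv *priv = container_of(napi, struct foo_priv, napi);
	int work_done;

	foo_clean_tx_ring(priv);	/* hypothetical: free sent skbs */
	work_done = foo_clean_rx_ring(priv, budget);
	if (work_done < budget) {
		napi_complete(napi);
		foo_unmask_irqs(priv);	/* hypothetical helper */
	}
	return work_done;
}

Doubling the weight simply lets one poll call process more packets before
yielding the CPU.]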
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Patrick McHardy @ 2009-03-23 17:49 UTC
  To: Joakim Tjernlund; +Cc: avorontsov, netdev

Joakim Tjernlund wrote:
> Mucking around with the ucc_geth driver, I found that if I:
> - move TX from IRQ to NAPI context,
> - double the NAPI weight, and
> - after booting up, wait a few minutes until the JFFS2 GC kernel thread
>   has stopped scanning the FS,
>
> then the "nf_conntrack: table full, dropping packet." msgs stop.
> Does this seem right to you guys?

No. As I said, something seems to be leaking packets. You should be
able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
If that *doesn't* show any signs of a leak, please run "conntrack -E"
to capture the conntrack events before the "table full" message
appears and post the output.
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-24 8:22 UTC
  To: Patrick McHardy; +Cc: avorontsov, netdev

Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 18:49:15:
> No. As I said, something seems to be leaking packets. You should be
> able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
> If that *doesn't* show any signs of a leak, please run "conntrack -E"
> to capture the conntrack events before the "table full" message
> appears and post the output.

The skbuff caches do not differ much, but others do.

Before ping:
skbuff_fclone_cache    0    0  352 11 1 : tunables  54 27 0 : slabdata   0   0 0
skbuff_head_cache     20   20  192 20 1 : tunables 120 60 0 : slabdata   1   1 0
size-64              731  767   64 59 1 : tunables 120 60 0 : slabdata  13  13 0
nf_conntrack          10   19  208 19 1 : tunables 120 60 0 : slabdata   1   1 0

During ping:
skbuff_fclone_cache    0    0  352 11 1 : tunables  54 27 0 : slabdata   0   0 0
skbuff_head_cache     40   40  192 20 1 : tunables 120 60 0 : slabdata   2   2 0
size-64             8909 8909   64 59 1 : tunables 120 60 0 : slabdata 151 151 0
nf_conntrack        5111 5111  208 19 1 : tunables 120 60 0 : slabdata 269 269 0

This feels more like the freeing of conntrack objects being delayed,
building up while ping flooding.

I don't have "conntrack -E" for my embedded board, so that will have to
wait a bit longer.

 Jocke
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Eric Dumazet @ 2009-03-24 9:12 UTC
  To: Joakim Tjernlund; +Cc: Patrick McHardy, avorontsov, netdev, Paul E. McKenney

Joakim Tjernlund wrote:
> The skbuff caches do not differ much, but others do.
>
> [slabinfo dumps snipped]
>
> This feels more like the freeing of conntrack objects being delayed,
> building up while ping flooding.

I don't understand how your ping can use so many conntrack entries...

Then, as I said yesterday, I believe you have an RCU delay, because of
a misbehaving driver or something...

grep RCU .config
grep CONFIG_SMP .config

You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 80)
as a workaround. It should force a quiescent state after 1000 freed
conntracks.
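[Note: for readers without the source at hand, the knobs Eric refers to
sit near the top of kernel/rcuclassic.c in this kernel; roughly, and
paraphrased rather than quoted verbatim:

static int blimit = 10;		/* callbacks freed per softirq batch */
static int qhimark = 10000;	/* queue length that triggers flush mode */
static int qlowmark = 100;	/* queue length to resume normal batching */

/* in the call_rcu() path, per CPU: */
if (unlikely(++rdp->qlen > qhimark)) {
	rdp->blimit = INT_MAX;	/* lift the per-batch limit */
	force_quiescent_state(rdp, &rcu_ctrlblk);
}

The suggested workaround is simply lowering the 10000 initializer to 1000,
so at most about 1000 dead conntracks can queue up per CPU before RCU
flushes them.]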
* Re: ucc_geth: nf_conntrack: table full, dropping packet.
  From: Joakim Tjernlund @ 2009-03-24 10:55 UTC
  To: Eric Dumazet; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney

Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53:
> I don't understand how your ping can use so many conntrack entries...
>
> Then, as I said yesterday, I believe you have an RCU delay, because of
> a misbehaving driver or something...
>
> grep RCU .config

grep RCU .config
# RCU Subsystem
CONFIG_CLASSIC_RCU=y
# CONFIG_TREE_RCU is not set
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set

> grep CONFIG_SMP .config

grep CONFIG_SMP .config
# CONFIG_SMP is not set

> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 80)
> as a workaround. It should force a quiescent state after 1000 freed
> conntracks.

Right, doing this almost killed all the conntrack messages; I had to
stress it pretty hard before I saw a handful of "nf_conntrack: table full,
dropping packet" messages.

RCU is not my cup of tea. Do you have any ideas where to look?

 Jocke
* [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Eric Dumazet @ 2009-03-24 12:07 UTC
  To: Joakim Tjernlund; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney

Joakim Tjernlund wrote:
> Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53:
>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c
>> (line 80) as a workaround. It should force a quiescent state after
>> 1000 freed conntracks.
>
> Right, doing this almost killed all the conntrack messages; I had to
> stress it pretty hard before I saw a handful of "nf_conntrack: table
> full, dropping packet" messages.
>
> RCU is not my cup of tea. Do you have any ideas where to look?

In a stress situation, you feed more deleted conntracks to call_rcu() than
the blimit (10 real freeings per RCU softirq invocation).

So with the default qhimark of 10000, about 10000 conntracks can sit in
RCU queues (per CPU) before being really freed.

Only when hitting 10000 does RCU enter a special mode that frees all
queued items, instead of a small batch of 10.

To solve your problem we can:

1) reduce qhimark from 10000 to 1000 (for example).
   Probably should be done anyway, to reduce some spikes in the RCU code
   when freeing a whole batch of 10000 elements...
OR
2) change the conntrack tunable (max conntrack entries on your machine).
OR
3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
   in nf_conntrack_free() instead of in the RCU callback.

[PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()

We use RCU to defer freeing of conntrack structures. In a DOS situation,
RCU might accumulate about 10,000 elements per CPU in its internal queues.
To get accurate conntrack counts (at the expense of slightly more RAM
used), we might consider the conntrack counter as not taking into account
the "about to be freed" elements waiting in RCU queues. We thus decrement
it in nf_conntrack_free(), not in the RCU callback.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f4935e3..6478dc7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -516,16 +516,17 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 static void nf_conntrack_free_rcu(struct rcu_head *head)
 {
 	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
-	struct net *net = nf_ct_net(ct);
 
 	nf_ct_ext_free(ct);
 	kmem_cache_free(nf_conntrack_cachep, ct);
-	atomic_dec(&net->ct.count);
 }
 
 void nf_conntrack_free(struct nf_conn *ct)
 {
+	struct net *net = nf_ct_net(ct);
+
 	nf_ct_ext_destroy(ct);
+	atomic_dec(&net->ct.count);
 	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_free);
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Eric Dumazet @ 2009-03-24 12:25 UTC
  To: Joakim Tjernlund; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney

Eric Dumazet wrote:
> [quoted context snipped]
>
>  void nf_conntrack_free(struct nf_conn *ct)
>  {
> +	struct net *net = nf_ct_net(ct);
> +
>  	nf_ct_ext_destroy(ct);
> +	atomic_dec(&net->ct.count);
>  	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_free);

I forgot to say this is what we do for 'struct file' freeing as well. We
decrement nr_files in file_free(), not in file_free_rcu():

static inline void file_free_rcu(struct rcu_head *head)
{
	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);

	put_cred(f->f_cred);
	kmem_cache_free(filp_cachep, f);
}

static inline void file_free(struct file *f)
{
	percpu_counter_dec(&nr_files);   <<<< HERE >>>>
	file_check_state(f);
	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Patrick McHardy @ 2009-03-24 12:43 UTC
  To: Eric Dumazet; +Cc: Joakim Tjernlund, avorontsov, netdev, Paul E. McKenney

Eric Dumazet wrote:
> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
>
> We use RCU to defer freeing of conntrack structures. In a DOS situation,
> RCU might accumulate about 10,000 elements per CPU in its internal
> queues. [...]
>
> I forgot to say this is what we do for 'struct file' freeing as well. We
> decrement nr_files in file_free(), not in file_free_rcu()

While temporarily exceeding the limit by up to 10000 entries is
quite a lot, I guess the important thing is that it can't grow
unbounded, so I think this patch is fine.
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Eric Dumazet @ 2009-03-24 13:32 UTC
  To: Patrick McHardy; +Cc: Joakim Tjernlund, avorontsov, netdev, Paul E. McKenney

Patrick McHardy wrote:
> Eric Dumazet wrote:
>> I forgot to say this is what we do for 'struct file' freeing as well. We
>> decrement nr_files in file_free(), not in file_free_rcu()
>
> While temporarily exceeding the limit by up to 10000 entries is
> quite a lot, I guess the important thing is that it can't grow
> unbounded, so I think this patch is fine.

Maybe we could use the SLAB_DESTROY_BY_RCU thing and have no more
call_rcu() queueing problem. That would make better use of CPU caches
as well...
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Patrick McHardy @ 2009-03-24 13:38 UTC
  To: Eric Dumazet; +Cc: Joakim Tjernlund, avorontsov, netdev, Paul E. McKenney

Eric Dumazet wrote:
> Maybe we could use the SLAB_DESTROY_BY_RCU thing and have no more
> call_rcu() queueing problem. That would make better use of CPU caches
> as well...

I'm not sure I understand the rules correctly, but we'd still
have to wait for the grace period before an object can be reused,
no?
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Eric Dumazet @ 2009-03-24 13:47 UTC
  To: Patrick McHardy; +Cc: Joakim Tjernlund, avorontsov, netdev, Paul E. McKenney

Patrick McHardy wrote:
> I'm not sure I understand the rules correctly, but we'd still
> have to wait for the grace period before an object can be reused,
> no?

No, we don't have to, but we must do additional checks after getting a
reference on an object found during lookup. (We must re-check the keys
used during the search.)

This re-check is not very expensive since everything is hot in the CPU
cache.

Check Documentation/RCU/rculist_nulls.txt for some documentation.

1) Lookup algo
--------------

rcu_read_lock()
begin:
obj = lockless_lookup(key);
if (obj) {
	if (!try_get_ref(obj)) // might fail for free objects
		goto begin;
	/*
	 * Because a writer could delete the object, and a writer could
	 * reuse this object before the RCU grace period, we
	 * must check the key after getting the reference on the object
	 */
	if (obj->key != key) { // not the object we expected
		put_ref(obj);
		goto begin;
	}
}
rcu_read_unlock();
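[Note: the try_get_ref() in the sketch above is where SLAB_DESTROY_BY_RCU
differs from plain RCU freeing. The slab page is never handed back to the
page allocator while readers may still hold RCU references, so a
concurrent lookup can always safely read the object's refcount, but the
object itself may already have been freed and reused for a new entry. In
conntrack terms the conditional grab would look roughly like this (a
sketch, not the final patch):

	/* fails (returns 0) if we raced with the entry being freed */
	if (!atomic_inc_not_zero(&ct->ct_general.use))
		goto begin;	/* object was dead; restart lookup */

	/* the memory may now hold a *different* conntrack: re-check key */
	if (!nf_ct_tuple_equal(tuple, &h->tuple)) {
		nf_ct_put(ct);
		goto begin;
	}

This is exactly the re-check that appears in nf_conntrack_find_get() in
the patch further down the thread.]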
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
  From: Eric Dumazet @ 2009-03-25 3:53 UTC
  To: Patrick McHardy; +Cc: Netfilter Developers, Linux Netdev List

Patrick McHardy wrote:
> Eric Dumazet wrote:
>> Patrick, I can work on this if you want, since this stuff is fresh
>> in my head; let me know if you already challenged it or not :)
>
> I'm still looking at the details; if you want to take care of this,
> that would be great :)

I have a little problem with __nf_conntrack_find() being exported.

The problem is that with SLAB_DESTROY_BY_RCU we must take a reference on
the object to recheck it. So ideally only nf_conntrack_find_get() should
be used, or callers of __nf_conntrack_find() should lock nf_conntrack_lock
(as is properly done, for example, in net/netfilter/nf_conntrack_netlink.c,
line 1292).

Here is a preliminary patch for review (not tested at all, it's 4h50 am
here :) ).

Could you help me by checking the __nf_conntrack_find() use in
net/netfilter/xt_connlimit.c, and line 1246 of
net/netfilter/nf_conntrack_netlink.c? This part is a little bit gray for
me. :)

Thank you.

(Patch against the nf-next-2.6 tree, of course.)

Eric

 include/net/netfilter/nf_conntrack.h                  |   14 -
 include/net/netfilter/nf_conntrack_tuple.h            |    6 
 include/net/netns/conntrack.h                         |    5 
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c |   16 -
 net/ipv4/netfilter/nf_nat_core.c                      |    2 
 net/netfilter/nf_conntrack_core.c                     |  121 +++++-----
 net/netfilter/nf_conntrack_expect.c                   |    2 
 net/netfilter/nf_conntrack_helper.c                   |    7 
 net/netfilter/nf_conntrack_netlink.c                  |    7 
 net/netfilter/nf_conntrack_standalone.c               |   16 -
 10 files changed, 109 insertions(+), 87 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 4dfb793..6c3f964 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -91,8 +91,7 @@ struct nf_conn_help {
 #include <net/netfilter/ipv4/nf_conntrack_ipv4.h>
 #include <net/netfilter/ipv6/nf_conntrack_ipv6.h>
 
-struct nf_conn
-{
+struct nf_conn {
 	/* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
 	   plus 1 for any connection(s) we are `master' for */
 	struct nf_conntrack ct_general;
@@ -126,7 +125,6 @@ struct nf_conn
 #ifdef CONFIG_NET_NS
 	struct net *ct_net;
 #endif
-	struct rcu_head rcu;
 };
 
 static inline struct nf_conn *
@@ -190,9 +188,13 @@ static inline void nf_ct_put(struct nf_conn *ct)
 extern int nf_ct_l3proto_try_module_get(unsigned short l3proto);
 extern void nf_ct_l3proto_module_put(unsigned short l3proto);
 
-extern struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced);
-extern void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced,
-				 unsigned int size);
+/*
+ * Allocate a hashtable of hlist_head (if nulls == 0),
+ * or hlist_nulls_head (if nulls == 1)
+ */
+extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls);
+
+extern void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size);
 
 extern struct nf_conntrack_tuple_hash *
 __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple);
diff --git a/include/net/netfilter/nf_conntrack_tuple.h b/include/net/netfilter/nf_conntrack_tuple.h
index f2f6aa7..2628c15 100644
--- a/include/net/netfilter/nf_conntrack_tuple.h
+++ b/include/net/netfilter/nf_conntrack_tuple.h
@@ -12,6 +12,7 @@
 
 #include <linux/netfilter/x_tables.h>
 #include <linux/netfilter/nf_conntrack_tuple_common.h>
+#include <linux/list_nulls.h>
 
 /* A `tuple' is a structure containing the information to uniquely
   identify a connection.  ie. if two packets have the same tuple, they
@@ -146,9 +147,8 @@ static inline void nf_ct_dump_tuple(const struct nf_conntrack_tuple *t)
 	((enum ip_conntrack_dir)(h)->tuple.dst.dir)
 
 /* Connections have two entries in the hash table: one for each way */
-struct nf_conntrack_tuple_hash
-{
-	struct hlist_node hnode;
+struct nf_conntrack_tuple_hash {
+	struct hlist_nulls_node hnnode;
 	struct nf_conntrack_tuple tuple;
 };
 
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index f4498a6..9dc5840 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -2,6 +2,7 @@
 #define __NETNS_CONNTRACK_H
 
 #include <linux/list.h>
+#include <linux/list_nulls.h>
 #include <asm/atomic.h>
 
 struct ctl_table_header;
@@ -10,9 +11,9 @@ struct nf_conntrack_ecache;
 struct netns_ct {
 	atomic_t		count;
 	unsigned int		expect_count;
-	struct hlist_head	*hash;
+	struct hlist_nulls_head	*hash;
 	struct hlist_head	*expect_hash;
-	struct hlist_head	unconfirmed;
+	struct hlist_nulls_head	unconfirmed;
 	struct ip_conntrack_stat *stat;
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
 	struct nf_conntrack_ecache *ecache;
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 6ba5c55..fcbcf62 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -25,30 +25,30 @@ struct ct_iter_state {
 	unsigned int bucket;
 };
 
-static struct hlist_node *ct_get_first(struct seq_file *seq)
+static struct hlist_nulls_node *ct_get_first(struct seq_file *seq)
 {
 	struct net *net = seq_file_net(seq);
 	struct ct_iter_state *st = seq->private;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 
 	for (st->bucket = 0;
 	     st->bucket < nf_conntrack_htable_size;
 	     st->bucket++) {
 		n = rcu_dereference(net->ct.hash[st->bucket].first);
-		if (n)
+		if (!is_a_nulls(n))
 			return n;
 	}
 	return NULL;
 }
 
-static struct hlist_node *ct_get_next(struct seq_file *seq,
-				      struct hlist_node *head)
+static struct hlist_nulls_node *ct_get_next(struct seq_file *seq,
+				      struct hlist_nulls_node *head)
 {
 	struct net *net = seq_file_net(seq);
 	struct ct_iter_state *st = seq->private;
 
 	head = rcu_dereference(head->next);
-	while (head == NULL) {
+	while (is_a_nulls(head)) {
 		if (++st->bucket >= nf_conntrack_htable_size)
 			return NULL;
 		head = rcu_dereference(net->ct.hash[st->bucket].first);
@@ -56,9 +56,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq,
 	return head;
 }
 
-static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos)
+static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos)
 {
-	struct hlist_node *head = ct_get_first(seq);
+	struct hlist_nulls_node *head = ct_get_first(seq);
 
 	if (head)
 		while (pos && (head = ct_get_next(seq, head)))
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index a65cf69..fe65187 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -679,7 +679,7 @@ nfnetlink_parse_nat_setup(struct nf_conn *ct,
 static int __net_init nf_nat_net_init(struct net *net)
 {
 	net->ipv4.nat_bysource = nf_ct_alloc_hashtable(&nf_nat_htable_size,
-						       &net->ipv4.nat_vmalloced);
+						       &net->ipv4.nat_vmalloced, 0);
 	if (!net->ipv4.nat_bysource)
 		return -ENOMEM;
 	return 0;
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 55befe5..9f714e9 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -29,6 +29,7 @@
 #include <linux/netdevice.h>
 #include <linux/socket.h>
 #include <linux/mm.h>
+#include <linux/rculist_nulls.h>
 
 #include <net/netfilter/nf_conntrack.h>
 #include <net/netfilter/nf_conntrack_l3proto.h>
@@ -163,8 +164,8 @@ static void clean_from_lists(struct nf_conn *ct)
 {
 	pr_debug("clean_from_lists(%p)\n", ct);
-	hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode);
-	hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode);
+	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
+	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode);
 
 	/* Destroy all pending expectations */
 	nf_ct_remove_expectations(ct);
@@ -204,8 +205,8 @@ destroy_conntrack(struct nf_conntrack *nfct)
 
 	/* We overload first tuple to link into unconfirmed list. */
 	if (!nf_ct_is_confirmed(ct)) {
-		BUG_ON(hlist_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode));
-		hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode);
+		BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
+		hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
 	}
 
 	NF_CT_STAT_INC(net, delete);
@@ -242,18 +243,26 @@ static void death_by_timeout(unsigned long ul_conntrack)
 	nf_ct_put(ct);
 }
 
+/*
+ * Warning :
+ * - Caller must take a reference on returned object
+ *   and recheck nf_ct_tuple_equal(tuple, &h->tuple)
+ * OR
+ * - Caller must lock nf_conntrack_lock before calling this function
+ */
 struct nf_conntrack_tuple_hash *
 __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_tuple_hash *h;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 	unsigned int hash = hash_conntrack(tuple);
 
 	/* Disable BHs the entire time since we normally need to disable them
 	 * at least once for the stats anyway.
 	 */
 	local_bh_disable();
-	hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) {
+begin:
+	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
 		if (nf_ct_tuple_equal(tuple, &h->tuple)) {
 			NF_CT_STAT_INC(net, found);
 			local_bh_enable();
@@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
 		}
 		NF_CT_STAT_INC(net, searched);
 	}
+	/*
+	 * if the nulls value we got at the end of this lookup is
+	 * not the expected one, we must restart lookup.
+	 * We probably met an item that was moved to another chain.
+	 */
+	if (get_nulls_value(n) != hash)
+		goto begin;
 	local_bh_enable();
 
 	return NULL;
@@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple)
 	struct nf_conn *ct;
 
 	rcu_read_lock();
+begin:
 	h = __nf_conntrack_find(net, tuple);
 	if (h) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
+		else {
+			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) {
+				nf_ct_put(ct);
+				goto begin;
+			}
+		}
 	}
 	rcu_read_unlock();
 
@@ -293,9 +316,9 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct,
 {
 	struct net *net = nf_ct_net(ct);
 
-	hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode,
+	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
 			   &net->ct.hash[hash]);
-	hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode,
+	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
 			   &net->ct.hash[repl_hash]);
 }
 
@@ -318,7 +341,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
 	struct nf_conn_help *help;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 	enum ip_conntrack_info ctinfo;
 	struct net *net;
 
@@ -350,17 +373,17 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	/* See if there's one in the list already, including reverse:
 	   NAT could have grabbed it without realizing, since we're
 	   not in the hash.  If there is, we lost race. */
-	hlist_for_each_entry(h, n, &net->ct.hash[hash], hnode)
+	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
 		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
 				      &h->tuple))
 			goto out;
-	hlist_for_each_entry(h, n, &net->ct.hash[repl_hash], hnode)
+	hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode)
 		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple,
 				      &h->tuple))
 			goto out;
 
 	/* Remove from unconfirmed list */
-	hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode);
+	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
 
 	__nf_conntrack_hash_insert(ct, hash, repl_hash);
 	/* Timer relative to confirmation time, not original
@@ -399,14 +422,14 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 {
 	struct net *net = nf_ct_net(ignored_conntrack);
 	struct nf_conntrack_tuple_hash *h;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 	unsigned int hash = hash_conntrack(tuple);
 
 	/* Disable BHs the entire time since we need to disable them at
 	 * least once for the stats anyway.
 	 */
 	rcu_read_lock_bh();
-	hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) {
+	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
 		if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack &&
 		    nf_ct_tuple_equal(tuple, &h->tuple)) {
 			NF_CT_STAT_INC(net, found);
@@ -430,14 +453,14 @@ static noinline int early_drop(struct net *net, unsigned int hash)
 	/* Use oldest entry, which is roughly LRU */
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct = NULL, *tmp;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 	unsigned int i, cnt = 0;
 	int dropped = 0;
 
 	rcu_read_lock();
 	for (i = 0; i < nf_conntrack_htable_size; i++) {
-		hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash],
-					 hnode) {
+		hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash],
+					 hnnode) {
 			tmp = nf_ct_tuplehash_to_ctrack(h);
 			if (!test_bit(IPS_ASSURED_BIT, &tmp->status))
 				ct = tmp;
@@ -508,27 +531,19 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 #ifdef CONFIG_NET_NS
 	ct->ct_net = net;
 #endif
-	INIT_RCU_HEAD(&ct->rcu);
 
 	return ct;
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 
-static void nf_conntrack_free_rcu(struct rcu_head *head)
-{
-	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
-
-	nf_ct_ext_free(ct);
-	kmem_cache_free(nf_conntrack_cachep, ct);
-}
-
 void nf_conntrack_free(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 
 	nf_ct_ext_destroy(ct);
 	atomic_dec(&net->ct.count);
-	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
+	nf_ct_ext_free(ct);
+	kmem_cache_free(nf_conntrack_cachep, ct);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_free);
 
@@ -594,7 +609,7 @@ init_conntrack(struct net *net,
 	}
 
 	/* Overload tuple linked list to put us in unconfirmed list. */
-	hlist_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode,
+	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
 		       &net->ct.unconfirmed);
 
 	spin_unlock_bh(&nf_conntrack_lock);
@@ -934,17 +949,17 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 {
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
-	struct hlist_node *n;
+	struct hlist_nulls_node *n;
 
 	spin_lock_bh(&nf_conntrack_lock);
 	for (; *bucket < nf_conntrack_htable_size; (*bucket)++) {
-		hlist_for_each_entry(h, n, &net->ct.hash[*bucket], hnode) {
+		hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) {
 			ct = nf_ct_tuplehash_to_ctrack(h);
 			if (iter(ct, data))
 				goto found;
 		}
 	}
-	hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) {
+	hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (iter(ct, data))
 			set_bit(IPS_DYING_BIT, &ct->status);
@@ -992,7 +1007,7 @@ static int kill_all(struct nf_conn *i, void *data)
 	return 1;
 }
 
-void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, unsigned int size)
+void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size)
 {
 	if (vmalloced)
 		vfree(hash);
@@ -1060,26 +1075,28 @@ void nf_conntrack_cleanup(struct net *net)
 	}
 }
 
-struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced)
+void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls)
 {
-	struct hlist_head *hash;
-	unsigned int size, i;
+	struct hlist_nulls_head *hash;
+	unsigned int nr_slots, i;
+	size_t sz;
 
 	*vmalloced = 0;
 
-	size = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_head));
-	hash = (void*)__get_free_pages(GFP_KERNEL|__GFP_NOWARN,
-				       get_order(sizeof(struct hlist_head)
-						 * size));
+	BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct hlist_head));
+	nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_nulls_head));
+	sz = nr_slots * sizeof(struct hlist_nulls_head);
+	hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO,
+					get_order(sz));
 	if (!hash) {
 		*vmalloced = 1;
 		printk(KERN_WARNING "nf_conntrack: falling back to vmalloc.\n");
-		hash = vmalloc(sizeof(struct hlist_head) * size);
+		hash = __vmalloc(sz, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
 	}
 
-	if (hash)
-		for (i = 0; i < size; i++)
-			INIT_HLIST_HEAD(&hash[i]);
+	if (hash && nulls)
+		for (i = 0; i < nr_slots; i++)
+			INIT_HLIST_NULLS_HEAD(&hash[i], i);
 
 	return hash;
 }
@@ -1090,7 +1107,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	int i, bucket, vmalloced, old_vmalloced;
 	unsigned int hashsize, old_size;
 	int rnd;
-	struct hlist_head *hash, *old_hash;
+	struct hlist_nulls_head *hash, *old_hash;
 	struct nf_conntrack_tuple_hash *h;
 
 	/* On boot, we can set this without any fancy locking. */
@@ -1101,7 +1118,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	if (!hashsize)
 		return -EINVAL;
 
-	hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced);
+	hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced, 1);
 	if (!hash)
 		return -ENOMEM;
 
@@ -1116,12 +1133,12 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	 */
 	spin_lock_bh(&nf_conntrack_lock);
 	for (i = 0; i < nf_conntrack_htable_size; i++) {
-		while (!hlist_empty(&init_net.ct.hash[i])) {
-			h = hlist_entry(init_net.ct.hash[i].first,
-					struct nf_conntrack_tuple_hash, hnode);
-			hlist_del_rcu(&h->hnode);
+		while (!hlist_nulls_empty(&init_net.ct.hash[i])) {
+			h = hlist_nulls_entry(init_net.ct.hash[i].first,
+					struct nf_conntrack_tuple_hash, hnnode);
+			hlist_nulls_del_rcu(&h->hnnode);
 			bucket = __hash_conntrack(&h->tuple, hashsize, rnd);
-			hlist_add_head(&h->hnode, &hash[bucket]);
+			hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]);
 		}
 	}
 	old_size = nf_conntrack_htable_size;
@@ -1172,7 +1189,7 @@ static int nf_conntrack_init_init_net(void)
 
 	nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
 						sizeof(struct nf_conn),
-						0, 0, NULL);
+						0, SLAB_DESTROY_BY_RCU, NULL);
 	if (!nf_conntrack_cachep) {
 		printk(KERN_ERR "Unable to create nf_conn slab cache\n");
 		ret = -ENOMEM;
@@ -1212,7 +1229,7 @@ static int nf_conntrack_init_net(struct net *net)
 	if (ret < 0)
 		goto err_ecache;
 	net->ct.hash = nf_ct_alloc_hashtable(&nf_conntrack_htable_size,
-					     &net->ct.hash_vmalloc);
+					     &net->ct.hash_vmalloc, 1);
 	if (!net->ct.hash) {
 		ret = -ENOMEM;
 		printk(KERN_ERR "Unable to create nf_conntrack_hash\n");
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index 357ba39..3940f99 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -604,7 +604,7 @@ int nf_conntrack_expect_init(struct net *net)
 
 	net->ct.expect_count = 0;
 	net->ct.expect_hash = nf_ct_alloc_hashtable(&nf_ct_expect_hsize,
-						    &net->ct.expect_vmalloc);
+						    &net->ct.expect_vmalloc, 0);
 	if (net->ct.expect_hash == NULL)
 		goto err1;
 
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index a51bdac..6066144 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -158,6 +158,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conntrack_expect *exp;
 	const struct hlist_node *n, *next;
+	const struct hlist_nulls_node *nn;
 	unsigned int i;
 
 	/* Get rid of expectations */
@@ -174,10 +175,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 	}
 
 	/* Get rid of 
expecteds, set helpers to NULL. */ - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) + hlist_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) unhelp(h, me); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry(h, n, &net->ct.hash[i], hnode) + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) unhelp(h, me); } } @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, - &nf_ct_helper_vmalloc); + &nf_ct_helper_vmalloc, 0); if (!nf_ct_helper_hash) return -ENOMEM; diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 1b75c9e..6520c94 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -19,6 +19,7 @@ #include <linux/module.h> #include <linux/kernel.h> #include <linux/rculist.h> +#include <linux/rculist_nulls.h> #include <linux/types.h> #include <linux/timer.h> #include <linux/skbuff.h> @@ -536,7 +537,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) { struct nf_conn *ct, *last; struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = NLMSG_DATA(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; @@ -544,8 +545,8 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) last = (struct nf_conn *)cb->args[1]; for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) { restart: - hlist_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], + hnnode) { if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) continue; ct = nf_ct_tuplehash_to_ctrack(h); diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 4da54b0..f768368 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -44,30 +44,30 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { + while (is_a_nulls(head)) { if (++st->bucket >= nf_conntrack_htable_size) return NULL; head = rcu_dereference(net->ct.hash[st->bucket].first); @@ -75,9 +75,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at 
http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 44+ messages in thread
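The "nulls" trick the conversion above relies on is small enough to show in isolation. A minimal userspace sketch of the end-marker encoding, mirroring include/linux/list_nulls.h (simplified, no RCU; illustrative only):

	#include <stdio.h>

	struct hlist_nulls_node { struct hlist_nulls_node *next; };
	struct hlist_nulls_head { struct hlist_nulls_node *first; };

	/* An odd value can never be a real node address, so bit 0 marks
	 * "end of chain" and the remaining bits carry the bucket number. */
	#define NULLS_MARKER(v) \
		((struct hlist_nulls_node *)(1UL | (((unsigned long)(v)) << 1)))

	static int is_a_nulls(const struct hlist_nulls_node *p)
	{
		return (unsigned long)p & 1;
	}

	static unsigned long get_nulls_value(const struct hlist_nulls_node *p)
	{
		return (unsigned long)p >> 1;
	}

	int main(void)
	{
		struct hlist_nulls_head bucket = { .first = NULLS_MARKER(42) };

		/* A lookup that runs off the end of a chain learns *which*
		 * bucket's end it reached. If it is not the bucket the
		 * lookup started in, an entry was recycled into another
		 * chain while we walked and the lookup must restart --
		 * exactly the "if (get_nulls_value(n) != hash) goto begin;"
		 * test in the patch above. */
		if (is_a_nulls(bucket.first))
			printf("end of chain %lu\n", get_nulls_value(bucket.first));
		return 0;
	}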
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-25 3:53 ` Eric Dumazet @ 2009-03-25 13:39 ` Patrick McHardy 2009-03-25 13:44 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 13:39 UTC (permalink / raw) To: Eric Dumazet; +Cc: Netfilter Developers, Linux Netdev List Eric Dumazet wrote: > I have a litle problem on __nf_conntrack_find() being exported. > > Problem is that with SLAB_DESTROY_BY_RCU we must take a reference on object > to recheck it. So ideally only nf_conntrack_find_get() should be used, > or callers of __nf_conntrack_find() should lock nf_conntrack_lock > (as properly done for example in net/netfilter/nf_conntrack_netlink.c, line 1292) > > Here is preliminary patch for review (not tested at all, its 4h50 am here :) ) > > Could you help me, by checking __nf_conntrack_find() use in net/netfilter/xt_connlimit.c ? > and line 1246 of net/netfilter/nf_conntrack_netlink.c > > This part is a litle bit gray for me. :) In case of xt_connlimit, it seems fine to just take a reference. In case of ctnetlink, keeping the unreferenced lookup under the lock seems fine. We unfortunately have to export some internals like nf_conntrack lock for ctnetlink anyways, so I don't think it would be worth to change it to take references and unexport the lookup function. > +/* > + * Warning : > + * - Caller must take a reference on returned object > + * and recheck nf_ct_tuple_equal(tuple, &h->tuple) > + * OR > + * - Caller must lock nf_conntrack_lock before calling this function > + */ > struct nf_conntrack_tuple_hash * > __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) > { > struct nf_conntrack_tuple_hash *h; > - struct hlist_node *n; > + struct hlist_nulls_node *n; > unsigned int hash = hash_conntrack(tuple); > > /* Disable BHs the entire time since we normally need to disable them > * at least once for the stats anyway. > */ > local_bh_disable(); > - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { > +begin: > + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { > if (nf_ct_tuple_equal(tuple, &h->tuple)) { > NF_CT_STAT_INC(net, found); > local_bh_enable(); > @@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) > } > NF_CT_STAT_INC(net, searched); > } > + /* > + * if the nulls value we got at the end of this lookup is > + * not the expected one, we must restart lookup. > + * We probably met an item that was moved to another chain. > + */ > + if (get_nulls_value(n) != hash) > + goto begin; Are you sure this is enough? An entry might have been reused and added to the same chain I think, so I think we need to recheck the tuple. > local_bh_enable(); > > return NULL; > @@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple) > struct nf_conn *ct; > > rcu_read_lock(); > +begin: > h = __nf_conntrack_find(net, tuple); > if (h) { > ct = nf_ct_tuplehash_to_ctrack(h); > if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) > h = NULL; > + else { > + if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) { > + nf_ct_put(ct); > + goto begin; Ah I see, the hash comparison above is only an optimization? > + } > + } > } > rcu_read_unlock(); > ^ permalink raw reply [flat|nested] 44+ messages in thread
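To spell out the reuse race behind Patrick's question (an illustrative interleaving, not code from the patch): with SLAB_DESTROY_BY_RCU, kmem_cache_free() may hand the object straight to the next kmem_cache_alloc(), even inside the reader's RCU section:

	CPU 0 (lookup, under rcu_read_lock)    CPU 1
	-----------------------------------    ------------------------------
	h matches tuple T in the chain
	                                       last nf_ct_put() on that entry
	                                       nf_conntrack_free() -> slab
	                                       nf_conntrack_alloc() reuses the
	                                       same object for tuple T', which
	                                       may hash into the *same* chain
	atomic_inc_not_zero(&ct->ct_general.use)
	  succeeds -- the use count is live
	  again, but the entry now describes T'
	nf_ct_tuple_equal(tuple, &h->tuple)
	  fails -> nf_ct_put() and goto begin

The nulls check in __nf_conntrack_find() only catches entries that migrated to a different chain; reuse within the same chain is what the tuple re-check in nf_conntrack_find_get() covers, as Eric confirms a few messages below.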
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-25 13:39 ` Patrick McHardy @ 2009-03-25 13:44 ` Eric Dumazet 0 siblings, 0 replies; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 13:44 UTC (permalink / raw) To: Patrick McHardy; +Cc: Netfilter Developers, Linux Netdev List Patrick McHardy a écrit : > Eric Dumazet wrote: >> I have a litle problem on __nf_conntrack_find() being exported. >> >> Problem is that with SLAB_DESTROY_BY_RCU we must take a reference on >> object >> to recheck it. So ideally only nf_conntrack_find_get() should be used, >> or callers of __nf_conntrack_find() should lock nf_conntrack_lock >> (as properly done for example in net/netfilter/nf_conntrack_netlink.c, >> line 1292) >> >> Here is preliminary patch for review (not tested at all, its 4h50 am >> here :) ) >> >> Could you help me, by checking __nf_conntrack_find() use in >> net/netfilter/xt_connlimit.c ? >> and line 1246 of net/netfilter/nf_conntrack_netlink.c >> >> This part is a litle bit gray for me. :) > > In case of xt_connlimit, it seems fine to just take a reference. > In case of ctnetlink, keeping the unreferenced lookup under the > lock seems fine. We unfortunately have to export some internals > like nf_conntrack lock for ctnetlink anyways, so I don't think > it would be worth to change it to take references and unexport > the lookup function. > >> +/* >> + * Warning : >> + * - Caller must take a reference on returned object >> + * and recheck nf_ct_tuple_equal(tuple, &h->tuple) >> + * OR >> + * - Caller must lock nf_conntrack_lock before calling this function >> + */ >> struct nf_conntrack_tuple_hash * >> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple >> *tuple) >> { >> struct nf_conntrack_tuple_hash *h; >> - struct hlist_node *n; >> + struct hlist_nulls_node *n; >> unsigned int hash = hash_conntrack(tuple); >> >> /* Disable BHs the entire time since we normally need to disable >> them >> * at least once for the stats anyway. >> */ >> local_bh_disable(); >> - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { >> +begin: >> + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { >> if (nf_ct_tuple_equal(tuple, &h->tuple)) { >> NF_CT_STAT_INC(net, found); >> local_bh_enable(); >> @@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct >> nf_conntrack_tuple *tuple) >> } >> NF_CT_STAT_INC(net, searched); >> } >> + /* >> + * if the nulls value we got at the end of this lookup is >> + * not the expected one, we must restart lookup. >> + * We probably met an item that was moved to another chain. >> + */ >> + if (get_nulls_value(n) != hash) >> + goto begin; > > Are you sure this is enough? An entry might have been reused and added > to the same chain I think, so I think we need to recheck the tuple. Yes, done in caller > >> local_bh_enable(); >> >> return NULL; >> @@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const >> struct nf_conntrack_tuple *tuple) >> struct nf_conn *ct; >> >> rcu_read_lock(); >> +begin: >> h = __nf_conntrack_find(net, tuple); >> if (h) { >> ct = nf_ct_tuplehash_to_ctrack(h); >> if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) >> h = NULL; >> + else { >> + if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) { >> + nf_ct_put(ct); >> + goto begin; > > Ah I see, the hash comparison above is only an optimization? 
> >> + } >> + } >> } >> rcu_read_unlock(); >> > > check net/ipv4/udp.c for an example (__udp4_lib_lookup()) In case of UDP, key check is not returning true/false, but a score. So UDP case is a litle bit more complex than conntrack case. rcu_read_lock(); begin: result = NULL; badness = -1; sk_nulls_for_each_rcu(sk, node, &hslot->head) { score = compute_score(sk, net, saddr, hnum, sport, daddr, dport, dif); if (score > badness) { result = sk; badness = score; } } /* * if the nulls value we got at the end of this lookup is * not the expected one, we must restart lookup. * We probably met an item that was moved to another chain. */ if (get_nulls_value(node) != hash) goto begin; if (result) { if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt))) result = NULL; else if (unlikely(compute_score(result, net, saddr, hnum, sport, daddr, dport, dif) < badness)) { sock_put(result); goto begin; } } rcu_read_unlock(); return result; ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 12:07 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Eric Dumazet 2009-03-24 12:25 ` Eric Dumazet @ 2009-03-24 13:20 ` Joakim Tjernlund 2009-03-24 13:28 ` Patrick McHardy 2009-03-24 13:29 ` Eric Dumazet 2009-03-24 15:17 ` Maxime Bizon 2 siblings, 2 replies; 44+ messages in thread From: Joakim Tjernlund @ 2009-03-24 13:20 UTC (permalink / raw) To: Eric Dumazet; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 13:07:16: > > Joakim Tjernlund a écrit : > > Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53: > >> Joakim Tjernlund a écrit : > >>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 18:49:15: > >>>> Joakim Tjernlund wrote: > >>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33: > >>>>> > >>>>> > >>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a > >>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet > >>>>>>> to the board I see: > >>>>>>> > >>>>>> That means that something is leaking conntrack references, most > >>> likely > >>>>>> by leaking skbs. Since I haven't seen any other reports, my guess > >>> would > >>>>>> be the ucc_geth driver. > >>>>>> > >>>>> Mucking around with the ucc_geth driver I found that if I: > >>>>> - Move TX from IRQ to NAPI context > >>>>> - double the weight. > >>>>> - after booting up, wait a few mins until the JFFS2 GC kernel > > thread > >>> has > >>>>> stopped > >>>>> scanning the FS > >>>>> > >>>>> Then the "nf_conntrack: table full, dropping packet." msgs stops. > >>>>> Does this seem right to you guys? > >>>> No. As I said, something seems to be leaking packets. You should be > >>>> able to confirm that by checking the sk_buff slabs in /proc/slabinfo. > >>>> If that *doesn't* show any signs of a leak, please run "conntrack -E" > >>>> to capture the conntrack events before the "table full" message > >>>> appears and post the output. > >>> skbuff does not differ much, but others do > >>> > >>> Before ping: > >>> skbuff_fclone_cache 0 0 352 11 1 : tunables 54 27 > > 0 > >>> : slabdata 0 0 0 > >>> skbuff_head_cache 20 20 192 20 1 : tunables 120 60 > > 0 > >>> : slabdata 1 1 0 > >>> size-64 731 767 64 59 1 : tunables 120 60 > > 0 > >>> : slabdata 13 13 0 > >>> nf_conntrack 10 19 208 19 1 : tunables 120 60 > > 0 > >>> : slabdata 1 1 0 > >>> > >>> During ping: > >>> skbuff_fclone_cache 0 0 352 11 1 : tunables 54 27 > > 0 > >>> : slabdata 0 0 0 > >>> skbuff_head_cache 40 40 192 20 1 : tunables 120 60 > > 0 > >>> : slabdata 2 2 0 > >>> size-64 8909 8909 64 59 1 : tunables 120 60 > > 0 > >>> : slabdata 151 151 0 > >>> nf_conntrack 5111 5111 208 19 1 : tunables 120 60 > > 0 > >>> : slabdata 269 269 0 > >>> > >>> This feels more like the freeing of conntrack objects are delayed and > >>> builds up when ping flooding. > >>> > >>> Don't have "conntrack -E" for my embedded board so that will have to > > wait > >>> a bit longer. > >> I dont understand how your ping can use so many conntrack entries... > >> > >> Then, as I said yesterday, I believe you have a RCU delay, because of > >> a misbehaving driver or something... 
> >> > >> grep RCU .config > > grep RCU .config > > # RCU Subsystem > > CONFIG_CLASSIC_RCU=y > > # CONFIG_TREE_RCU is not set > > # CONFIG_PREEMPT_RCU is not set > > # CONFIG_TREE_RCU_TRACE is not set > > # CONFIG_PREEMPT_RCU_TRACE is not set > > # CONFIG_RCU_TORTURE_TEST is not set > > # CONFIG_RCU_CPU_STALL_DETECTOR is not set > > > >> grep CONFIG_SMP .config > > grep CONFIG_SMP .config > > # CONFIG_SMP is not set > > > >> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line > > 80) > >> as a workaround. It should force a quiescent state after 1000 freed > > conntracks. > > > > right, doing this almost killed all conntrack messages, had to stress it > > pretty > > hard before I saw handful "nf_conntrack: table full, dropping packet" > > > > RCU is not my cup of tea, do you have any ideas were to look? > > In a stress situation, you feed more deleted conntracks to call_rcu() than > the blimit (10 real freeing per RCU softirq invocation). > > So with default qhimark being 10000, this means about 10000 conntracks > can sit in RCU (per CPU) before being really freed. > > Only when hitting 10000, RCU enters a special mode to free all queued items, instead > of a small batch of 10 > > To solve your problem we can : > > 1) reduce qhimark from 10000 to 1000 (for example) > Probably should be done to reduce some spikes in RCU code when freeing > whole 10000 elements... > OR > 2) change conntrack tunable (max conntrack entries on your machine) > OR > 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count > in nf_conntrack_free() instead of callback. > > [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() The patch fixes the problem and the system feels a bit more responsive too, thanks. I guess I should probably do both 1) and 3) as my board is pretty slow too. Been trying to figure out a good value for NAPI weigth too. Currently my HW RX and TX queues are 16 pkgs deep and weigth is 16 too. If I move TX processing to NAPI context AND increase weigth to 32, the system is a lot more responsive during ping flooding. Does weigth 32 make sense when the HW TX and RX queues are 16? Jocke ^ permalink raw reply [flat|nested] 44+ messages in thread
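On the weight question, one plausible rule of thumb (an assumption about this driver experiment, not an upstream recommendation): if TX completions are cleaned in ucc_geth_poll() and counted against the NAPI budget, one poll should be allowed to drain a full RX ring plus a full TX ring, which for two 16-deep rings is exactly the weight of 32 that behaved well:

	/* hypothetical probe-path change matching the experiment above;
	 * UCC_GETH_DEV_WEIGHT is the driver's existing constant (16) */
	netif_napi_add(dev, &ugeth->napi, ucc_geth_poll,
		       2 * UCC_GETH_DEV_WEIGHT);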
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 13:20 ` Joakim Tjernlund @ 2009-03-24 13:28 ` Patrick McHardy 2009-03-24 13:29 ` Eric Dumazet 1 sibling, 0 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-24 13:28 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: Eric Dumazet, avorontsov, netdev, Paul E. McKenney Joakim Tjernlund wrote: > Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 13:07:16: >> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() > > The patch fixes the problem and the system feels a bit more responsive > too, thanks. Applied, thanks everyone. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 13:20 ` Joakim Tjernlund 2009-03-24 13:28 ` Patrick McHardy @ 2009-03-24 13:29 ` Eric Dumazet 2009-03-24 13:41 ` Joakim Tjernlund 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-24 13:29 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney Joakim Tjernlund a écrit : > Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 13:07:16: >> Joakim Tjernlund a écrit : >>> Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53: >>>> Joakim Tjernlund a écrit : >>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 18:49:15: >>>>>> Joakim Tjernlund wrote: >>>>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33: >>>>>>> >>>>>>> >>>>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a >>>>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet >>>>>>>>> to the board I see: >>>>>>>>> >>>>>>>> That means that something is leaking conntrack references, most >>>>> likely >>>>>>>> by leaking skbs. Since I haven't seen any other reports, my guess > >>>>> would >>>>>>>> be the ucc_geth driver. >>>>>>>> >>>>>>> Mucking around with the ucc_geth driver I found that if I: >>>>>>> - Move TX from IRQ to NAPI context >>>>>>> - double the weight. >>>>>>> - after booting up, wait a few mins until the JFFS2 GC kernel >>> thread >>>>> has >>>>>>> stopped >>>>>>> scanning the FS >>>>>>> >>>>>>> Then the "nf_conntrack: table full, dropping packet." msgs stops. >>>>>>> Does this seem right to you guys? >>>>>> No. As I said, something seems to be leaking packets. You should be >>>>>> able to confirm that by checking the sk_buff slabs in > /proc/slabinfo. >>>>>> If that *doesn't* show any signs of a leak, please run "conntrack > -E" >>>>>> to capture the conntrack events before the "table full" message >>>>>> appears and post the output. >>>>> skbuff does not differ much, but others do >>>>> >>>>> Before ping: >>>>> skbuff_fclone_cache 0 0 352 11 1 : tunables 54 > 27 >>> 0 >>>>> : slabdata 0 0 0 >>>>> skbuff_head_cache 20 20 192 20 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 1 1 0 >>>>> size-64 731 767 64 59 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 13 13 0 >>>>> nf_conntrack 10 19 208 19 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 1 1 0 >>>>> >>>>> During ping: >>>>> skbuff_fclone_cache 0 0 352 11 1 : tunables 54 > 27 >>> 0 >>>>> : slabdata 0 0 0 >>>>> skbuff_head_cache 40 40 192 20 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 2 2 0 >>>>> size-64 8909 8909 64 59 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 151 151 0 >>>>> nf_conntrack 5111 5111 208 19 1 : tunables 120 > 60 >>> 0 >>>>> : slabdata 269 269 0 >>>>> >>>>> This feels more like the freeing of conntrack objects are delayed > and >>>>> builds up when ping flooding. >>>>> >>>>> Don't have "conntrack -E" for my embedded board so that will have to > >>> wait >>>>> a bit longer. >>>> I dont understand how your ping can use so many conntrack entries... >>>> >>>> Then, as I said yesterday, I believe you have a RCU delay, because of >>>> a misbehaving driver or something... 
>>>> >>>> grep RCU .config >>> grep RCU .config >>> # RCU Subsystem >>> CONFIG_CLASSIC_RCU=y >>> # CONFIG_TREE_RCU is not set >>> # CONFIG_PREEMPT_RCU is not set >>> # CONFIG_TREE_RCU_TRACE is not set >>> # CONFIG_PREEMPT_RCU_TRACE is not set >>> # CONFIG_RCU_TORTURE_TEST is not set >>> # CONFIG_RCU_CPU_STALL_DETECTOR is not set >>> >>>> grep CONFIG_SMP .config >>> grep CONFIG_SMP .config >>> # CONFIG_SMP is not set >>> >>>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c > (line >>> 80) >>>> as a workaround. It should force a quiescent state after 1000 freed >>> conntracks. >>> >>> right, doing this almost killed all conntrack messages, had to stress > it >>> pretty >>> hard before I saw handful "nf_conntrack: table full, dropping packet" >>> >>> RCU is not my cup of tea, do you have any ideas were to look? >> In a stress situation, you feed more deleted conntracks to call_rcu() > than >> the blimit (10 real freeing per RCU softirq invocation). >> >> So with default qhimark being 10000, this means about 10000 conntracks >> can sit in RCU (per CPU) before being really freed. >> >> Only when hitting 10000, RCU enters a special mode to free all queued > items, instead >> of a small batch of 10 >> >> To solve your problem we can : >> >> 1) reduce qhimark from 10000 to 1000 (for example) >> Probably should be done to reduce some spikes in RCU code when > freeing >> whole 10000 elements... >> OR >> 2) change conntrack tunable (max conntrack entries on your machine) >> OR >> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count >> in nf_conntrack_free() instead of callback. >> >> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() > > The patch fixes the problem and the system feels a bit more responsive > too, thanks. > I guess I should probably do both 1) and 3) as my board is pretty slow > too. > > Been trying to figure out a good value for NAPI weigth too. Currently my > HW RX and TX queues are 16 pkgs deep and weigth is 16 too. If I move TX > processing > to NAPI context AND increase weigth to 32, the system is a lot more > responsive during > ping flooding. Does weigth 32 make sense when the HW TX and RX queues are > 16? If you only have one NIC, I dont understand why changing weight should make a difference. Are you referring to dev_weight or netdev_budget ? # cat /proc/sys/net/core/dev_weight 64 # cat /proc/sys/net/core/netdev_budget 300 ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 13:29 ` Eric Dumazet @ 2009-03-24 13:41 ` Joakim Tjernlund 0 siblings, 0 replies; 44+ messages in thread From: Joakim Tjernlund @ 2009-03-24 13:41 UTC (permalink / raw) To: Eric Dumazet; +Cc: avorontsov, Patrick McHardy, netdev, Paul E. McKenney Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 14:29:29: [SNIP] > >>>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c > > (line > >>> 80) > >>>> as a workaround. It should force a quiescent state after 1000 freed > >>> conntracks. > >>> > >>> right, doing this almost killed all conntrack messages, had to stress > > it > >>> pretty > >>> hard before I saw handful "nf_conntrack: table full, dropping packet" > >>> > >>> RCU is not my cup of tea, do you have any ideas were to look? > >> In a stress situation, you feed more deleted conntracks to call_rcu() > > than > >> the blimit (10 real freeing per RCU softirq invocation). > >> > >> So with default qhimark being 10000, this means about 10000 conntracks > >> can sit in RCU (per CPU) before being really freed. > >> > >> Only when hitting 10000, RCU enters a special mode to free all queued > > items, instead > >> of a small batch of 10 > >> > >> To solve your problem we can : > >> > >> 1) reduce qhimark from 10000 to 1000 (for example) > >> Probably should be done to reduce some spikes in RCU code when > > freeing > >> whole 10000 elements... > >> OR > >> 2) change conntrack tunable (max conntrack entries on your machine) > >> OR > >> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count > >> in nf_conntrack_free() instead of callback. > >> > >> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() > > > > The patch fixes the problem and the system feels a bit more responsive > > too, thanks. > > I guess I should probably do both 1) and 3) as my board is pretty slow > > too. > > > > Been trying to figure out a good value for NAPI weigth too. Currently my > > HW RX and TX queues are 16 pkgs deep and weigth is 16 too. If I move TX > > processing > > to NAPI context AND increase weigth to 32, the system is a lot more > > responsive during > > ping flooding. Does weigth 32 make sense when the HW TX and RX queues are > > 16? > > If you only have one NIC, I dont understand why changing weight should make > a difference. Are you referring to dev_weight or netdev_budget ? > > # cat /proc/sys/net/core/dev_weight > 64 > # cat /proc/sys/net/core/netdev_budget > 300 I mean this call in ucc_geth: netif_napi_add(dev, &ugeth->napi, ucc_geth_poll, UCC_GETH_DEV_WEIGHT); UCC_GETH_DEV_WEIGHT is 16 Noticed that rcuclassic.c has a module_param(qhimark, int, 0); But I can't figure out hot to set this qhimark from the cmdline. rcuclassic.c is not a module(I don't use modules at all) Jocke Jocke ^ permalink raw reply [flat|nested] 44+ messages in thread
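A likely answer to the cmdline question (per Documentation/kernel-parameters.txt, not verified on this board): parameters declared with module_param() in code built into the kernel are passed on the kernel command line as <filename>.<parameter>; the permission value of 0 only means no /sys/module entry is created, it does not disable the command-line form. For kernel/rcuclassic.c that should be:

	rcuclassic.qhimark=1000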
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 12:07 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Eric Dumazet 2009-03-24 12:25 ` Eric Dumazet 2009-03-24 13:20 ` Joakim Tjernlund @ 2009-03-24 15:17 ` Maxime Bizon 2009-03-24 15:21 ` Patrick McHardy ` (2 more replies) 2 siblings, 3 replies; 44+ messages in thread From: Maxime Bizon @ 2009-03-24 15:17 UTC (permalink / raw) To: Eric Dumazet Cc: Joakim Tjernlund, avorontsov, Patrick McHardy, netdev, Paul E. McKenney On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote: Hi Eric, > We use RCU to defer freeing of conntrack structures. In DOS situation, > RCU might accumulate about 10.000 elements per CPU in its internal > queues. To get accurate conntrack counts (at the expense of slightly > more RAM used), we might consider conntrack counter not taking into > account "about to be freed elements, waiting in RCU queues". We thus > decrement it in nf_conntrack_free(), not in the RCU callback. Your patch fixes the problem on my board too (embedded mips router 250Mhz), thanks. Yet I'm concerned about what you said concerning RAM usage. I have a very small amount on memory left on my board (less than 4M), and I tuned ip route cache size and nf_conntrack_max to make sure I won't go OOM. With your patch, does it mean 10000 conntrack entries can be allocated while nf_conntrack_max is say only 2048 ? Regards, -- Maxime ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 15:17 ` Maxime Bizon @ 2009-03-24 15:21 ` Patrick McHardy 2009-03-24 15:27 ` Eric Dumazet 2009-03-24 18:29 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Joakim Tjernlund 2 siblings, 0 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-24 15:21 UTC (permalink / raw) To: mbizon; +Cc: Eric Dumazet, Joakim Tjernlund, avorontsov, netdev, Paul E. McKenney Maxime Bizon wrote: > On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote: > >> We use RCU to defer freeing of conntrack structures. In DOS situation, >> RCU might accumulate about 10.000 elements per CPU in its internal >> queues. To get accurate conntrack counts (at the expense of slightly >> more RAM used), we might consider conntrack counter not taking into >> account "about to be freed elements, waiting in RCU queues". We thus >> decrement it in nf_conntrack_free(), not in the RCU callback. > > Your patch fixes the problem on my board too (embedded mips router > 250Mhz), thanks. > > Yet I'm concerned about what you said concerning RAM usage. I have a > very small amount on memory left on my board (less than 4M), and I tuned > ip route cache size and nf_conntrack_max to make sure I won't go OOM. > > With your patch, does it mean 10000 conntrack entries can be allocated > while nf_conntrack_max is say only 2048 ? Temporarily under worst-case circumstances, yes. Eric is already working on his proposed improvement though :) ^ permalink raw reply [flat|nested] 44+ messages in thread
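Rough numbers for that worst case, using figures already posted in this thread (208 bytes per nf_conn from Joakim's slabinfo, Maxime's nf_conntrack_max of 2048, the default qhimark of 10000, one CPU):

	peak RAM ~ (nf_conntrack_max + qhimark * nr_cpus) * sizeof(struct nf_conn)
	         ~ (2048 + 10000 * 1) * 208 bytes
	         ~ 2.4 MB

So on a box with less than 4 MB free the transient really can matter, which is what the SLAB_DESTROY_BY_RCU rework below avoids by never parking freed conntracks in RCU queues.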
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 15:17 ` Maxime Bizon 2009-03-24 15:21 ` Patrick McHardy @ 2009-03-24 15:27 ` Eric Dumazet 2009-03-24 19:54 ` [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Eric Dumazet 2009-03-25 17:53 ` [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs Eric Dumazet 2009-03-24 18:29 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Joakim Tjernlund 2 siblings, 2 replies; 44+ messages in thread From: Eric Dumazet @ 2009-03-24 15:27 UTC (permalink / raw) To: mbizon, Paul E. McKenney Cc: Joakim Tjernlund, avorontsov, Patrick McHardy, netdev Maxime Bizon a écrit : > On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote: > > Hi Eric, > >> We use RCU to defer freeing of conntrack structures. In DOS situation, >> RCU might accumulate about 10.000 elements per CPU in its internal >> queues. To get accurate conntrack counts (at the expense of slightly >> more RAM used), we might consider conntrack counter not taking into >> account "about to be freed elements, waiting in RCU queues". We thus >> decrement it in nf_conntrack_free(), not in the RCU callback. > > Your patch fixes the problem on my board too (embedded mips router > 250Mhz), thanks. > > Yet I'm concerned about what you said concerning RAM usage. I have a > very small amount on memory left on my board (less than 4M), and I tuned > ip route cache size and nf_conntrack_max to make sure I won't go OOM. > > With your patch, does it mean 10000 conntrack entries can be allocated > while nf_conntrack_max is say only 2048 ? Well... yes, RCU can have this 'interesting' OOM property. For small machines, you really want to lower RCU parameters, because as you said, we also push route cache entries in RCU queue, my patch being applied or not (But using call_rcu_bh(), so we have lower latencies I think) We are working on a SLAB_DESTROY_BY_RCU implementation so that conntrack wont use call_rcu() anymore, give us a couple of days :) Paul, could we have /sys knobs to be able to tune qhimark, blimit & qlowmark ? Thanks ^ permalink raw reply [flat|nested] 44+ messages in thread
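For readers following the thread: SLAB_DESTROY_BY_RCU does not delay individual frees at all -- kmem_cache_free() may hand the object to the very next kmem_cache_alloc(). What it guarantees is that the cache's backing pages are not returned to the page allocator until a grace period has elapsed, so a pointer found under rcu_read_lock() always points at *some* valid nf_conn-shaped memory. A sketch of the contract (kernel context; the cache setup line matches the patch Eric posts below):

	static struct kmem_cache *nf_conntrack_cachep;

	/* objects may be reused immediately; only the backing pages are
	 * RCU-deferred, so readers must re-validate whatever they found
	 * (refcount + key re-check) instead of assuming stability */
	nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
						sizeof(struct nf_conn),
						0, SLAB_DESTROY_BY_RCU, NULL);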
* [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() 2009-03-24 15:27 ` Eric Dumazet @ 2009-03-24 19:54 ` Eric Dumazet 2009-03-25 16:26 ` Patrick McHardy 2009-03-25 17:53 ` [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs Eric Dumazet 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-24 19:54 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev Eric Dumazet a écrit : > > We are working on a SLAB_DESTROY_BY_RCU implementation so that > conntrack wont use call_rcu() anymore, give us a couple of days :) > While working on this stuff, I found one suspect use of hlist_add_head() Its not a hot path, I believe following patch would make sure nothing wrong happens. If a chain contains element A and B, then we might build a new table with a new chain containing B and A (in this reverse order), and a cpu could see A->next = B (new pointer), B->next = A (old pointer) Thanks [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Using hlist_add_head() in nf_conntrack_set_hashsize() is quite dangerous. Without any barrier, one CPU could see a loop while doing its lookup. Its true new table cannot be seen by another cpu, but previous table is still readable. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 55befe5..54e983f 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1121,7 +1121,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) struct nf_conntrack_tuple_hash, hnode); hlist_del_rcu(&h->hnode); bucket = __hash_conntrack(&h->tuple, hashsize, rnd); - hlist_add_head(&h->hnode, &hash[bucket]); + hlist_add_head_rcu(&h->hnode, &hash[bucket]); } } old_size = nf_conntrack_htable_size; ^ permalink raw reply related [flat|nested] 44+ messages in thread
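The whole difference between the unsafe and the safe variant is one write barrier; hlist_add_head_rcu() as it looked in that era (a sketch mirroring include/linux/rculist.h):

	static inline void hlist_add_head_rcu(struct hlist_node *n,
					      struct hlist_head *h)
	{
		struct hlist_node *first = h->first;

		n->next = first;	/* fully link the node first ... */
		n->pprev = &h->first;
		/* ... then publish it: rcu_assign_pointer() implies
		 * smp_wmb(), so a concurrent reader that sees the new
		 * h->first is guaranteed to also see n->next already
		 * set, and cannot chase the stale-ordering loop Eric
		 * describes above. */
		rcu_assign_pointer(h->first, n);
		if (first)
			first->pprev = &n->next;
	}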
* Re: [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() 2009-03-24 19:54 ` [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Eric Dumazet @ 2009-03-25 16:26 ` Patrick McHardy 0 siblings, 0 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 16:26 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev Eric Dumazet wrote: > While working on this stuff, I found one suspect use of hlist_add_head() > > Its not a hot path, I believe following patch would make sure nothing > wrong happens. > > If a chain contains element A and B, then we might build a new table > with a new chain containing B and A (in this reverse order), and > a cpu could see A->next = B (new pointer), B->next = A (old pointer) I think you're right. > [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() > > Using hlist_add_head() in nf_conntrack_set_hashsize() is quite dangerous. > Without any barrier, one CPU could see a loop while doing its lookup. > Its true new table cannot be seen by another cpu, but previous table is still > readable. Applied, thanks Eric. ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-24 15:27 ` Eric Dumazet 2009-03-24 19:54 ` [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Eric Dumazet @ 2009-03-25 17:53 ` Eric Dumazet 2009-03-25 18:05 ` Patrick McHardy 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 17:53 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Hi Patrick Here is the patch I had the time to test this time... No problem so far on my machine. I did a UDP flood stress. Thank you [PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP. This permits an easy conversion from call_rcu() based hash lists to a SLAB_DESTROY_BY_RCU one. Avoiding call_rcu() delay at nf_conn freeing time has numerous gains. First, it doesnt fill RCU queues (up to 10000 elements per cpu). This reduces OOM possibility, if queued elements are not taken into account This reduces latency problems when RCU queue size hits hilimit and triggers emergency mode. - It allows fast reuse of just freed elements, permitting better use of CPU cache. - We delete rcu_head from "struct nf_conn", shrinking size of this structure by 8 or 16 bytes. This patch only takes care of "struct nf_conn". call_rcu() is still used for less critical conntrack parts, that may be converted later if necessary. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/netfilter/nf_conntrack.h | 14 - include/net/netfilter/nf_conntrack_tuple.h | 6 include/net/netns/conntrack.h | 5 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 16 - net/ipv4/netfilter/nf_nat_core.c | 2 net/netfilter/nf_conntrack_core.c | 123 +++++----- net/netfilter/nf_conntrack_expect.c | 2 net/netfilter/nf_conntrack_helper.c | 7 net/netfilter/nf_conntrack_netlink.c | 10 net/netfilter/nf_conntrack_standalone.c | 16 - net/netfilter/xt_connlimit.c | 4 11 files changed, 114 insertions(+), 91 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 4dfb793..6c3f964 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -91,8 +91,7 @@ struct nf_conn_help { #include <net/netfilter/ipv4/nf_conntrack_ipv4.h> #include <net/netfilter/ipv6/nf_conntrack_ipv6.h> -struct nf_conn -{ +struct nf_conn { /* Usage count in here is 1 for hash table/destruct timer, 1 per skb, plus 1 for any connection(s) we are `master' for */ struct nf_conntrack ct_general; @@ -126,7 +125,6 @@ struct nf_conn #ifdef CONFIG_NET_NS struct net *ct_net; #endif - struct rcu_head rcu; }; static inline struct nf_conn * @@ -190,9 +188,13 @@ static inline void nf_ct_put(struct nf_conn *ct) extern int nf_ct_l3proto_try_module_get(unsigned short l3proto); extern void nf_ct_l3proto_module_put(unsigned short l3proto); -extern struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced); -extern void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, - unsigned int size); +/* + * Allocate a hashtable of hlist_head (if nulls == 0), + * or hlist_nulls_head (if nulls == 1) + */ +extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls); + +extern void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size); extern struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple); diff --git 
a/include/net/netfilter/nf_conntrack_tuple.h b/include/net/netfilter/nf_conntrack_tuple.h index f2f6aa7..2628c15 100644 --- a/include/net/netfilter/nf_conntrack_tuple.h +++ b/include/net/netfilter/nf_conntrack_tuple.h @@ -12,6 +12,7 @@ #include <linux/netfilter/x_tables.h> #include <linux/netfilter/nf_conntrack_tuple_common.h> +#include <linux/list_nulls.h> /* A `tuple' is a structure containing the information to uniquely identify a connection. ie. if two packets have the same tuple, they @@ -146,9 +147,8 @@ static inline void nf_ct_dump_tuple(const struct nf_conntrack_tuple *t) ((enum ip_conntrack_dir)(h)->tuple.dst.dir) /* Connections have two entries in the hash table: one for each way */ -struct nf_conntrack_tuple_hash -{ - struct hlist_node hnode; +struct nf_conntrack_tuple_hash { + struct hlist_nulls_node hnnode; struct nf_conntrack_tuple tuple; }; diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index f4498a6..9dc5840 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -2,6 +2,7 @@ #define __NETNS_CONNTRACK_H #include <linux/list.h> +#include <linux/list_nulls.h> #include <asm/atomic.h> struct ctl_table_header; @@ -10,9 +11,9 @@ struct nf_conntrack_ecache; struct netns_ct { atomic_t count; unsigned int expect_count; - struct hlist_head *hash; + struct hlist_nulls_head *hash; struct hlist_head *expect_hash; - struct hlist_head unconfirmed; + struct hlist_nulls_head unconfirmed; struct ip_conntrack_stat *stat; #ifdef CONFIG_NF_CONNTRACK_EVENTS struct nf_conntrack_ecache *ecache; diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c index 6ba5c55..fcbcf62 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c @@ -25,30 +25,30 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { + while (is_a_nulls(head)) { if (++st->bucket >= nf_conntrack_htable_size) return NULL; head = rcu_dereference(net->ct.hash[st->bucket].first); @@ -56,9 +56,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c index a65cf69..fe65187 100644 --- a/net/ipv4/netfilter/nf_nat_core.c +++ b/net/ipv4/netfilter/nf_nat_core.c @@ -679,7 +679,7 @@ nfnetlink_parse_nat_setup(struct nf_conn *ct, static int __net_init nf_nat_net_init(struct net *net) { 
net->ipv4.nat_bysource = nf_ct_alloc_hashtable(&nf_nat_htable_size, - &net->ipv4.nat_vmalloced); + &net->ipv4.nat_vmalloced, 0); if (!net->ipv4.nat_bysource) return -ENOMEM; return 0; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 54e983f..9ed7a6b 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -29,6 +29,7 @@ #include <linux/netdevice.h> #include <linux/socket.h> #include <linux/mm.h> +#include <linux/rculist_nulls.h> #include <net/netfilter/nf_conntrack.h> #include <net/netfilter/nf_conntrack_l3proto.h> @@ -163,8 +164,8 @@ static void clean_from_lists(struct nf_conn *ct) { pr_debug("clean_from_lists(%p)\n", ct); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode); /* Destroy all pending expectations */ nf_ct_remove_expectations(ct); @@ -204,8 +205,8 @@ destroy_conntrack(struct nf_conntrack *nfct) /* We overload first tuple to link into unconfirmed list. */ if (!nf_ct_is_confirmed(ct)) { - BUG_ON(hlist_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode)); - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); } NF_CT_STAT_INC(net, delete); @@ -242,18 +243,26 @@ static void death_by_timeout(unsigned long ul_conntrack) nf_ct_put(ct); } +/* + * Warning : + * - Caller must take a reference on returned object + * and recheck nf_ct_tuple_equal(tuple, &h->tuple) + * OR + * - Caller must lock nf_conntrack_lock before calling this function + */ struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) { struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we normally need to disable them * at least once for the stats anyway. */ local_bh_disable(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { +begin: + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); local_bh_enable(); @@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) } NF_CT_STAT_INC(net, searched); } + /* + * if the nulls value we got at the end of this lookup is + * not the expected one, we must restart lookup. + * We probably met an item that was moved to another chain. 
+ */ + if (get_nulls_value(n) != hash) + goto begin; local_bh_enable(); return NULL; @@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple) struct nf_conn *ct; rcu_read_lock(); +begin: h = __nf_conntrack_find(net, tuple); if (h) { ct = nf_ct_tuplehash_to_ctrack(h); if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) h = NULL; + else { + if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) { + nf_ct_put(ct); + goto begin; + } + } } rcu_read_unlock(); @@ -293,9 +316,9 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct, { struct net *net = nf_ct_net(ct); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.hash[hash]); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode, &net->ct.hash[repl_hash]); } @@ -318,7 +341,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct nf_conn_help *help; - struct hlist_node *n; + struct hlist_nulls_node *n; enum ip_conntrack_info ctinfo; struct net *net; @@ -350,17 +373,17 @@ __nf_conntrack_confirm(struct sk_buff *skb) /* See if there's one in the list already, including reverse: NAT could have grabbed it without realizing, since we're not in the hash. If there is, we lost race. */ - hlist_for_each_entry(h, n, &net->ct.hash[hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, &h->tuple)) goto out; - hlist_for_each_entry(h, n, &net->ct.hash[repl_hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, &h->tuple)) goto out; /* Remove from unconfirmed list */ - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); __nf_conntrack_hash_insert(ct, hash, repl_hash); /* Timer relative to confirmation time, not original @@ -399,14 +422,14 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, { struct net *net = nf_ct_net(ignored_conntrack); struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we need to disable them at * least once for the stats anyway. 
*/ rcu_read_lock_bh(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack && nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); @@ -430,14 +453,14 @@ static noinline int early_drop(struct net *net, unsigned int hash) /* Use oldest entry, which is roughly LRU */ struct nf_conntrack_tuple_hash *h; struct nf_conn *ct = NULL, *tmp; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int i, cnt = 0; int dropped = 0; rcu_read_lock(); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], + hnnode) { tmp = nf_ct_tuplehash_to_ctrack(h); if (!test_bit(IPS_ASSURED_BIT, &tmp->status)) ct = tmp; @@ -508,27 +531,19 @@ struct nf_conn *nf_conntrack_alloc(struct net *net, #ifdef CONFIG_NET_NS ct->ct_net = net; #endif - INIT_RCU_HEAD(&ct->rcu); return ct; } EXPORT_SYMBOL_GPL(nf_conntrack_alloc); -static void nf_conntrack_free_rcu(struct rcu_head *head) -{ - struct nf_conn *ct = container_of(head, struct nf_conn, rcu); - - nf_ct_ext_free(ct); - kmem_cache_free(nf_conntrack_cachep, ct); -} - void nf_conntrack_free(struct nf_conn *ct) { struct net *net = nf_ct_net(ct); nf_ct_ext_destroy(ct); atomic_dec(&net->ct.count); - call_rcu(&ct->rcu, nf_conntrack_free_rcu); + nf_ct_ext_free(ct); + kmem_cache_free(nf_conntrack_cachep, ct); } EXPORT_SYMBOL_GPL(nf_conntrack_free); @@ -594,7 +609,7 @@ init_conntrack(struct net *net, } /* Overload tuple linked list to put us in unconfirmed list. */ - hlist_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.unconfirmed); spin_unlock_bh(&nf_conntrack_lock); @@ -934,17 +949,17 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), { struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; - struct hlist_node *n; + struct hlist_nulls_node *n; spin_lock_bh(&nf_conntrack_lock); for (; *bucket < nf_conntrack_htable_size; (*bucket)++) { - hlist_for_each_entry(h, n, &net->ct.hash[*bucket], hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) goto found; } } - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) set_bit(IPS_DYING_BIT, &ct->status); @@ -992,7 +1007,7 @@ static int kill_all(struct nf_conn *i, void *data) return 1; } -void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, unsigned int size) +void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size) { if (vmalloced) vfree(hash); @@ -1060,26 +1075,28 @@ void nf_conntrack_cleanup(struct net *net) } } -struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced) +void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls) { - struct hlist_head *hash; - unsigned int size, i; + struct hlist_nulls_head *hash; + unsigned int nr_slots, i; + size_t sz; *vmalloced = 0; - size = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_head)); - hash = (void*)__get_free_pages(GFP_KERNEL|__GFP_NOWARN, - get_order(sizeof(struct hlist_head) - * size)); + BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct hlist_head)); + nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct 
hlist_nulls_head)); + sz = nr_slots * sizeof(struct hlist_nulls_head); + hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, + get_order(sz)); if (!hash) { *vmalloced = 1; printk(KERN_WARNING "nf_conntrack: falling back to vmalloc.\n"); - hash = vmalloc(sizeof(struct hlist_head) * size); + hash = __vmalloc(sz, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL); } - if (hash) - for (i = 0; i < size; i++) - INIT_HLIST_HEAD(&hash[i]); + if (hash && nulls) + for (i = 0; i < nr_slots; i++) + INIT_HLIST_NULLS_HEAD(&hash[i], i); return hash; } @@ -1090,7 +1107,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) int i, bucket, vmalloced, old_vmalloced; unsigned int hashsize, old_size; int rnd; - struct hlist_head *hash, *old_hash; + struct hlist_nulls_head *hash, *old_hash; struct nf_conntrack_tuple_hash *h; /* On boot, we can set this without any fancy locking. */ @@ -1101,7 +1118,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) if (!hashsize) return -EINVAL; - hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced); + hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced, 1); if (!hash) return -ENOMEM; @@ -1116,12 +1133,12 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) */ spin_lock_bh(&nf_conntrack_lock); for (i = 0; i < nf_conntrack_htable_size; i++) { - while (!hlist_empty(&init_net.ct.hash[i])) { - h = hlist_entry(init_net.ct.hash[i].first, - struct nf_conntrack_tuple_hash, hnode); - hlist_del_rcu(&h->hnode); + while (!hlist_nulls_empty(&init_net.ct.hash[i])) { + h = hlist_nulls_entry(init_net.ct.hash[i].first, + struct nf_conntrack_tuple_hash, hnnode); + hlist_nulls_del_rcu(&h->hnode); bucket = __hash_conntrack(&h->tuple, hashsize, rnd); - hlist_add_head_rcu(&h->hnode, &hash[bucket]); + hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]); } } old_size = nf_conntrack_htable_size; @@ -1172,7 +1189,7 @@ static int nf_conntrack_init_init_net(void) nf_conntrack_cachep = kmem_cache_create("nf_conntrack", sizeof(struct nf_conn), - 0, 0, NULL); + 0, SLAB_DESTROY_BY_RCU, NULL); if (!nf_conntrack_cachep) { printk(KERN_ERR "Unable to create nf_conn slab cache\n"); ret = -ENOMEM; @@ -1202,7 +1219,7 @@ static int nf_conntrack_init_net(struct net *net) int ret; atomic_set(&net->ct.count, 0); - INIT_HLIST_HEAD(&net->ct.unconfirmed); + INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, 0); net->ct.stat = alloc_percpu(struct ip_conntrack_stat); if (!net->ct.stat) { ret = -ENOMEM; @@ -1212,7 +1229,7 @@ static int nf_conntrack_init_net(struct net *net) if (ret < 0) goto err_ecache; net->ct.hash = nf_ct_alloc_hashtable(&nf_conntrack_htable_size, - &net->ct.hash_vmalloc); + &net->ct.hash_vmalloc, 1); if (!net->ct.hash) { ret = -ENOMEM; printk(KERN_ERR "Unable to create nf_conntrack_hash\n"); diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index 357ba39..3940f99 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -604,7 +604,7 @@ int nf_conntrack_expect_init(struct net *net) net->ct.expect_count = 0; net->ct.expect_hash = nf_ct_alloc_hashtable(&nf_ct_expect_hsize, - &net->ct.expect_vmalloc); + &net->ct.expect_vmalloc, 0); if (net->ct.expect_hash == NULL) goto err1; diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c index a51bdac..6066144 100644 --- a/net/netfilter/nf_conntrack_helper.c +++ b/net/netfilter/nf_conntrack_helper.c @@ -158,6 +158,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, 
struct nf_conntrack_tuple_hash *h; struct nf_conntrack_expect *exp; const struct hlist_node *n, *next; + const struct hlist_nulls_node *nn; unsigned int i; /* Get rid of expectations */ @@ -174,10 +175,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, } /* Get rid of expecteds, set helpers to NULL. */ - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) + hlist_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) unhelp(h, me); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry(h, n, &net->ct.hash[i], hnode) + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) unhelp(h, me); } } @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, - &nf_ct_helper_vmalloc); + &nf_ct_helper_vmalloc, 0); if (!nf_ct_helper_hash) return -ENOMEM; diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 1b75c9e..e73272b 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -19,6 +19,7 @@ #include <linux/module.h> #include <linux/kernel.h> #include <linux/rculist.h> +#include <linux/rculist_nulls.h> #include <linux/types.h> #include <linux/timer.h> #include <linux/skbuff.h> @@ -536,7 +537,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) { struct nf_conn *ct, *last; struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = NLMSG_DATA(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; @@ -544,8 +545,8 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) last = (struct nf_conn *)cb->args[1]; for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) { restart: - hlist_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], + hnnode) { if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) continue; ct = nf_ct_tuplehash_to_ctrack(h); @@ -1242,13 +1243,12 @@ ctnetlink_create_conntrack(struct nlattr *cda[], if (err < 0) goto err2; - master_h = __nf_conntrack_find(&init_net, &master); + master_h = nf_conntrack_find_get(&init_net, &master); if (master_h == NULL) { err = -ENOENT; goto err2; } master_ct = nf_ct_tuplehash_to_ctrack(master_h); - nf_conntrack_get(&master_ct->ct_general); __set_bit(IPS_EXPECTED_BIT, &ct->status); ct->master = master_ct; } diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 4da54b0..f768368 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -44,30 +44,30 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while 
(head == NULL) { + while (is_a_nulls(head)) { if (++st->bucket >= nf_conntrack_htable_size) return NULL; head = rcu_dereference(net->ct.hash[st->bucket].first); @@ -75,9 +75,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c index 7f404cc..3bf47d8 100644 --- a/net/netfilter/xt_connlimit.c +++ b/net/netfilter/xt_connlimit.c @@ -123,7 +123,7 @@ static int count_them(struct xt_connlimit_data *data, /* check the saved connections */ list_for_each_entry_safe(conn, tmp, hash, list) { - found = __nf_conntrack_find(&init_net, &conn->tuple); + found = nf_conntrack_find_get(&init_net, &conn->tuple); found_ct = NULL; if (found != NULL) @@ -151,6 +151,7 @@ static int count_them(struct xt_connlimit_data *data, * we do not care about connections which are * closed already -> ditch it */ + nf_ct_put(found_ct); list_del(&conn->list); kfree(conn); continue; @@ -160,6 +161,7 @@ static int count_them(struct xt_connlimit_data *data, match->family)) /* same source network -> be counted! */ ++matches; + nf_ct_put(found_ct); } rcu_read_unlock(); ^ permalink raw reply related [flat|nested] 44+ messages in thread
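The hlist_nulls conversion above hinges on a detail that is easy to miss in the diff: chains no longer end in NULL but in a tagged "nulls" value that encodes the bucket number, which is what INIT_HLIST_NULLS_HEAD(&hash[i], i) stores into each slot. A minimal user-space model of that encoding follows, as a sketch only; the helper names mirror the kernel's is_a_nulls()/get_nulls_value(), but nothing here is copied from kernel headers:

#include <stdio.h>

struct nulls_node {
	struct nulls_node *next;	/* real pointer, or a tagged "nulls" value */
};

/* Terminate a chain with its bucket id; odd values can never be valid
 * pointers to aligned nodes, so readers can tell the two apart. */
static struct nulls_node *make_nulls(unsigned long bucket)
{
	return (struct nulls_node *)((bucket << 1) | 1UL);
}

static int is_a_nulls(const struct nulls_node *ptr)
{
	return ((unsigned long)ptr & 1UL) != 0;
}

static unsigned long get_nulls_value(const struct nulls_node *ptr)
{
	return (unsigned long)ptr >> 1;
}

int main(void)
{
	struct nulls_node a, b;
	struct nulls_node *n;

	b.next = make_nulls(7);		/* bucket 7 holds: a -> b -> nulls(7) */
	a.next = &b;

	for (n = &a; !is_a_nulls(n); n = n->next)
		printf("visiting node %p\n", (void *)n);
	printf("end of chain, bucket id %lu\n", get_nulls_value(n));
	return 0;
}

Because the terminator carries data instead of being NULL, a lockless reader that is dragged onto another chain by a concurrent delete-and-reinsert can notice: the nulls value at the end of its walk will not match the bucket it started from. That is the property the lookup and dump code in the rest of this thread relies on.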
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 17:53 ` [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs Eric Dumazet @ 2009-03-25 18:05 ` Patrick McHardy 2009-03-25 18:06 ` Patrick McHardy 2009-03-25 18:15 ` Eric Dumazet 0 siblings, 2 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 18:05 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Eric Dumazet wrote: > Hi Patrick > > Here is the patch I had the time to test this time... > No problem so far on my machine. > I did a UDP flood stress. Thanks Eric. Most parts look good, just two questions below: > diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > index 6ba5c55..fcbcf62 100644 > --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > @@ -25,30 +25,30 @@ struct ct_iter_state { > unsigned int bucket; > }; > > -static struct hlist_node *ct_get_first(struct seq_file *seq) > +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) > { > struct net *net = seq_file_net(seq); > struct ct_iter_state *st = seq->private; > - struct hlist_node *n; > + struct hlist_nulls_node *n; > > for (st->bucket = 0; > st->bucket < nf_conntrack_htable_size; > st->bucket++) { > n = rcu_dereference(net->ct.hash[st->bucket].first); > - if (n) > + if (!is_a_nulls(n)) > return n; > } > return NULL; > } Don't we need to make sure the entry is not reused while dumping it? > diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c > index a51bdac..6066144 100644 > --- a/net/netfilter/nf_conntrack_helper.c > +++ b/net/netfilter/nf_conntrack_helper.c > @@ -158,6 +158,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, > struct nf_conntrack_tuple_hash *h; > struct nf_conntrack_expect *exp; > const struct hlist_node *n, *next; > + const struct hlist_nulls_node *nn; > unsigned int i; > > /* Get rid of expectations */ > @@ -174,10 +175,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, > } > > /* Get rid of expecteds, set helpers to NULL. */ > - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) > + hlist_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) > unhelp(h, me); > for (i = 0; i < nf_conntrack_htable_size; i++) { > - hlist_for_each_entry(h, n, &net->ct.hash[i], hnode) > + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) > unhelp(h, me); > } > } > @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) > > nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ > nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, > - &nf_ct_helper_vmalloc); > + &nf_ct_helper_vmalloc, 0); This should be "1" I think since it wants a hlist_nulls hash. ^ permalink raw reply [flat|nested] 44+ messages in thread
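On the nulls argument itself: the two head types are each exactly one pointer wide, which is what the BUILD_BUG_ON() in nf_ct_alloc_hashtable() asserts before one allocator hands out buffers for both. Tables created with nulls == 0 (the expectation, helper and NAT-bysource hashes) keep ordinary NULL-terminated chains and, since the pages now come back zeroed via __GFP_ZERO, need no init loop at all; only nulls == 1 tables must stamp every slot with its bucket id. A reduced compile-time sketch of that assumption (struct shapes abbreviated, not copied from the kernel headers):

/* Reduced shapes of both list heads: each is a single pointer. */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

struct hlist_nulls_node { struct hlist_nulls_node *next, **pprev; };
struct hlist_nulls_head { struct hlist_nulls_node *first; };

/* The compile-time check the patch adds before sharing one allocator
 * (negative array size fails the build, like BUILD_BUG_ON). */
typedef char heads_are_same_size[sizeof(struct hlist_head) ==
				 sizeof(struct hlist_nulls_head) ? 1 : -1];

/* nulls == 0: zeroed pages already read as empty NULL-terminated chains.
 * nulls == 1: each slot must be re-terminated with its own bucket id
 * before lockless readers may walk it. */
static void init_table(struct hlist_nulls_head *hash, unsigned int n, int nulls)
{
	unsigned int i;

	if (nulls)
		for (i = 0; i < n; i++)
			hash[i].first = (struct hlist_nulls_node *)
					(((unsigned long)i << 1) | 1UL);
}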
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 18:05 ` Patrick McHardy @ 2009-03-25 18:06 ` Patrick McHardy 2009-03-25 18:15 ` Eric Dumazet 1 sibling, 0 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 18:06 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Patrick McHardy wrote: >> @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) >> >> nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ >> nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, >> - &nf_ct_helper_vmalloc); >> + &nf_ct_helper_vmalloc, 0); > > This should be "1" I think since it wants a hlist_nulls hash. OK I just realized my mistake, please ignore :) ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 18:05 ` Patrick McHardy 2009-03-25 18:06 ` Patrick McHardy @ 2009-03-25 18:15 ` Eric Dumazet 2009-03-25 18:24 ` Patrick McHardy 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 18:15 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Patrick McHardy a écrit : > Eric Dumazet wrote: >> Hi Patrick >> >> Here is the patch I had the time to test this time... >> No problem so far on my machine. >> I did a UDP flood stress. > > Thanks Eric. Most parts look good, just two questions below: > >> diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> index 6ba5c55..fcbcf62 100644 >> --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> @@ -25,30 +25,30 @@ struct ct_iter_state { >> unsigned int bucket; >> }; >> >> -static struct hlist_node *ct_get_first(struct seq_file *seq) >> +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) >> { >> struct net *net = seq_file_net(seq); >> struct ct_iter_state *st = seq->private; >> - struct hlist_node *n; >> + struct hlist_nulls_node *n; >> >> for (st->bucket = 0; >> st->bucket < nf_conntrack_htable_size; >> st->bucket++) { >> n = rcu_dereference(net->ct.hash[st->bucket].first); >> - if (n) >> + if (!is_a_nulls(n)) >> return n; >> } >> return NULL; >> } > > Don't we need to make sure the entry is not reused while dumping > it? > Ah yes, I forgot that for UDP/TCP I had to change locking on this part. Because messing with reference count was crazy... But in UDP/TCP we have different spinlock for each chain, so hold time was small enough. So I guess that with central conntrack lock, we need to take references on entries while dumping them. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 18:15 ` Eric Dumazet @ 2009-03-25 18:24 ` Patrick McHardy 2009-03-25 18:53 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 18:24 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Eric Dumazet wrote: > Patrick McHardy a écrit : >>> +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) >>> { >>> struct net *net = seq_file_net(seq); >>> struct ct_iter_state *st = seq->private; >>> - struct hlist_node *n; >>> + struct hlist_nulls_node *n; >>> >>> for (st->bucket = 0; >>> st->bucket < nf_conntrack_htable_size; >>> st->bucket++) { >>> n = rcu_dereference(net->ct.hash[st->bucket].first); >>> - if (n) >>> + if (!is_a_nulls(n)) >>> return n; >>> } >>> return NULL; >>> } >> Don't we need to make sure the entry is not reused while dumping >> it? >> > > Ah yes, I forgot that for UDP/TCP I had to change locking on this part. > Because messing with reference count was crazy... > But in UDP/TCP we have different spinlock for each chain, so hold time > was small enough. > > So I guess that with central conntrack lock, we need to take references on entries > while dumping them. Yes, I think so too. ^ permalink raw reply [flat|nested] 44+ messages in thread
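The discipline both are converging on here is the usual one for lockless dumps: an entry found under RCU may be freed (and, with SLAB_DESTROY_BY_RCU, even recycled) at any instant, so the dumper must first try to take a reference and must treat a refcount of zero as "already dead, skip it". A small user-space model of that primitive using C11 atomics; atomic_inc_not_zero() is the kernel helper the next version of the patch uses, everything else here is illustrative:

#include <stdatomic.h>
#include <stdbool.h>

struct entry {
	atomic_int use;		/* 0 means dead and possibly being freed */
	/* ... conntrack payload ... */
};

/* Model of atomic_inc_not_zero(): take a reference only if the entry
 * is still alive; never resurrect a refcount that already hit zero. */
static bool get_ref_if_live(struct entry *e)
{
	int old = atomic_load(&e->use);

	while (old != 0)
		if (atomic_compare_exchange_weak(&e->use, &old, old + 1))
			return true;	/* reference taken */
	return false;			/* lost the race: entry is dying */
}

static void put_ref(struct entry *e)
{
	atomic_fetch_sub(&e->use, 1);	/* the real code frees at zero */
}

/* Dump-side usage: pin the entry across the print, skip it otherwise. */
static void dump_entry(struct entry *e)
{
	if (!get_ref_if_live(e))
		return;			/* already dead, nothing to show */
	/* ... print e's fields: it cannot be recycled while we hold the ref ... */
	put_ref(e);
}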
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 18:24 ` Patrick McHardy @ 2009-03-25 18:53 ` Eric Dumazet 2009-03-25 19:00 ` Patrick McHardy 0 siblings, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 18:53 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Patrick McHardy a écrit : > Eric Dumazet wrote: >> Patrick McHardy a écrit : >>>> +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) >>>> { >>>> struct net *net = seq_file_net(seq); >>>> struct ct_iter_state *st = seq->private; >>>> - struct hlist_node *n; >>>> + struct hlist_nulls_node *n; >>>> >>>> for (st->bucket = 0; >>>> st->bucket < nf_conntrack_htable_size; >>>> st->bucket++) { >>>> n = rcu_dereference(net->ct.hash[st->bucket].first); >>>> - if (n) >>>> + if (!is_a_nulls(n)) >>>> return n; >>>> } >>>> return NULL; >>>> } >>> Don't we need to make sure the entry is not reused while dumping >>> it? >>> >> >> Ah yes, I forgot that for UDP/TCP I had to change locking on this part. >> Because messing with reference count was crazy... >> But in UDP/TCP we have different spinlock for each chain, so hold time >> was small enough. >> >> So I guess that with central conntrack lock, we need to take >> references on entries >> while dumping them. > > Yes, I think so too. Here is take 2 of the patch with proper ref counting on dumping. Thank you [PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP. This permits an easy conversion from call_rcu() based hash lists to a SLAB_DESTROY_BY_RCU one. Avoiding call_rcu() delay at nf_conn freeing time has numerous gains. - First, it doesn't fill RCU queues (up to 10000 elements per cpu). This reduces OOM possibility if queued elements are not taken into account, and reduces latency problems when RCU queue size hits its high limit and triggers emergency mode. - It allows fast reuse of just-freed elements, permitting better use of CPU cache. - We delete rcu_head from "struct nf_conn", shrinking size of this structure by 8 or 16 bytes. This patch only takes care of "struct nf_conn". call_rcu() is still used for less critical conntrack parts, which may be converted later if necessary.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/netfilter/nf_conntrack.h | 14 - include/net/netfilter/nf_conntrack_tuple.h | 6 include/net/netns/conntrack.h | 5 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 57 ++-- net/ipv4/netfilter/nf_nat_core.c | 2 net/netfilter/nf_conntrack_core.c | 123 +++++----- net/netfilter/nf_conntrack_expect.c | 2 net/netfilter/nf_conntrack_helper.c | 7 net/netfilter/nf_conntrack_netlink.c | 20 - net/netfilter/nf_conntrack_standalone.c | 51 ++-- net/netfilter/xt_connlimit.c | 6 11 files changed, 166 insertions(+), 127 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 4dfb793..6c3f964 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -91,8 +91,7 @@ struct nf_conn_help { #include <net/netfilter/ipv4/nf_conntrack_ipv4.h> #include <net/netfilter/ipv6/nf_conntrack_ipv6.h> -struct nf_conn -{ +struct nf_conn { /* Usage count in here is 1 for hash table/destruct timer, 1 per skb, plus 1 for any connection(s) we are `master' for */ struct nf_conntrack ct_general; @@ -126,7 +125,6 @@ struct nf_conn #ifdef CONFIG_NET_NS struct net *ct_net; #endif - struct rcu_head rcu; }; static inline struct nf_conn * @@ -190,9 +188,13 @@ static inline void nf_ct_put(struct nf_conn *ct) extern int nf_ct_l3proto_try_module_get(unsigned short l3proto); extern void nf_ct_l3proto_module_put(unsigned short l3proto); -extern struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced); -extern void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, - unsigned int size); +/* + * Allocate a hashtable of hlist_head (if nulls == 0), + * or hlist_nulls_head (if nulls == 1) + */ +extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls); + +extern void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size); extern struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple); diff --git a/include/net/netfilter/nf_conntrack_tuple.h b/include/net/netfilter/nf_conntrack_tuple.h index f2f6aa7..2628c15 100644 --- a/include/net/netfilter/nf_conntrack_tuple.h +++ b/include/net/netfilter/nf_conntrack_tuple.h @@ -12,6 +12,7 @@ #include <linux/netfilter/x_tables.h> #include <linux/netfilter/nf_conntrack_tuple_common.h> +#include <linux/list_nulls.h> /* A `tuple' is a structure containing the information to uniquely identify a connection. ie. 
if two packets have the same tuple, they @@ -146,9 +147,8 @@ static inline void nf_ct_dump_tuple(const struct nf_conntrack_tuple *t) ((enum ip_conntrack_dir)(h)->tuple.dst.dir) /* Connections have two entries in the hash table: one for each way */ -struct nf_conntrack_tuple_hash -{ - struct hlist_node hnode; +struct nf_conntrack_tuple_hash { + struct hlist_nulls_node hnnode; struct nf_conntrack_tuple tuple; }; diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index f4498a6..9dc5840 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -2,6 +2,7 @@ #define __NETNS_CONNTRACK_H #include <linux/list.h> +#include <linux/list_nulls.h> #include <asm/atomic.h> struct ctl_table_header; @@ -10,9 +11,9 @@ struct nf_conntrack_ecache; struct netns_ct { atomic_t count; unsigned int expect_count; - struct hlist_head *hash; + struct hlist_nulls_head *hash; struct hlist_head *expect_hash; - struct hlist_head unconfirmed; + struct hlist_nulls_head unconfirmed; struct ip_conntrack_stat *stat; #ifdef CONFIG_NF_CONNTRACK_EVENTS struct nf_conntrack_ecache *ecache; diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c index 6ba5c55..0b870b9 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c @@ -25,30 +25,30 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { + while (is_a_nulls(head)) { if (++st->bucket >= nf_conntrack_htable_size) return NULL; head = rcu_dereference(net->ct.hash[st->bucket].first); @@ -56,9 +56,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) @@ -87,69 +87,76 @@ static void ct_seq_stop(struct seq_file *s, void *v) static int ct_seq_show(struct seq_file *s, void *v) { - const struct nf_conntrack_tuple_hash *hash = v; - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); + struct nf_conntrack_tuple_hash *hash = v; + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); const struct nf_conntrack_l3proto *l3proto; const struct nf_conntrack_l4proto *l4proto; + int ret = 0; NF_CT_ASSERT(ct); + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) + return 0; + /* we only want to print DIR_ORIGINAL */ if (NF_CT_DIRECTION(hash)) - return 0; + goto release; if (nf_ct_l3num(ct) != AF_INET) - return 0; + goto release; l3proto = __nf_ct_l3proto_find(nf_ct_l3num(ct)); NF_CT_ASSERT(l3proto); l4proto = 
__nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); NF_CT_ASSERT(l4proto); + ret = -ENOSPC; if (seq_printf(s, "%-8s %u %ld ", l4proto->name, nf_ct_protonum(ct), timer_pending(&ct->timeout) ? (long)(ct->timeout.expires - jiffies)/HZ : 0) != 0) - return -ENOSPC; + goto release; if (l4proto->print_conntrack && l4proto->print_conntrack(s, ct)) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_ORIGINAL)) - return -ENOSPC; + goto release; if (!(test_bit(IPS_SEEN_REPLY_BIT, &ct->status))) if (seq_printf(s, "[UNREPLIED] ")) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_REPLY)) - return -ENOSPC; + goto release; if (test_bit(IPS_ASSURED_BIT, &ct->status)) if (seq_printf(s, "[ASSURED] ")) - return -ENOSPC; + goto release; #ifdef CONFIG_NF_CONNTRACK_MARK if (seq_printf(s, "mark=%u ", ct->mark)) - return -ENOSPC; + goto release; #endif #ifdef CONFIG_NF_CONNTRACK_SECMARK if (seq_printf(s, "secmark=%u ", ct->secmark)) - return -ENOSPC; + goto release; #endif if (seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use))) - return -ENOSPC; - - return 0; + goto release; + ret = 0; +release: + nf_ct_put(ct); + return ret; } static const struct seq_operations ct_seq_ops = { diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c index a65cf69..fe65187 100644 --- a/net/ipv4/netfilter/nf_nat_core.c +++ b/net/ipv4/netfilter/nf_nat_core.c @@ -679,7 +679,7 @@ nfnetlink_parse_nat_setup(struct nf_conn *ct, static int __net_init nf_nat_net_init(struct net *net) { net->ipv4.nat_bysource = nf_ct_alloc_hashtable(&nf_nat_htable_size, - &net->ipv4.nat_vmalloced); + &net->ipv4.nat_vmalloced, 0); if (!net->ipv4.nat_bysource) return -ENOMEM; return 0; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 54e983f..c55bbdc 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -29,6 +29,7 @@ #include <linux/netdevice.h> #include <linux/socket.h> #include <linux/mm.h> +#include <linux/rculist_nulls.h> #include <net/netfilter/nf_conntrack.h> #include <net/netfilter/nf_conntrack_l3proto.h> @@ -163,8 +164,8 @@ static void clean_from_lists(struct nf_conn *ct) { pr_debug("clean_from_lists(%p)\n", ct); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode); /* Destroy all pending expectations */ nf_ct_remove_expectations(ct); @@ -204,8 +205,8 @@ destroy_conntrack(struct nf_conntrack *nfct) /* We overload first tuple to link into unconfirmed list. 
*/ if (!nf_ct_is_confirmed(ct)) { - BUG_ON(hlist_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode)); - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); } NF_CT_STAT_INC(net, delete); @@ -242,18 +243,26 @@ static void death_by_timeout(unsigned long ul_conntrack) nf_ct_put(ct); } +/* + * Warning : + * - Caller must take a reference on returned object + * and recheck nf_ct_tuple_equal(tuple, &h->tuple) + * OR + * - Caller must lock nf_conntrack_lock before calling this function + */ struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) { struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we normally need to disable them * at least once for the stats anyway. */ local_bh_disable(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { +begin: + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); local_bh_enable(); @@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) } NF_CT_STAT_INC(net, searched); } + /* + * if the nulls value we got at the end of this lookup is + * not the expected one, we must restart lookup. + * We probably met an item that was moved to another chain. + */ + if (get_nulls_value(n) != hash) + goto begin; local_bh_enable(); return NULL; @@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple) struct nf_conn *ct; rcu_read_lock(); +begin: h = __nf_conntrack_find(net, tuple); if (h) { ct = nf_ct_tuplehash_to_ctrack(h); if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) h = NULL; + else { + if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) { + nf_ct_put(ct); + goto begin; + } + } } rcu_read_unlock(); @@ -293,9 +316,9 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct, { struct net *net = nf_ct_net(ct); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.hash[hash]); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode, &net->ct.hash[repl_hash]); } @@ -318,7 +341,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct nf_conn_help *help; - struct hlist_node *n; + struct hlist_nulls_node *n; enum ip_conntrack_info ctinfo; struct net *net; @@ -350,17 +373,17 @@ __nf_conntrack_confirm(struct sk_buff *skb) /* See if there's one in the list already, including reverse: NAT could have grabbed it without realizing, since we're not in the hash. If there is, we lost race. 
*/ - hlist_for_each_entry(h, n, &net->ct.hash[hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, &h->tuple)) goto out; - hlist_for_each_entry(h, n, &net->ct.hash[repl_hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, &h->tuple)) goto out; /* Remove from unconfirmed list */ - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); __nf_conntrack_hash_insert(ct, hash, repl_hash); /* Timer relative to confirmation time, not original @@ -399,14 +422,14 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, { struct net *net = nf_ct_net(ignored_conntrack); struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we need to disable them at * least once for the stats anyway. */ rcu_read_lock_bh(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack && nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); @@ -430,14 +453,14 @@ static noinline int early_drop(struct net *net, unsigned int hash) /* Use oldest entry, which is roughly LRU */ struct nf_conntrack_tuple_hash *h; struct nf_conn *ct = NULL, *tmp; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int i, cnt = 0; int dropped = 0; rcu_read_lock(); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], + hnnode) { tmp = nf_ct_tuplehash_to_ctrack(h); if (!test_bit(IPS_ASSURED_BIT, &tmp->status)) ct = tmp; @@ -508,27 +531,19 @@ struct nf_conn *nf_conntrack_alloc(struct net *net, #ifdef CONFIG_NET_NS ct->ct_net = net; #endif - INIT_RCU_HEAD(&ct->rcu); return ct; } EXPORT_SYMBOL_GPL(nf_conntrack_alloc); -static void nf_conntrack_free_rcu(struct rcu_head *head) -{ - struct nf_conn *ct = container_of(head, struct nf_conn, rcu); - - nf_ct_ext_free(ct); - kmem_cache_free(nf_conntrack_cachep, ct); -} - void nf_conntrack_free(struct nf_conn *ct) { struct net *net = nf_ct_net(ct); nf_ct_ext_destroy(ct); atomic_dec(&net->ct.count); - call_rcu(&ct->rcu, nf_conntrack_free_rcu); + nf_ct_ext_free(ct); + kmem_cache_free(nf_conntrack_cachep, ct); } EXPORT_SYMBOL_GPL(nf_conntrack_free); @@ -594,7 +609,7 @@ init_conntrack(struct net *net, } /* Overload tuple linked list to put us in unconfirmed list. 
*/ - hlist_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.unconfirmed); spin_unlock_bh(&nf_conntrack_lock); @@ -934,17 +949,17 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), { struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; - struct hlist_node *n; + struct hlist_nulls_node *n; spin_lock_bh(&nf_conntrack_lock); for (; *bucket < nf_conntrack_htable_size; (*bucket)++) { - hlist_for_each_entry(h, n, &net->ct.hash[*bucket], hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) goto found; } } - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) set_bit(IPS_DYING_BIT, &ct->status); @@ -992,7 +1007,7 @@ static int kill_all(struct nf_conn *i, void *data) return 1; } -void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, unsigned int size) +void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size) { if (vmalloced) vfree(hash); @@ -1060,26 +1075,28 @@ void nf_conntrack_cleanup(struct net *net) } } -struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced) +void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls) { - struct hlist_head *hash; - unsigned int size, i; + struct hlist_nulls_head *hash; + unsigned int nr_slots, i; + size_t sz; *vmalloced = 0; - size = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_head)); - hash = (void*)__get_free_pages(GFP_KERNEL|__GFP_NOWARN, - get_order(sizeof(struct hlist_head) - * size)); + BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct hlist_head)); + nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_nulls_head)); + sz = nr_slots * sizeof(struct hlist_nulls_head); + hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, + get_order(sz)); if (!hash) { *vmalloced = 1; printk(KERN_WARNING "nf_conntrack: falling back to vmalloc.\n"); - hash = vmalloc(sizeof(struct hlist_head) * size); + hash = __vmalloc(sz, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL); } - if (hash) - for (i = 0; i < size; i++) - INIT_HLIST_HEAD(&hash[i]); + if (hash && nulls) + for (i = 0; i < nr_slots; i++) + INIT_HLIST_NULLS_HEAD(&hash[i], i); return hash; } @@ -1090,7 +1107,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) int i, bucket, vmalloced, old_vmalloced; unsigned int hashsize, old_size; int rnd; - struct hlist_head *hash, *old_hash; + struct hlist_nulls_head *hash, *old_hash; struct nf_conntrack_tuple_hash *h; /* On boot, we can set this without any fancy locking. 
*/ @@ -1101,7 +1118,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) if (!hashsize) return -EINVAL; - hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced); + hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced, 1); if (!hash) return -ENOMEM; @@ -1116,12 +1133,12 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) */ spin_lock_bh(&nf_conntrack_lock); for (i = 0; i < nf_conntrack_htable_size; i++) { - while (!hlist_empty(&init_net.ct.hash[i])) { - h = hlist_entry(init_net.ct.hash[i].first, - struct nf_conntrack_tuple_hash, hnode); - hlist_del_rcu(&h->hnode); + while (!hlist_nulls_empty(&init_net.ct.hash[i])) { + h = hlist_nulls_entry(init_net.ct.hash[i].first, + struct nf_conntrack_tuple_hash, hnnode); + hlist_nulls_del_rcu(&h->hnnode); bucket = __hash_conntrack(&h->tuple, hashsize, rnd); - hlist_add_head_rcu(&h->hnode, &hash[bucket]); + hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]); } } old_size = nf_conntrack_htable_size; @@ -1172,7 +1189,7 @@ static int nf_conntrack_init_init_net(void) nf_conntrack_cachep = kmem_cache_create("nf_conntrack", sizeof(struct nf_conn), - 0, 0, NULL); + 0, SLAB_DESTROY_BY_RCU, NULL); if (!nf_conntrack_cachep) { printk(KERN_ERR "Unable to create nf_conn slab cache\n"); ret = -ENOMEM; @@ -1202,7 +1219,7 @@ static int nf_conntrack_init_net(struct net *net) int ret; atomic_set(&net->ct.count, 0); - INIT_HLIST_HEAD(&net->ct.unconfirmed); + INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, 0); net->ct.stat = alloc_percpu(struct ip_conntrack_stat); if (!net->ct.stat) { ret = -ENOMEM; @@ -1212,7 +1229,7 @@ static int nf_conntrack_init_net(struct net *net) if (ret < 0) goto err_ecache; net->ct.hash = nf_ct_alloc_hashtable(&nf_conntrack_htable_size, - &net->ct.hash_vmalloc); + &net->ct.hash_vmalloc, 1); if (!net->ct.hash) { ret = -ENOMEM; printk(KERN_ERR "Unable to create nf_conntrack_hash\n"); diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index 357ba39..3940f99 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -604,7 +604,7 @@ int nf_conntrack_expect_init(struct net *net) net->ct.expect_count = 0; net->ct.expect_hash = nf_ct_alloc_hashtable(&nf_ct_expect_hsize, - &net->ct.expect_vmalloc); + &net->ct.expect_vmalloc, 0); if (net->ct.expect_hash == NULL) goto err1; diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c index a51bdac..6066144 100644 --- a/net/netfilter/nf_conntrack_helper.c +++ b/net/netfilter/nf_conntrack_helper.c @@ -158,6 +158,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, struct nf_conntrack_tuple_hash *h; struct nf_conntrack_expect *exp; const struct hlist_node *n, *next; + const struct hlist_nulls_node *nn; unsigned int i; /* Get rid of expectations */ @@ -174,10 +175,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, } /* Get rid of expecteds, set helpers to NULL. 
*/ - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) + hlist_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) unhelp(h, me); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry(h, n, &net->ct.hash[i], hnode) + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) unhelp(h, me); } } @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, - &nf_ct_helper_vmalloc); + &nf_ct_helper_vmalloc, 0); if (!nf_ct_helper_hash) return -ENOMEM; diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 1b75c9e..349bbef 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -19,6 +19,7 @@ #include <linux/module.h> #include <linux/kernel.h> #include <linux/rculist.h> +#include <linux/rculist_nulls.h> #include <linux/types.h> #include <linux/timer.h> #include <linux/skbuff.h> @@ -536,7 +537,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) { struct nf_conn *ct, *last; struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = NLMSG_DATA(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; @@ -544,27 +545,27 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) last = (struct nf_conn *)cb->args[1]; for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) { restart: - hlist_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], + hnnode) { if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) continue; ct = nf_ct_tuplehash_to_ctrack(h); + if (!atomic_inc_not_zero(&ct->ct_general.use)) + continue; /* Dump entries of a given L3 protocol number. * If it is not specified, ie. l3proto == 0, * then dump everything. 
*/ if (l3proto && nf_ct_l3num(ct) != l3proto) - continue; + goto releasect; if (cb->args[1]) { if (ct != last) - continue; + goto releasect; cb->args[1] = 0; } if (ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).pid, cb->nlh->nlmsg_seq, IPCTNL_MSG_CT_NEW, 1, ct) < 0) { - if (!atomic_inc_not_zero(&ct->ct_general.use)) - continue; cb->args[1] = (unsigned long)ct; goto out; } @@ -577,6 +578,8 @@ restart: if (acct) memset(acct, 0, sizeof(struct nf_conn_counter[IP_CT_DIR_MAX])); } +releasect: + nf_ct_put(ct); } if (cb->args[1]) { cb->args[1] = 0; @@ -1242,13 +1245,12 @@ ctnetlink_create_conntrack(struct nlattr *cda[], if (err < 0) goto err2; - master_h = __nf_conntrack_find(&init_net, &master); + master_h = nf_conntrack_find_get(&init_net, &master); if (master_h == NULL) { err = -ENOENT; goto err2; } master_ct = nf_ct_tuplehash_to_ctrack(master_h); - nf_conntrack_get(&master_ct->ct_general); __set_bit(IPS_EXPECTED_BIT, &ct->status); ct->master = master_ct; } diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 4da54b0..3f10a1e 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -44,30 +44,30 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { + while (is_a_nulls(head)) { if (++st->bucket >= nf_conntrack_htable_size) return NULL; head = rcu_dereference(net->ct.hash[st->bucket].first); @@ -75,9 +75,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) @@ -107,67 +107,74 @@ static void ct_seq_stop(struct seq_file *s, void *v) /* return 0 on success, 1 in case of error */ static int ct_seq_show(struct seq_file *s, void *v) { - const struct nf_conntrack_tuple_hash *hash = v; - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); + struct nf_conntrack_tuple_hash *hash = v; + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); const struct nf_conntrack_l3proto *l3proto; const struct nf_conntrack_l4proto *l4proto; + int ret = 0; NF_CT_ASSERT(ct); + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) + return 0; /* we only want to print DIR_ORIGINAL */ if (NF_CT_DIRECTION(hash)) - return 0; + goto release; l3proto = __nf_ct_l3proto_find(nf_ct_l3num(ct)); NF_CT_ASSERT(l3proto); l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); NF_CT_ASSERT(l4proto); + ret = -ENOSPC; if (seq_printf(s, "%-8s %u %-8s %u %ld ", l3proto->name, nf_ct_l3num(ct), l4proto->name, nf_ct_protonum(ct), 
timer_pending(&ct->timeout) ? (long)(ct->timeout.expires - jiffies)/HZ : 0) != 0) - return -ENOSPC; + goto release; if (l4proto->print_conntrack && l4proto->print_conntrack(s, ct)) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_ORIGINAL)) - return -ENOSPC; + goto release; if (!(test_bit(IPS_SEEN_REPLY_BIT, &ct->status))) if (seq_printf(s, "[UNREPLIED] ")) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_REPLY)) - return -ENOSPC; + goto release; if (test_bit(IPS_ASSURED_BIT, &ct->status)) if (seq_printf(s, "[ASSURED] ")) - return -ENOSPC; + goto release; #if defined(CONFIG_NF_CONNTRACK_MARK) if (seq_printf(s, "mark=%u ", ct->mark)) - return -ENOSPC; + goto release; #endif #ifdef CONFIG_NF_CONNTRACK_SECMARK if (seq_printf(s, "secmark=%u ", ct->secmark)) - return -ENOSPC; + goto release; #endif if (seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use))) - return -ENOSPC; + goto release; + ret = 0; +release: + nf_ct_put(ct); - return 0; + return ret; } diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c index 7f404cc..6809809 100644 --- a/net/netfilter/xt_connlimit.c +++ b/net/netfilter/xt_connlimit.c @@ -108,7 +108,7 @@ static int count_them(struct xt_connlimit_data *data, const struct nf_conntrack_tuple_hash *found; struct xt_connlimit_conn *conn; struct xt_connlimit_conn *tmp; - const struct nf_conn *found_ct; + struct nf_conn *found_ct; struct list_head *hash; bool addit = true; int matches = 0; @@ -123,7 +123,7 @@ static int count_them(struct xt_connlimit_data *data, /* check the saved connections */ list_for_each_entry_safe(conn, tmp, hash, list) { - found = __nf_conntrack_find(&init_net, &conn->tuple); + found = nf_conntrack_find_get(&init_net, &conn->tuple); found_ct = NULL; if (found != NULL) @@ -151,6 +151,7 @@ static int count_them(struct xt_connlimit_data *data, * we do not care about connections which are * closed already -> ditch it */ + nf_ct_put(found_ct); list_del(&conn->list); kfree(conn); continue; @@ -160,6 +161,7 @@ static int count_them(struct xt_connlimit_data *data, * same source network -> be counted! */ ++matches; + nf_ct_put(found_ct); } rcu_read_unlock(); ^ permalink raw reply related [flat|nested] 44+ messages in thread
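The subtle point of take 2 sits in nf_conntrack_find_get(): SLAB_DESTROY_BY_RCU only guarantees that the memory stays a valid struct nf_conn while readers run, not that it still describes the same connection, so a successful refcount grab has to be followed by re-checking the tuple, and the lookup restarted on mismatch. A self-contained user-space sketch of that pattern; a flat array stands in for the hash chain, and all names here are illustrative, not the kernel API:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct tuple { unsigned int src, dst; };

struct conn {
	atomic_int use;		/* refcount, 0 == dead */
	struct tuple tuple;	/* key; may change when the slot is recycled */
};

static bool tuple_equal(const struct tuple *a, const struct tuple *b)
{
	return a->src == b->src && a->dst == b->dst;
}

/* same "inc only if not zero" primitive as in the earlier sketch */
static bool get_ref_if_live(struct conn *c)
{
	int old = atomic_load(&c->use);

	while (old != 0)
		if (atomic_compare_exchange_weak(&c->use, &old, old + 1))
			return true;
	return false;
}

static void put_ref(struct conn *c)
{
	atomic_fetch_sub(&c->use, 1);
}

/* Lookup discipline: match, pin, then re-check the key, because the
 * slot may have been freed and recycled for a different connection
 * between the match and the refcount increment. */
static struct conn *find_get(struct conn *table, size_t n, const struct tuple *t)
{
	size_t i;
begin:
	for (i = 0; i < n; i++) {
		struct conn *c = &table[i];

		if (!tuple_equal(&c->tuple, t))
			continue;
		if (!get_ref_if_live(c))
			return NULL;	/* matching entry is being destroyed */
		if (!tuple_equal(&c->tuple, t)) {
			put_ref(c);	/* recycled under us: restart the walk */
			goto begin;
		}
		return c;		/* caller now owns a reference */
	}
	return NULL;
}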
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 18:53 ` Eric Dumazet @ 2009-03-25 19:00 ` Patrick McHardy 2009-03-25 19:17 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 19:00 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Eric Dumazet wrote: > Here is take 2 of the patch with proper ref counting on dumping. Thanks, one final question about the seq-file handling: > diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > index 6ba5c55..0b870b9 100644 > --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c > @@ -25,30 +25,30 @@ struct ct_iter_state { > unsigned int bucket; > }; > > -static struct hlist_node *ct_get_first(struct seq_file *seq) > +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) > { > struct net *net = seq_file_net(seq); > struct ct_iter_state *st = seq->private; > - struct hlist_node *n; > + struct hlist_nulls_node *n; > > for (st->bucket = 0; > st->bucket < nf_conntrack_htable_size; > st->bucket++) { > n = rcu_dereference(net->ct.hash[st->bucket].first); > - if (n) > + if (!is_a_nulls(n)) > return n; > } > return NULL; > } > > -static struct hlist_node *ct_get_next(struct seq_file *seq, > - struct hlist_node *head) > +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, > + struct hlist_nulls_node *head) > { > struct net *net = seq_file_net(seq); > struct ct_iter_state *st = seq->private; > > head = rcu_dereference(head->next); > - while (head == NULL) { > + while (is_a_nulls(head)) { > if (++st->bucket >= nf_conntrack_htable_size) > return NULL; > head = rcu_dereference(net->ct.hash[st->bucket].first); > @@ -56,9 +56,9 @@ static struct hlist_node *ct_get_next(struct seq_file *seq, > return head; > } > > -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) > +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) > { > - struct hlist_node *head = ct_get_first(seq); > + struct hlist_nulls_node *head = ct_get_first(seq); > > if (head) > while (pos && (head = ct_get_next(seq, head))) > @@ -87,69 +87,76 @@ static void ct_seq_stop(struct seq_file *s, void *v) > > static int ct_seq_show(struct seq_file *s, void *v) > { > - const struct nf_conntrack_tuple_hash *hash = v; > - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); > + struct nf_conntrack_tuple_hash *hash = v; > + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); > const struct nf_conntrack_l3proto *l3proto; > const struct nf_conntrack_l4proto *l4proto; > + int ret = 0; > > NF_CT_ASSERT(ct); > + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) > + return 0; Can we assume the next pointer still points to the next entry in the same chain after the refcount dropped to zero? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 19:00 ` Patrick McHardy @ 2009-03-25 19:17 ` Eric Dumazet 2009-03-25 19:41 ` Patrick McHardy 0 siblings, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 19:17 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Patrick McHardy a écrit : > Eric Dumazet wrote: >> Here is take 2 of the patch with proper ref counting on dumping. > > Thanks, one final question about the seq-file handling: > >> diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> index 6ba5c55..0b870b9 100644 >> --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c >> @@ -25,30 +25,30 @@ struct ct_iter_state { >> unsigned int bucket; >> }; >> >> -static struct hlist_node *ct_get_first(struct seq_file *seq) >> +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) >> { >> struct net *net = seq_file_net(seq); >> struct ct_iter_state *st = seq->private; >> - struct hlist_node *n; >> + struct hlist_nulls_node *n; >> >> for (st->bucket = 0; >> st->bucket < nf_conntrack_htable_size; >> st->bucket++) { >> n = rcu_dereference(net->ct.hash[st->bucket].first); >> - if (n) >> + if (!is_a_nulls(n)) >> return n; >> } >> return NULL; >> } >> >> -static struct hlist_node *ct_get_next(struct seq_file *seq, >> - struct hlist_node *head) >> +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, >> + struct hlist_nulls_node *head) >> { >> struct net *net = seq_file_net(seq); >> struct ct_iter_state *st = seq->private; >> >> head = rcu_dereference(head->next); >> - while (head == NULL) { >> + while (is_a_nulls(head)) { >> if (++st->bucket >= nf_conntrack_htable_size) >> return NULL; >> head = rcu_dereference(net->ct.hash[st->bucket].first); >> @@ -56,9 +56,9 @@ static struct hlist_node *ct_get_next(struct >> seq_file *seq, >> return head; >> } >> >> -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) >> +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, >> loff_t pos) >> { >> - struct hlist_node *head = ct_get_first(seq); >> + struct hlist_nulls_node *head = ct_get_first(seq); >> >> if (head) >> while (pos && (head = ct_get_next(seq, head))) >> @@ -87,69 +87,76 @@ static void ct_seq_stop(struct seq_file *s, void *v) >> >> static int ct_seq_show(struct seq_file *s, void *v) >> { >> - const struct nf_conntrack_tuple_hash *hash = v; >> - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); >> + struct nf_conntrack_tuple_hash *hash = v; >> + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); >> const struct nf_conntrack_l3proto *l3proto; >> const struct nf_conntrack_l4proto *l4proto; >> + int ret = 0; >> >> NF_CT_ASSERT(ct); >> + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) >> + return 0; > > Can we assume the next pointer still points to the next entry > in the same chain after the refcount dropped to zero? > > > We are looking at chain N. If we cannot atomic_inc() refcount, we got some deleted entry. If we could atomic_inc, we can meet an entry that just moved to another chain X. When hitting its end, we continue the search to the N+1 chain so we only skip the end of previous chain (N). We can 'forget' some entries, we can print several times one given entry.
We could solve this by : 1) Checking hash value : if not one expected -> Going back to head of chain N, (potentially re-printing already handled entries) So it is not a *perfect* solution. 2) Use a locking to forbid writers (as done in UDP/TCP), but it is expensive and won't solve other problem : We won't avoid emitting same entry several times anyway (this is a flaw of current seq_file handling, since we 'count' entries to be skipped, and this is wrong if some entries were deleted or inserted meanwhile) We have same problem on /proc/net/udp & /proc/net/tcp, I am not sure we should care... Also, current resizing code can give to a /proc/net/ip_conntrack reader a problem, since hash table can switch while it's doing its dumping : many entries might be lost or regiven... ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 19:17 ` Eric Dumazet @ 2009-03-25 19:41 ` Patrick McHardy 2009-03-25 19:58 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 19:41 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Eric Dumazet wrote: > Patrick McHardy a écrit : >>> NF_CT_ASSERT(ct); >>> + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) >>> + return 0; >> Can we assume the next pointer still points to the next entry >> in the same chain after the refcount dropped to zero? >> > > We are looking at chain N. > If we cannot atomic_inc() refcount, we got some deleted entry. > If we could atomic_inc, we can meet an entry that just moved to another chain X. > > When hitting its end, we continue the search to the N+1 chain so we only > skip the end of previous chain (N). We can 'forget' some entries, we can print > several times one given entry. > > > We could solve this by : > > 1) Checking hash value : if not one expected -> > Going back to head of chain N, (potentially re-printing already handled entries) > So it is not a *perfect* solution. > > 2) Use a locking to forbid writers (as done in UDP/TCP), but it is expensive and > won't solve other problem : > > We won't avoid emitting same entry several times anyway (this is a flaw of > current seq_file handling, since we 'count' entries to be skipped, and this is > wrong if some entries were deleted or inserted meanwhile) > > We have same problem on /proc/net/udp & /proc/net/tcp, I am not sure we should care... I think double entries are not a problem, as you say, there are already other cases where this can happen. But I think we should try our best that every entry present at the start and still present at the end of a dump is also contained in the dump, otherwise the guarantees seem too weak to still be useful. Your first proposal would do exactly that, right? > Also, current resizing code can give to a /proc/net/ip_conntrack reader a problem, since > hash table can switch while it's doing its dumping : many entries might be lost or regiven... That's true. But it's a very rare operation, so I think it's mainly a documentation issue. ^ permalink raw reply [flat|nested] 44+ messages in thread
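The "first proposal" Patrick is endorsing is cheap precisely because of the nulls encoding: every chain already ends in a terminator naming its bucket, so a dump that runs off the end of a foreign chain can detect the mismatch and rescan the bucket it was supposed to be walking, which is what take 3 below does with get_nulls_value(). A sketch of such an iterator step, reusing the tagged-pointer model from the earlier example (struct and function names are illustrative, not the kernel API):

struct nulls_node {
	struct nulls_node *next;	/* real pointer, or tagged bucket id */
};

struct table {
	struct nulls_node **hash;	/* hash[i] chain ends with nulls(i) */
	unsigned int size;
};

struct iter {
	unsigned int bucket;
};

static int is_a_nulls(const struct nulls_node *ptr)
{
	return ((unsigned long)ptr & 1UL) != 0;
}

static unsigned long get_nulls_value(const struct nulls_node *ptr)
{
	return (unsigned long)ptr >> 1;
}

/* Advance to the next entry. If the chain end names our bucket, the
 * walk was complete and we move on; if it names another bucket, the
 * node we followed was recycled onto a foreign chain, so rescan the
 * bucket we are supposed to be in. (Kernel code reads head->next and
 * hash[] through rcu_dereference().) */
static struct nulls_node *iter_next(const struct table *tb, struct iter *it,
				    struct nulls_node *head)
{
	head = head->next;
	while (is_a_nulls(head)) {
		if (get_nulls_value(head) == it->bucket) {
			if (++it->bucket >= tb->size)
				return NULL;
		}
		head = tb->hash[it->bucket];
	}
	return head;
}

As Eric notes, this trades completeness for possible duplicates: entries already printed may be printed again after a rescan, but no entry that stays in the table for the whole dump is skipped.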
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 19:41 ` Patrick McHardy @ 2009-03-25 19:58 ` Eric Dumazet 2009-03-25 20:10 ` Patrick McHardy 0 siblings, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-25 19:58 UTC (permalink / raw) To: Patrick McHardy Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Patrick McHardy a écrit : > Eric Dumazet wrote: >> Patrick McHardy a écrit : >>>> NF_CT_ASSERT(ct); >>>> + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) >>>> + return 0; >>> Can we assume the next pointer still points to the next entry >>> in the same chain after the refcount dropped to zero? >>> >> >> We are looking at chain N. >> If we cannot atomic_inc() refcount, we got some deleted entry. >> If we could atomic_inc, we can meet an entry that just moved to >> another chain X. >> >> When hitting its end, we continue the search to the N+1 chain so we >> only skip the end of previous chain (N). We can 'forget' some entries, >> we can print >> several times one given entry. >> >> >> We could solve this by : >> >> 1) Checking hash value : if not one expected -> Going back to head >> of chain N, (potentially re-printing already handled entries) >> So it is not a *perfect* solution. >> >> 2) Use a locking to forbid writers (as done in UDP/TCP), but it is >> expensive and >> won't solve other problem : >> >> We won't avoid emitting same entry several times anyway (this is a flaw >> of current seq_file handling, since we 'count' entries to be skipped, >> and this is >> wrong if some entries were deleted or inserted meanwhile) >> >> We have same problem on /proc/net/udp & /proc/net/tcp, I am not sure >> we should care... > > I think double entries are not a problem, as you say, there > are already other cases where this can happen. But I think we > should try our best that every entry present at the start and > still present at the end of a dump is also contained in the > dump, otherwise the guarantees seem too weak to still be useful. > Your first proposal would do exactly that, right? If your concern is to not forget entries, and we are allowed to print some entries several times, then we can just check the final "nulls" value, and if we find a different value than expected for chain N, go back to the beginning of chain N. No need to check hash value (this could help avoid printing the same entry several times, we don't care that much) + while (is_a_nulls(head)) { + if (likely(get_nulls_value(head) == st->bucket)) { + if (++st->bucket >= nf_conntrack_htable_size) + return NULL; + } Thank you [PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP. This permits an easy conversion from call_rcu() based hash lists to a SLAB_DESTROY_BY_RCU one. Avoiding call_rcu() delay at nf_conn freeing time has numerous gains. - First, it doesn't fill RCU queues (up to 10000 elements per cpu). This reduces OOM possibility if queued elements are not taken into account, and reduces latency problems when RCU queue size hits its high limit and triggers emergency mode. - It allows fast reuse of just-freed elements, permitting better use of CPU cache. - We delete rcu_head from "struct nf_conn", shrinking size of this structure by 8 or 16 bytes. This patch only takes care of "struct nf_conn". call_rcu() is still used for less critical conntrack parts, which may be converted later if necessary.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/netfilter/nf_conntrack.h | 14 - include/net/netfilter/nf_conntrack_tuple.h | 6 include/net/netns/conntrack.h | 5 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 63 ++--- net/ipv4/netfilter/nf_nat_core.c | 2 net/netfilter/nf_conntrack_core.c | 123 +++++----- net/netfilter/nf_conntrack_expect.c | 2 net/netfilter/nf_conntrack_helper.c | 7 net/netfilter/nf_conntrack_netlink.c | 20 - net/netfilter/nf_conntrack_standalone.c | 57 ++-- net/netfilter/xt_connlimit.c | 6 11 files changed, 174 insertions(+), 131 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 4dfb793..6c3f964 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -91,8 +91,7 @@ struct nf_conn_help { #include <net/netfilter/ipv4/nf_conntrack_ipv4.h> #include <net/netfilter/ipv6/nf_conntrack_ipv6.h> -struct nf_conn -{ +struct nf_conn { /* Usage count in here is 1 for hash table/destruct timer, 1 per skb, plus 1 for any connection(s) we are `master' for */ struct nf_conntrack ct_general; @@ -126,7 +125,6 @@ struct nf_conn #ifdef CONFIG_NET_NS struct net *ct_net; #endif - struct rcu_head rcu; }; static inline struct nf_conn * @@ -190,9 +188,13 @@ static inline void nf_ct_put(struct nf_conn *ct) extern int nf_ct_l3proto_try_module_get(unsigned short l3proto); extern void nf_ct_l3proto_module_put(unsigned short l3proto); -extern struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced); -extern void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, - unsigned int size); +/* + * Allocate a hashtable of hlist_head (if nulls == 0), + * or hlist_nulls_head (if nulls == 1) + */ +extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls); + +extern void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size); extern struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple); diff --git a/include/net/netfilter/nf_conntrack_tuple.h b/include/net/netfilter/nf_conntrack_tuple.h index f2f6aa7..2628c15 100644 --- a/include/net/netfilter/nf_conntrack_tuple.h +++ b/include/net/netfilter/nf_conntrack_tuple.h @@ -12,6 +12,7 @@ #include <linux/netfilter/x_tables.h> #include <linux/netfilter/nf_conntrack_tuple_common.h> +#include <linux/list_nulls.h> /* A `tuple' is a structure containing the information to uniquely identify a connection. ie. 
if two packets have the same tuple, they @@ -146,9 +147,8 @@ static inline void nf_ct_dump_tuple(const struct nf_conntrack_tuple *t) ((enum ip_conntrack_dir)(h)->tuple.dst.dir) /* Connections have two entries in the hash table: one for each way */ -struct nf_conntrack_tuple_hash -{ - struct hlist_node hnode; +struct nf_conntrack_tuple_hash { + struct hlist_nulls_node hnnode; struct nf_conntrack_tuple tuple; }; diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index f4498a6..9dc5840 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -2,6 +2,7 @@ #define __NETNS_CONNTRACK_H #include <linux/list.h> +#include <linux/list_nulls.h> #include <asm/atomic.h> struct ctl_table_header; @@ -10,9 +11,9 @@ struct nf_conntrack_ecache; struct netns_ct { atomic_t count; unsigned int expect_count; - struct hlist_head *hash; + struct hlist_nulls_head *hash; struct hlist_head *expect_hash; - struct hlist_head unconfirmed; + struct hlist_nulls_head unconfirmed; struct ip_conntrack_stat *stat; #ifdef CONFIG_NF_CONNTRACK_EVENTS struct nf_conntrack_ecache *ecache; diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c index 6ba5c55..8668a3d 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c @@ -25,40 +25,42 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { - if (++st->bucket >= nf_conntrack_htable_size) - return NULL; + while (is_a_nulls(head)) { + if (likely(get_nulls_value(head) == st->bucket)) { + if (++st->bucket >= nf_conntrack_htable_size) + return NULL; + } head = rcu_dereference(net->ct.hash[st->bucket].first); } return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) @@ -87,69 +89,76 @@ static void ct_seq_stop(struct seq_file *s, void *v) static int ct_seq_show(struct seq_file *s, void *v) { - const struct nf_conntrack_tuple_hash *hash = v; - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); + struct nf_conntrack_tuple_hash *hash = v; + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); const struct nf_conntrack_l3proto *l3proto; const struct nf_conntrack_l4proto *l4proto; + int ret = 0; NF_CT_ASSERT(ct); + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) + return 0; + /* we only want to print DIR_ORIGINAL */ if (NF_CT_DIRECTION(hash)) - return 0; + goto release; if (nf_ct_l3num(ct) != AF_INET) - return 0; + goto release; l3proto = 
__nf_ct_l3proto_find(nf_ct_l3num(ct)); NF_CT_ASSERT(l3proto); l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); NF_CT_ASSERT(l4proto); + ret = -ENOSPC; if (seq_printf(s, "%-8s %u %ld ", l4proto->name, nf_ct_protonum(ct), timer_pending(&ct->timeout) ? (long)(ct->timeout.expires - jiffies)/HZ : 0) != 0) - return -ENOSPC; + goto release; if (l4proto->print_conntrack && l4proto->print_conntrack(s, ct)) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_ORIGINAL)) - return -ENOSPC; + goto release; if (!(test_bit(IPS_SEEN_REPLY_BIT, &ct->status))) if (seq_printf(s, "[UNREPLIED] ")) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_REPLY)) - return -ENOSPC; + goto release; if (test_bit(IPS_ASSURED_BIT, &ct->status)) if (seq_printf(s, "[ASSURED] ")) - return -ENOSPC; + goto release; #ifdef CONFIG_NF_CONNTRACK_MARK if (seq_printf(s, "mark=%u ", ct->mark)) - return -ENOSPC; + goto release; #endif #ifdef CONFIG_NF_CONNTRACK_SECMARK if (seq_printf(s, "secmark=%u ", ct->secmark)) - return -ENOSPC; + goto release; #endif if (seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use))) - return -ENOSPC; - - return 0; + goto release; + ret = 0; +release: + nf_ct_put(ct); + return ret; } static const struct seq_operations ct_seq_ops = { diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c index a65cf69..fe65187 100644 --- a/net/ipv4/netfilter/nf_nat_core.c +++ b/net/ipv4/netfilter/nf_nat_core.c @@ -679,7 +679,7 @@ nfnetlink_parse_nat_setup(struct nf_conn *ct, static int __net_init nf_nat_net_init(struct net *net) { net->ipv4.nat_bysource = nf_ct_alloc_hashtable(&nf_nat_htable_size, - &net->ipv4.nat_vmalloced); + &net->ipv4.nat_vmalloced, 0); if (!net->ipv4.nat_bysource) return -ENOMEM; return 0; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 54e983f..c55bbdc 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -29,6 +29,7 @@ #include <linux/netdevice.h> #include <linux/socket.h> #include <linux/mm.h> +#include <linux/rculist_nulls.h> #include <net/netfilter/nf_conntrack.h> #include <net/netfilter/nf_conntrack_l3proto.h> @@ -163,8 +164,8 @@ static void clean_from_lists(struct nf_conn *ct) { pr_debug("clean_from_lists(%p)\n", ct); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); - hlist_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode); /* Destroy all pending expectations */ nf_ct_remove_expectations(ct); @@ -204,8 +205,8 @@ destroy_conntrack(struct nf_conntrack *nfct) /* We overload first tuple to link into unconfirmed list. 
*/ if (!nf_ct_is_confirmed(ct)) { - BUG_ON(hlist_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode)); - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); } NF_CT_STAT_INC(net, delete); @@ -242,18 +243,26 @@ static void death_by_timeout(unsigned long ul_conntrack) nf_ct_put(ct); } +/* + * Warning : + * - Caller must take a reference on returned object + * and recheck nf_ct_tuple_equal(tuple, &h->tuple) + * OR + * - Caller must lock nf_conntrack_lock before calling this function + */ struct nf_conntrack_tuple_hash * __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) { struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we normally need to disable them * at least once for the stats anyway. */ local_bh_disable(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { +begin: + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); local_bh_enable(); @@ -261,6 +270,13 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple) } NF_CT_STAT_INC(net, searched); } + /* + * if the nulls value we got at the end of this lookup is + * not the expected one, we must restart lookup. + * We probably met an item that was moved to another chain. + */ + if (get_nulls_value(n) != hash) + goto begin; local_bh_enable(); return NULL; @@ -275,11 +291,18 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple) struct nf_conn *ct; rcu_read_lock(); +begin: h = __nf_conntrack_find(net, tuple); if (h) { ct = nf_ct_tuplehash_to_ctrack(h); if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) h = NULL; + else { + if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) { + nf_ct_put(ct); + goto begin; + } + } } rcu_read_unlock(); @@ -293,9 +316,9 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct, { struct net *net = nf_ct_net(ct); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.hash[hash]); - hlist_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode, &net->ct.hash[repl_hash]); } @@ -318,7 +341,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct nf_conn_help *help; - struct hlist_node *n; + struct hlist_nulls_node *n; enum ip_conntrack_info ctinfo; struct net *net; @@ -350,17 +373,17 @@ __nf_conntrack_confirm(struct sk_buff *skb) /* See if there's one in the list already, including reverse: NAT could have grabbed it without realizing, since we're not in the hash. If there is, we lost race. 
*/ - hlist_for_each_entry(h, n, &net->ct.hash[hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, &h->tuple)) goto out; - hlist_for_each_entry(h, n, &net->ct.hash[repl_hash], hnode) + hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode) if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, &h->tuple)) goto out; /* Remove from unconfirmed list */ - hlist_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); __nf_conntrack_hash_insert(ct, hash, repl_hash); /* Timer relative to confirmation time, not original @@ -399,14 +422,14 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, { struct net *net = nf_ct_net(ignored_conntrack); struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int hash = hash_conntrack(tuple); /* Disable BHs the entire time since we need to disable them at * least once for the stats anyway. */ rcu_read_lock_bh(); - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack && nf_ct_tuple_equal(tuple, &h->tuple)) { NF_CT_STAT_INC(net, found); @@ -430,14 +453,14 @@ static noinline int early_drop(struct net *net, unsigned int hash) /* Use oldest entry, which is roughly LRU */ struct nf_conntrack_tuple_hash *h; struct nf_conn *ct = NULL, *tmp; - struct hlist_node *n; + struct hlist_nulls_node *n; unsigned int i, cnt = 0; int dropped = 0; rcu_read_lock(); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry_rcu(h, n, &net->ct.hash[hash], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], + hnnode) { tmp = nf_ct_tuplehash_to_ctrack(h); if (!test_bit(IPS_ASSURED_BIT, &tmp->status)) ct = tmp; @@ -508,27 +531,19 @@ struct nf_conn *nf_conntrack_alloc(struct net *net, #ifdef CONFIG_NET_NS ct->ct_net = net; #endif - INIT_RCU_HEAD(&ct->rcu); return ct; } EXPORT_SYMBOL_GPL(nf_conntrack_alloc); -static void nf_conntrack_free_rcu(struct rcu_head *head) -{ - struct nf_conn *ct = container_of(head, struct nf_conn, rcu); - - nf_ct_ext_free(ct); - kmem_cache_free(nf_conntrack_cachep, ct); -} - void nf_conntrack_free(struct nf_conn *ct) { struct net *net = nf_ct_net(ct); nf_ct_ext_destroy(ct); atomic_dec(&net->ct.count); - call_rcu(&ct->rcu, nf_conntrack_free_rcu); + nf_ct_ext_free(ct); + kmem_cache_free(nf_conntrack_cachep, ct); } EXPORT_SYMBOL_GPL(nf_conntrack_free); @@ -594,7 +609,7 @@ init_conntrack(struct net *net, } /* Overload tuple linked list to put us in unconfirmed list. 
*/ - hlist_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnode, + hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &net->ct.unconfirmed); spin_unlock_bh(&nf_conntrack_lock); @@ -934,17 +949,17 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), { struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; - struct hlist_node *n; + struct hlist_nulls_node *n; spin_lock_bh(&nf_conntrack_lock); for (; *bucket < nf_conntrack_htable_size; (*bucket)++) { - hlist_for_each_entry(h, n, &net->ct.hash[*bucket], hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) goto found; } } - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) { + hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) set_bit(IPS_DYING_BIT, &ct->status); @@ -992,7 +1007,7 @@ static int kill_all(struct nf_conn *i, void *data) return 1; } -void nf_ct_free_hashtable(struct hlist_head *hash, int vmalloced, unsigned int size) +void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size) { if (vmalloced) vfree(hash); @@ -1060,26 +1075,28 @@ void nf_conntrack_cleanup(struct net *net) } } -struct hlist_head *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced) +void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int nulls) { - struct hlist_head *hash; - unsigned int size, i; + struct hlist_nulls_head *hash; + unsigned int nr_slots, i; + size_t sz; *vmalloced = 0; - size = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_head)); - hash = (void*)__get_free_pages(GFP_KERNEL|__GFP_NOWARN, - get_order(sizeof(struct hlist_head) - * size)); + BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct hlist_head)); + nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_nulls_head)); + sz = nr_slots * sizeof(struct hlist_nulls_head); + hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, + get_order(sz)); if (!hash) { *vmalloced = 1; printk(KERN_WARNING "nf_conntrack: falling back to vmalloc.\n"); - hash = vmalloc(sizeof(struct hlist_head) * size); + hash = __vmalloc(sz, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL); } - if (hash) - for (i = 0; i < size; i++) - INIT_HLIST_HEAD(&hash[i]); + if (hash && nulls) + for (i = 0; i < nr_slots; i++) + INIT_HLIST_NULLS_HEAD(&hash[i], i); return hash; } @@ -1090,7 +1107,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) int i, bucket, vmalloced, old_vmalloced; unsigned int hashsize, old_size; int rnd; - struct hlist_head *hash, *old_hash; + struct hlist_nulls_head *hash, *old_hash; struct nf_conntrack_tuple_hash *h; /* On boot, we can set this without any fancy locking. 
*/ @@ -1101,7 +1118,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) if (!hashsize) return -EINVAL; - hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced); + hash = nf_ct_alloc_hashtable(&hashsize, &vmalloced, 1); if (!hash) return -ENOMEM; @@ -1116,12 +1133,12 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) */ spin_lock_bh(&nf_conntrack_lock); for (i = 0; i < nf_conntrack_htable_size; i++) { - while (!hlist_empty(&init_net.ct.hash[i])) { - h = hlist_entry(init_net.ct.hash[i].first, - struct nf_conntrack_tuple_hash, hnode); - hlist_del_rcu(&h->hnode); + while (!hlist_nulls_empty(&init_net.ct.hash[i])) { + h = hlist_nulls_entry(init_net.ct.hash[i].first, + struct nf_conntrack_tuple_hash, hnnode); + hlist_nulls_del_rcu(&h->hnnode); bucket = __hash_conntrack(&h->tuple, hashsize, rnd); - hlist_add_head_rcu(&h->hnode, &hash[bucket]); + hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]); } } old_size = nf_conntrack_htable_size; @@ -1172,7 +1189,7 @@ static int nf_conntrack_init_init_net(void) nf_conntrack_cachep = kmem_cache_create("nf_conntrack", sizeof(struct nf_conn), - 0, 0, NULL); + 0, SLAB_DESTROY_BY_RCU, NULL); if (!nf_conntrack_cachep) { printk(KERN_ERR "Unable to create nf_conn slab cache\n"); ret = -ENOMEM; @@ -1202,7 +1219,7 @@ static int nf_conntrack_init_net(struct net *net) int ret; atomic_set(&net->ct.count, 0); - INIT_HLIST_HEAD(&net->ct.unconfirmed); + INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, 0); net->ct.stat = alloc_percpu(struct ip_conntrack_stat); if (!net->ct.stat) { ret = -ENOMEM; @@ -1212,7 +1229,7 @@ static int nf_conntrack_init_net(struct net *net) if (ret < 0) goto err_ecache; net->ct.hash = nf_ct_alloc_hashtable(&nf_conntrack_htable_size, - &net->ct.hash_vmalloc); + &net->ct.hash_vmalloc, 1); if (!net->ct.hash) { ret = -ENOMEM; printk(KERN_ERR "Unable to create nf_conntrack_hash\n"); diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index 357ba39..3940f99 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -604,7 +604,7 @@ int nf_conntrack_expect_init(struct net *net) net->ct.expect_count = 0; net->ct.expect_hash = nf_ct_alloc_hashtable(&nf_ct_expect_hsize, - &net->ct.expect_vmalloc); + &net->ct.expect_vmalloc, 0); if (net->ct.expect_hash == NULL) goto err1; diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c index a51bdac..6066144 100644 --- a/net/netfilter/nf_conntrack_helper.c +++ b/net/netfilter/nf_conntrack_helper.c @@ -158,6 +158,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, struct nf_conntrack_tuple_hash *h; struct nf_conntrack_expect *exp; const struct hlist_node *n, *next; + const struct hlist_nulls_node *nn; unsigned int i; /* Get rid of expectations */ @@ -174,10 +175,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, } /* Get rid of expecteds, set helpers to NULL. 
*/ - hlist_for_each_entry(h, n, &net->ct.unconfirmed, hnode) + hlist_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) unhelp(h, me); for (i = 0; i < nf_conntrack_htable_size; i++) { - hlist_for_each_entry(h, n, &net->ct.hash[i], hnode) + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) unhelp(h, me); } } @@ -217,7 +218,7 @@ int nf_conntrack_helper_init(void) nf_ct_helper_hsize = 1; /* gets rounded up to use one page */ nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, - &nf_ct_helper_vmalloc); + &nf_ct_helper_vmalloc, 0); if (!nf_ct_helper_hash) return -ENOMEM; diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 1b75c9e..349bbef 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -19,6 +19,7 @@ #include <linux/module.h> #include <linux/kernel.h> #include <linux/rculist.h> +#include <linux/rculist_nulls.h> #include <linux/types.h> #include <linux/timer.h> #include <linux/skbuff.h> @@ -536,7 +537,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) { struct nf_conn *ct, *last; struct nf_conntrack_tuple_hash *h; - struct hlist_node *n; + struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = NLMSG_DATA(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; @@ -544,27 +545,27 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) last = (struct nf_conn *)cb->args[1]; for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) { restart: - hlist_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], - hnode) { + hlist_nulls_for_each_entry_rcu(h, n, &init_net.ct.hash[cb->args[0]], + hnnode) { if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) continue; ct = nf_ct_tuplehash_to_ctrack(h); + if (!atomic_inc_not_zero(&ct->ct_general.use)) + continue; /* Dump entries of a given L3 protocol number. * If it is not specified, ie. l3proto == 0, * then dump everything. 
*/ if (l3proto && nf_ct_l3num(ct) != l3proto) - continue; + goto releasect; if (cb->args[1]) { if (ct != last) - continue; + goto releasect; cb->args[1] = 0; } if (ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).pid, cb->nlh->nlmsg_seq, IPCTNL_MSG_CT_NEW, 1, ct) < 0) { - if (!atomic_inc_not_zero(&ct->ct_general.use)) - continue; cb->args[1] = (unsigned long)ct; goto out; } @@ -577,6 +578,8 @@ restart: if (acct) memset(acct, 0, sizeof(struct nf_conn_counter[IP_CT_DIR_MAX])); } +releasect: + nf_ct_put(ct); } if (cb->args[1]) { cb->args[1] = 0; @@ -1242,13 +1245,12 @@ ctnetlink_create_conntrack(struct nlattr *cda[], if (err < 0) goto err2; - master_h = __nf_conntrack_find(&init_net, &master); + master_h = nf_conntrack_find_get(&init_net, &master); if (master_h == NULL) { err = -ENOENT; goto err2; } master_ct = nf_ct_tuplehash_to_ctrack(master_h); - nf_conntrack_get(&master_ct->ct_general); __set_bit(IPS_EXPECTED_BIT, &ct->status); ct->master = master_ct; } diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 4da54b0..1935153 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -44,40 +44,42 @@ struct ct_iter_state { unsigned int bucket; }; -static struct hlist_node *ct_get_first(struct seq_file *seq) +static struct hlist_nulls_node *ct_get_first(struct seq_file *seq) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; - struct hlist_node *n; + struct hlist_nulls_node *n; for (st->bucket = 0; st->bucket < nf_conntrack_htable_size; st->bucket++) { n = rcu_dereference(net->ct.hash[st->bucket].first); - if (n) + if (!is_a_nulls(n)) return n; } return NULL; } -static struct hlist_node *ct_get_next(struct seq_file *seq, - struct hlist_node *head) +static struct hlist_nulls_node *ct_get_next(struct seq_file *seq, + struct hlist_nulls_node *head) { struct net *net = seq_file_net(seq); struct ct_iter_state *st = seq->private; head = rcu_dereference(head->next); - while (head == NULL) { - if (++st->bucket >= nf_conntrack_htable_size) - return NULL; + while (is_a_nulls(head)) { + if (likely(get_nulls_value(head) == st->bucket)) { + if (++st->bucket >= nf_conntrack_htable_size) + return NULL; + } head = rcu_dereference(net->ct.hash[st->bucket].first); } return head; } -static struct hlist_node *ct_get_idx(struct seq_file *seq, loff_t pos) +static struct hlist_nulls_node *ct_get_idx(struct seq_file *seq, loff_t pos) { - struct hlist_node *head = ct_get_first(seq); + struct hlist_nulls_node *head = ct_get_first(seq); if (head) while (pos && (head = ct_get_next(seq, head))) @@ -107,67 +109,74 @@ static void ct_seq_stop(struct seq_file *s, void *v) /* return 0 on success, 1 in case of error */ static int ct_seq_show(struct seq_file *s, void *v) { - const struct nf_conntrack_tuple_hash *hash = v; - const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); + struct nf_conntrack_tuple_hash *hash = v; + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(hash); const struct nf_conntrack_l3proto *l3proto; const struct nf_conntrack_l4proto *l4proto; + int ret = 0; NF_CT_ASSERT(ct); + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) + return 0; /* we only want to print DIR_ORIGINAL */ if (NF_CT_DIRECTION(hash)) - return 0; + goto release; l3proto = __nf_ct_l3proto_find(nf_ct_l3num(ct)); NF_CT_ASSERT(l3proto); l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); NF_CT_ASSERT(l4proto); + ret = -ENOSPC; if (seq_printf(s, "%-8s %u %-8s %u %ld ", l3proto->name, nf_ct_l3num(ct), 
l4proto->name, nf_ct_protonum(ct), timer_pending(&ct->timeout) ? (long)(ct->timeout.expires - jiffies)/HZ : 0) != 0) - return -ENOSPC; + goto release; if (l4proto->print_conntrack && l4proto->print_conntrack(s, ct)) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_ORIGINAL)) - return -ENOSPC; + goto release; if (!(test_bit(IPS_SEEN_REPLY_BIT, &ct->status))) if (seq_printf(s, "[UNREPLIED] ")) - return -ENOSPC; + goto release; if (print_tuple(s, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, l3proto, l4proto)) - return -ENOSPC; + goto release; if (seq_print_acct(s, ct, IP_CT_DIR_REPLY)) - return -ENOSPC; + goto release; if (test_bit(IPS_ASSURED_BIT, &ct->status)) if (seq_printf(s, "[ASSURED] ")) - return -ENOSPC; + goto release; #if defined(CONFIG_NF_CONNTRACK_MARK) if (seq_printf(s, "mark=%u ", ct->mark)) - return -ENOSPC; + goto release; #endif #ifdef CONFIG_NF_CONNTRACK_SECMARK if (seq_printf(s, "secmark=%u ", ct->secmark)) - return -ENOSPC; + goto release; #endif if (seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use))) - return -ENOSPC; + goto release; + ret = 0; +release: + nf_ct_put(ct); return 0; } diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c index 7f404cc..6809809 100644 --- a/net/netfilter/xt_connlimit.c +++ b/net/netfilter/xt_connlimit.c @@ -108,7 +108,7 @@ static int count_them(struct xt_connlimit_data *data, const struct nf_conntrack_tuple_hash *found; struct xt_connlimit_conn *conn; struct xt_connlimit_conn *tmp; - const struct nf_conn *found_ct; + struct nf_conn *found_ct; struct list_head *hash; bool addit = true; int matches = 0; @@ -123,7 +123,7 @@ static int count_them(struct xt_connlimit_data *data, /* check the saved connections */ list_for_each_entry_safe(conn, tmp, hash, list) { - found = __nf_conntrack_find(&init_net, &conn->tuple); + found = nf_conntrack_find_get(&init_net, &conn->tuple); found_ct = NULL; if (found != NULL) @@ -151,6 +151,7 @@ static int count_them(struct xt_connlimit_data *data, * we do not care about connections which are * closed already -> ditch it */ + nf_ct_put(found_ct); list_del(&conn->list); kfree(conn); continue; @@ -160,6 +161,7 @@ static int count_them(struct xt_connlimit_data *data, match->family)) /* same source network -> be counted! */ ++matches; + nf_ct_put(found_ct); } rcu_read_unlock(); -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 44+ messages in thread
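The "nulls" trick the patch leans on is easy to miss in the diff noise. Below is a minimal user-space sketch of the idea (toy structures and names, not the kernel's list_nulls API from linux/list_nulls.h): each hash chain ends in a fake pointer whose low bit is set and whose remaining bits encode the bucket number, so a lockless reader that lands on a terminator for the wrong bucket knows the entry it was traversing was moved to another chain (its memory reused, as SLAB_DESTROY_BY_RCU allows) and must restart the walk.

#include <stdint.h>
#include <stdio.h>

/* Toy model of the kernel's list_nulls scheme; not the real API.
 * Terminator: low bit set, upper bits carry the bucket number. */
#define NULLS_MARKER(b) ((struct node *)(((uintptr_t)(b) << 1) | 1))
#define IS_NULLS(p)     ((uintptr_t)(p) & 1)
#define NULLS_VALUE(p)  ((unsigned long)((uintptr_t)(p) >> 1))

struct node {
	struct node *next;
	int key;
};

static struct node *bucket[2];

static void insert(int b, struct node *n)
{
	n->next = bucket[b];
	bucket[b] = n;
}

int main(void)
{
	struct node a = { .key = 1 }, b = { .key = 2 };

	bucket[0] = NULLS_MARKER(0);
	bucket[1] = NULLS_MARKER(1);
	insert(0, &b);
	insert(0, &a);			/* bucket 0: a -> b -> nulls(0) */

	struct node *p = &b;		/* a lockless reader paused on b ... */

	a.next = NULLS_MARKER(0);	/* ... while a writer unlinks b ... */
	insert(1, &b);			/* ... and reinserts it in bucket 1 */

	while (!IS_NULLS(p))		/* the reader resumes walking "bucket 0" */
		p = p->next;
	if (NULLS_VALUE(p) != 0)	/* kernel: get_nulls_value(n) != hash */
		printf("ended on nulls %lu, expected 0 -> restart the walk\n",
		       NULLS_VALUE(p));
	return 0;
}

This is exactly the test __nf_conntrack_find() and ct_get_next() gain above: one comparison at chain end on the fast path, and the check is only ever wrong in the safe direction, a spurious restart.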
* Re: [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs 2009-03-25 19:58 ` Eric Dumazet @ 2009-03-25 20:10 ` Patrick McHardy 0 siblings, 0 replies; 44+ messages in thread From: Patrick McHardy @ 2009-03-25 20:10 UTC (permalink / raw) To: Eric Dumazet Cc: mbizon, Paul E. McKenney, Joakim Tjernlund, avorontsov, netdev, Netfilter Developers Eric Dumazet wrote: > Patrick McHardy a écrit : >> I think double entries are not a problem, as you say, there >> are already other cases where this can happen. But I think we >> should try our best that every entry present at the start and >> still present at the end of a dump is also contained in the >> dump, otherwise the guarantees seem too weak to still be useful. >> Your first proposal would do exactly that, right? > > If your concern is to not forget entries, and we are allowed to print some entries several times, > then we can just check the final "nulls" value, and if we find a different value than expected for > chain N, go back to the beginning of chain N. > > No need to check the hash value (that would help avoid printing the same entry several times, but we don't care that much): > > + while (is_a_nulls(head)) { > + if (likely(get_nulls_value(head) == st->bucket)) { > + if (++st->bucket >= nf_conntrack_htable_size) > + return NULL; > + } Looks perfect :) > [PATCH] conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() > > Use the "hlist_nulls" infrastructure we added in 2.6.29 for the RCUification of UDP & TCP. > > This permits an easy conversion from call_rcu() based hash lists to > SLAB_DESTROY_BY_RCU ones. > > Avoiding the call_rcu() delay at nf_conn freeing time has numerous gains: > > - It doesn't fill RCU queues (up to 10000 elements per CPU). > This reduces the OOM possibility when queued elements are not taken into account, > and reduces latency problems when the RCU queue size hits its high limit and triggers > emergency mode. > > - It allows fast reuse of just-freed elements, permitting better use of > the CPU cache. > > - We delete rcu_head from "struct nf_conn", shrinking the size of this structure > by 8 or 16 bytes. > > This patch only takes care of "struct nf_conn". > call_rcu() is still used for less critical conntrack parts, which may > be converted later if necessary. Applied, thanks a lot. ^ permalink raw reply [flat|nested] 44+ messages in thread
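Worth spelling out, because it is the subtle half of the SLAB_DESTROY_BY_RCU contract applied here: after a lockless lookup finds an entry, the reader must take a reference only if the refcount is still non-zero, and must then re-check the identity, since the slab object may have been freed and reused for a different connection in between. A compressed user-space sketch of the pattern nf_conntrack_find_get() uses in the applied patch (illustrative names; the kernel's atomic_inc_not_zero() is the CAS loop below):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* User-space toy of the lookup protocol; names are illustrative. */
struct conn {
	atomic_uint use;	/* refcount: 0 means the object is dying */
	char tuple[64];		/* identity; may change if the slot is reused */
};

/* Take a reference only if the object is still live. */
static bool get_ref(struct conn *c)
{
	unsigned int old = atomic_load(&c->use);

	while (old != 0)
		if (atomic_compare_exchange_weak(&c->use, &old, old + 1))
			return true;
	return false;		/* refcount already hit zero: skip the entry */
}

static void put_ref(struct conn *c)
{
	atomic_fetch_sub(&c->use, 1);	/* the real code frees at zero */
}

static struct conn *find_get(struct conn *c, const char *tuple)
{
	if (!get_ref(c))
		return NULL;
	if (strcmp(c->tuple, tuple) != 0) {
		/* The slot was freed and reused for another connection
		 * between the hash walk and the refcount grab: drop the
		 * reference and (in the kernel code) goto begin to retry. */
		put_ref(c);
		return NULL;
	}
	return c;
}

int main(void)
{
	static struct conn c;

	atomic_init(&c.use, 1);
	strcpy(c.tuple, "tcp 192.168.1.15:56445 -> 192.168.1.17:23");
	printf("lookup: %s\n",
	       find_get(&c, "tcp 192.168.1.15:56445 -> 192.168.1.17:23")
	       ? "found, reference taken" : "missed");
	return 0;
}

The seq_file dumpers follow the same discipline: the atomic_inc_not_zero() added at the top of ct_seq_show() pins the entry for the duration of the print, and nf_ct_put() at the release label drops it.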
* Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() 2009-03-24 15:17 ` Maxime Bizon 2009-03-24 15:21 ` Patrick McHardy 2009-03-24 15:27 ` Eric Dumazet @ 2009-03-24 18:29 ` Joakim Tjernlund 2 siblings, 0 replies; 44+ messages in thread From: Joakim Tjernlund @ 2009-03-24 18:29 UTC (permalink / raw) To: mbizon; +Cc: avorontsov, Eric Dumazet, Patrick McHardy, netdev, Paul E. McKenney Maxime Bizon <mbizon@freebox.fr> wrote on 24/03/2009 16:17:30: > > > On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote: > > Hi Eric, > > > We use RCU to defer freeing of conntrack structures. In a DoS situation, > > RCU might accumulate about 10,000 elements per CPU in its internal > > queues. To get accurate conntrack counts (at the expense of slightly > > more RAM used), we might consider that the conntrack counter does not take into > > account "about to be freed elements, waiting in RCU queues". We thus > > decrement it in nf_conntrack_free(), not in the RCU callback. > > Your patch fixes the problem on my board too (embedded MIPS router, > 250 MHz), thanks. > > Yet I'm concerned about what you said concerning RAM usage. I have a > very small amount of memory left on my board (less than 4M), and I tuned > the ip route cache size and nf_conntrack_max to make sure I won't go OOM. > > With your patch, does it mean 10,000 conntrack entries can be allocated > while nf_conntrack_max is, say, only 2048? Just add "rcuclassic.qhimark=2048" to your cmdline. Jocke ^ permalink raw reply [flat|nested] 44+ messages in thread
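The accounting change debated here is small enough to capture in a toy model (hypothetical names, nothing from the kernel): the conntrack counter used to drop only when the deferred RCU callback finally ran, so under a flood the table could report "full" while thousands of entries were merely waiting for a grace period. Decrementing at free time fixes the false "table full", and Joakim's rcuclassic.qhimark hint lowers the per-CPU queue length at which classic RCU starts forcing grace periods, bounding the deferred backlog in the first place.

#include <stdio.h>

#define CONNTRACK_MAX 2048

static int count;	/* checked against CONNTRACK_MAX on allocation */
static int deferred;	/* frees queued, waiting for an RCU grace period */

/* Old scheme: count drops only when the RCU callback runs. */
static void free_old(void) { deferred++; }

/* Patched scheme: count drops immediately in nf_conntrack_free();
 * only the memory itself keeps waiting for the grace period. */
static void free_new(void) { deferred++; count--; }

static const char *try_alloc(void)
{
	if (count >= CONNTRACK_MAX)
		return "table full, dropping packet";
	count++;
	return "ok";
}

int main(void)
{
	count = CONNTRACK_MAX; deferred = 0;	/* table is full */
	for (int i = 0; i < 100; i++)
		free_old();			/* 100 connections die */
	printf("old: alloc -> %s (count=%d)\n", try_alloc(), count);

	count = CONNTRACK_MAX; deferred = 0;
	for (int i = 0; i < 100; i++)
		free_new();
	printf("new: alloc -> %s (count=%d)\n", try_alloc(), count);
	return 0;
}

Maxime's concern maps onto the deferred counter: with the patch, up to qhimark freed-but-unreclaimed objects per CPU can coexist with nf_conntrack_max live ones, which is why lowering qhimark matters on a 4M board.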
* Re: ucc_geth: nf_conntrack: table full, dropping packet. 2009-03-23 17:42 ` Joakim Tjernlund 2009-03-23 17:49 ` Patrick McHardy @ 2009-03-23 17:49 ` Eric Dumazet 2009-03-23 18:04 ` Joakim Tjernlund 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-03-23 17:49 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: Patrick McHardy, avorontsov, netdev Joakim Tjernlund a écrit : > Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33: > >> Joakim Tjernlund wrote: >>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:15:45: >>>> Joakim Tjernlund wrote: >>>>> doing a "ping -f -l 3" on my host towards my board on linus tree as > of >>>>> Friday results in lots of: >>>>> nf_conntrack: table full, dropping packet. >>>>> nf_conntrack: table full, dropping packet. >>>>> nf_conntrack: table full, dropping packet. >>>>> __ratelimit: 11 callbacks suppressed >>>>> nf_conntrack: table full, dropping packet. >>>>> nf_conntrack: table full, dropping packet. >>>>> nf_conntrack: table full, dropping packet. >>>>> nf_conntrack: table full, dropping packet. >>>>> >>>>> for ucc_geth on a MPC832x. >>>>> This really looks strange to me, ideas? >>>> What does /proc/net/netfilter/nf_conntrack show? >>> There is no /proc/net/netfilter/nf_conntrack. There is a >>> /proc/net/nf_conntrack though and it is empty. If I telnet >>> to the board I see: >> That means that something is leaking conntrack references, most likely >> by leaking skbs. Since I haven't seen any other reports, my guess would >> be the ucc_geth driver. >> > > Mucking around with the ucc_geth driver I found that if I: > - Move TX from IRQ to NAPI context > - double the weight. > - after booting up, wait a few mins until the JFFS2 GC kernel thread has > stopped > scanning the FS > > Then the "nf_conntrack: table full, dropping packet." msgs stops. > Does this seem right to you guys? > How many cpus do you have ? What kernel version do you use ? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: ucc_geth: nf_conntrack: table full, dropping packet. 2009-03-23 17:49 ` ucc_geth: nf_conntrack: table full, dropping packet Eric Dumazet @ 2009-03-23 18:04 ` Joakim Tjernlund 2009-03-23 18:08 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Joakim Tjernlund @ 2009-03-23 18:04 UTC (permalink / raw) To: Eric Dumazet; +Cc: avorontsov, Patrick McHardy, netdev Eric Dumazet <dada1@cosmosbay.com> wrote on 23/03/2009 18:49:38: > > Joakim Tjernlund a écrit : > > Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33: > > > >> Joakim Tjernlund wrote: > >>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:15:45: > >>>> Joakim Tjernlund wrote: > >>>>> doing a "ping -f -l 3" on my host towards my board on linus tree as > > of > >>>>> Friday results in lots of: > >>>>> nf_conntrack: table full, dropping packet. > >>>>> nf_conntrack: table full, dropping packet. > >>>>> nf_conntrack: table full, dropping packet. > >>>>> __ratelimit: 11 callbacks suppressed > >>>>> nf_conntrack: table full, dropping packet. > >>>>> nf_conntrack: table full, dropping packet. > >>>>> nf_conntrack: table full, dropping packet. > >>>>> nf_conntrack: table full, dropping packet. > >>>>> > >>>>> for ucc_geth on a MPC832x. > >>>>> This really looks strange to me, ideas? > >>>> What does /proc/net/netfilter/nf_conntrack show? > >>> There is no /proc/net/netfilter/nf_conntrack. There is a > >>> /proc/net/nf_conntrack though and it is empty. If I telnet > >>> to the board I see: > >> That means that something is leaking conntrack references, most likely > >> by leaking skbs. Since I haven't seen any other reports, my guess would > >> be the ucc_geth driver. > >> > > > > Mucking around with the ucc_geth driver I found that if I: > > - Move TX from IRQ to NAPI context > > - double the weight. > > - after booting up, wait a few mins until the JFFS2 GC kernel thread has > > stopped > > scanning the FS > > > > Then the "nf_conntrack: table full, dropping packet." msgs stops. > > Does this seem right to you guys? > > > > How many cpus do you have ? Just one, it is an embedded board running at 266 MHz > > What kernel version do you use ? Linus tree as of Friday ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: ucc_geth: nf_conntrack: table full, dropping packet. 2009-03-23 18:04 ` Joakim Tjernlund @ 2009-03-23 18:08 ` Eric Dumazet 0 siblings, 0 replies; 44+ messages in thread From: Eric Dumazet @ 2009-03-23 18:08 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: avorontsov, Patrick McHardy, netdev Joakim Tjernlund a écrit : > Eric Dumazet <dada1@cosmosbay.com> wrote on 23/03/2009 18:49:38: > >> Joakim Tjernlund a écrit : >>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33: >>> >>>> Joakim Tjernlund wrote: >>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:15:45: >>>>>> Joakim Tjernlund wrote: >>>>>>> doing a "ping -f -l 3" on my host towards my board on linus tree > as >>> of >>>>>>> Friday results in lots of: >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> __ratelimit: 11 callbacks suppressed >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> nf_conntrack: table full, dropping packet. >>>>>>> >>>>>>> for ucc_geth on a MPC832x. >>>>>>> This really looks strange to me, ideas? >>>>>> What does /proc/net/netfilter/nf_conntrack show? >>>>> There is no /proc/net/netfilter/nf_conntrack. There is a >>>>> /proc/net/nf_conntrack though and it is empty. If I telnet >>>>> to the board I see: >>>> That means that something is leaking conntrack references, most > likely >>>> by leaking skbs. Since I haven't seen any other reports, my guess > would >>>> be the ucc_geth driver. >>>> >>> Mucking around with the ucc_geth driver I found that if I: >>> - Move TX from IRQ to NAPI context >>> - double the weight. >>> - after booting up, wait a few mins until the JFFS2 GC kernel thread > has >>> stopped >>> scanning the FS >>> >>> Then the "nf_conntrack: table full, dropping packet." msgs stops. >>> Does this seem right to you guys? >>> >> How many cpus do you have ? > > Just one, it is an embedded board running at 266 MHz > >> What kernel version do you use ? > > Linus tree as of Friday > > I suspect RCU problem. Maybe the GC kernel threads blocks synchronize_rcu() ? ^ permalink raw reply [flat|nested] 44+ messages in thread
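For anyone reproducing this, the mismatch being debugged (the counter pinned at its limit while /proc/net/nf_conntrack looks nearly empty) is visible from user space by comparing two sysctls. A minimal sketch, assuming the usual nf_conntrack sysctl paths are present on the running kernel:

#include <stdio.h>

static long read_long(const char *path)
{
	FILE *f = fopen(path, "r");
	long v = -1;

	if (f) {
		if (fscanf(f, "%ld", &v) != 1)
			v = -1;
		fclose(f);
	}
	return v;
}

int main(void)
{
	/* Paths assume nf_conntrack sysctls are enabled on this kernel. */
	long count = read_long("/proc/sys/net/netfilter/nf_conntrack_count");
	long max   = read_long("/proc/sys/net/netfilter/nf_conntrack_max");

	/* A count stuck near max alongside an almost-empty table means
	 * entries are pinned by leaked skb references or are parked in
	 * RCU queues, as discussed in this thread. */
	printf("conntrack: %ld of %ld entries accounted for\n", count, max);
	return 0;
}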
Thread overview: 44+ messages
2009-03-23 10:42 ucc_geth: nf_conntrack: table full, dropping packet Joakim Tjernlund
2009-03-23 12:15 ` Patrick McHardy
2009-03-23 12:25 ` Joakim Tjernlund
2009-03-23 12:29 ` Patrick McHardy
2009-03-23 12:59 ` Joakim Tjernlund
[not found] ` <OF387EC803.F810F72A-ONC1257582.00468C6E-C1257582.00475783@LocalDomain>
2009-03-23 13:09 ` Joakim Tjernlund
2009-03-23 17:42 ` Joakim Tjernlund
2009-03-23 17:49 ` Patrick McHardy
2009-03-24 8:22 ` Joakim Tjernlund
2009-03-24 9:12 ` Eric Dumazet
2009-03-24 10:55 ` Joakim Tjernlund
2009-03-24 12:07 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Eric Dumazet
2009-03-24 12:25 ` Eric Dumazet
2009-03-24 12:43 ` Patrick McHardy
2009-03-24 13:32 ` Eric Dumazet
2009-03-24 13:38 ` Patrick McHardy
2009-03-24 13:47 ` Eric Dumazet
[not found] ` <49C8F871.9070600@cosmosbay.com>
[not found] ` <49C8F8E0.9050502@trash.net>
2009-03-25 3:53 ` Eric Dumazet
2009-03-25 13:39 ` Patrick McHardy
2009-03-25 13:44 ` Eric Dumazet
2009-03-24 13:20 ` Joakim Tjernlund
2009-03-24 13:28 ` Patrick McHardy
2009-03-24 13:29 ` Eric Dumazet
2009-03-24 13:41 ` Joakim Tjernlund
2009-03-24 15:17 ` Maxime Bizon
2009-03-24 15:21 ` Patrick McHardy
2009-03-24 15:27 ` Eric Dumazet
2009-03-24 19:54 ` [PATCH] netfilter: Use hlist_add_head_rcu() in nf_conntrack_set_hashsize() Eric Dumazet
2009-03-25 16:26 ` Patrick McHardy
2009-03-25 17:53 ` [PATCH] conntrack: use SLAB_DESTROY_BY_RCU for nf_conn structs Eric Dumazet
2009-03-25 18:05 ` Patrick McHardy
2009-03-25 18:06 ` Patrick McHardy
2009-03-25 18:15 ` Eric Dumazet
2009-03-25 18:24 ` Patrick McHardy
2009-03-25 18:53 ` Eric Dumazet
2009-03-25 19:00 ` Patrick McHardy
2009-03-25 19:17 ` Eric Dumazet
2009-03-25 19:41 ` Patrick McHardy
2009-03-25 19:58 ` Eric Dumazet
2009-03-25 20:10 ` Patrick McHardy
2009-03-24 18:29 ` [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free() Joakim Tjernlund
2009-03-23 17:49 ` ucc_geth: nf_conntrack: table full, dropping packet Eric Dumazet
2009-03-23 18:04 ` Joakim Tjernlund
2009-03-23 18:08 ` Eric Dumazet